* feat: introduce scheduler lanes for independent per-workload scheduling
Split the single plugin scheduler loop into independent per-lane
goroutines so that volume management, iceberg compaction, and lifecycle
operations never block each other.
Each lane has its own:
- Goroutine (laneSchedulerLoop)
- Wake channel for immediate scheduling
- Admin lock scope (e.g. "plugin scheduler:default")
- Configurable idle sleep duration
- Loop state tracking
Three lanes are defined:
- default: vacuum, volume_balance, ec_balance, erasure_coding, admin_script
- iceberg: iceberg_maintenance
- lifecycle: s3_lifecycle (new, handler coming in a later commit)
Job types are mapped to lanes via a hardcoded map with LaneDefault as
the fallback. The SchedulerJobTypeState and SchedulerStatus types now
include a Lane field for API consumers.
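The job-type-to-lane mapping described above can be sketched as follows. The lane and job-type names come from this commit; the map literal and helper function are illustrative, not the exact source:

```go
package main

import "fmt"

type SchedulerLane string

const (
	LaneDefault   SchedulerLane = "default"
	LaneIceberg   SchedulerLane = "iceberg"
	LaneLifecycle SchedulerLane = "lifecycle"
)

// jobTypeLaneMap mirrors the hardcoded mapping described above;
// unknown job types fall back to LaneDefault.
var jobTypeLaneMap = map[string]SchedulerLane{
	"vacuum":              LaneDefault,
	"volume_balance":      LaneDefault,
	"ec_balance":          LaneDefault,
	"erasure_coding":      LaneDefault,
	"admin_script":        LaneDefault,
	"iceberg_maintenance": LaneIceberg,
	"s3_lifecycle":        LaneLifecycle,
}

func laneForJobType(jobType string) SchedulerLane {
	if lane, ok := jobTypeLaneMap[jobType]; ok {
		return lane
	}
	return LaneDefault
}

func main() {
	fmt.Println(laneForJobType("s3_lifecycle")) // lifecycle
	fmt.Println(laneForJobType("unknown_type")) // default
}
```

The explicit fallback is what lets later commits auto-derive a lane from a `?job=` query parameter without failing on unregistered job types.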
* feat: per-lane execution reservation pools for resource isolation
Each scheduler lane now maintains its own execution reservation map
so that a busy volume lane cannot consume execution slots needed by
iceberg or lifecycle lanes. The per-lane pool is used by default when
dispatching jobs through the lane scheduler; the global pool remains
as a fallback for the public DispatchProposals API.
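A per-lane reservation pool of this kind can be sketched as a capacity-bounded counter map; the type and field names below are illustrative stand-ins, not the actual SeaweedFS structures:

```go
package main

import (
	"fmt"
	"sync"
)

// lanePool tracks execution-slot reservations for one lane, so a busy
// lane cannot starve the others.
type lanePool struct {
	mu       sync.Mutex
	reserved map[string]int // workerID -> reserved slots
	capacity int            // max concurrent executions in this lane
}

// tryReserve claims one slot if the lane is under capacity.
func (p *lanePool) tryReserve(workerID string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	total := 0
	for _, n := range p.reserved {
		total += n
	}
	if total >= p.capacity {
		return false
	}
	p.reserved[workerID]++
	return true
}

// release returns one previously reserved slot.
func (p *lanePool) release(workerID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.reserved[workerID] > 0 {
		p.reserved[workerID]--
	}
}

func main() {
	pool := &lanePool{reserved: map[string]int{}, capacity: 2}
	fmt.Println(pool.tryReserve("w1")) // true
	fmt.Println(pool.tryReserve("w1")) // true
	fmt.Println(pool.tryReserve("w2")) // false: lane at capacity
	pool.release("w1")
	fmt.Println(pool.tryReserve("w2")) // true
}
```

Because each lane owns its own pool, exhausting the default lane's capacity leaves the iceberg and lifecycle pools untouched.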
* feat: add per-lane scheduler status API and lane worker UI pages
- GET /api/plugin/lanes returns all lanes with status and job types
- GET /api/plugin/workers?lane=X filters workers by lane
- GET /api/plugin/scheduler-states?lane=X filters job types by lane
- GET /api/plugin/scheduler-status?lane=X returns lane-scoped status
- GET /plugin/lanes/{lane}/workers renders per-lane worker page
- SchedulerJobTypeState now includes a "lane" field
The lane worker pages show scheduler status, job type configuration,
and connected workers scoped to a single lane, with links back to
the main plugin overview.
* feat: add s3_lifecycle worker handler for object store lifecycle management
Implements a full plugin worker handler for S3 lifecycle management,
assigned to the new "lifecycle" scheduler lane.
Detection phase:
- Reads filer.conf to find buckets with TTL lifecycle rules
- Creates one job proposal per bucket with active lifecycle rules
- Supports bucket_filter wildcard pattern from admin config
Execution phase:
- Walks the bucket directory tree breadth-first
- Identifies expired objects by checking Crtime + TtlSec < now
- Deletes expired objects in configurable batches
- Reports progress with scanned/expired/error counts
- Supports dry_run mode for safe testing
Configurable via admin UI:
- batch_size: entries per filer listing page (default 1000)
- max_deletes_per_bucket: safety cap per run (default 10000)
- dry_run: detect without deleting
- delete_marker_cleanup: clean expired delete markers
- abort_mpu_days: abort stale multipart uploads
The handler integrates with the existing PutBucketLifecycle flow which
sets TtlSec on entries via filer.conf path rules.
* feat: add per-lane submenu items under Workers sidebar menu
Replace the single "Workers" sidebar link with a collapsible submenu
containing three lane entries:
- Default (volume management + admin scripts) -> /plugin
- Iceberg (table compaction) -> /plugin/lanes/iceberg/workers
- Lifecycle (S3 object expiration) -> /plugin/lanes/lifecycle/workers
The submenu auto-expands when on any /plugin page and highlights the
active lane. Icons match each lane's job type descriptor (server,
snowflake, hourglass).
* feat: scope plugin pages to their scheduler lane
The plugin overview, configuration, detection, queue, and execution
pages now filter workers, job types, scheduler states, and scheduler
status to only show data for their lane.
- Plugin() templ function accepts a lane parameter (default: "default")
- JavaScript appends ?lane= to /api/plugin/workers, /job-types,
/scheduler-states, and /scheduler-status API calls
- GET /api/plugin/job-types now supports ?lane= filtering
- When ?job= is provided (e.g. ?job=iceberg_maintenance), the lane is
auto-derived from the job type so the page scopes correctly
This ensures /plugin shows only default-lane workers and
/plugin/configuration?job=iceberg_maintenance scopes to the iceberg lane.
* fix: remove "Lane" from lane worker page titles and capitalize properly
"lifecycle Lane Workers" -> "Lifecycle Workers"
"iceberg Lane Workers" -> "Iceberg Workers"
* refactor: promote lane items to top-level sidebar menu entries
Move Default, Iceberg, and Lifecycle from a collapsible submenu to
direct top-level items under the WORKERS heading. Removes the
intermediate "Workers" parent link and collapse toggle.
* admin: unify plugin lane routes and handlers
* admin: filter plugin jobs and activities by lane
* admin: reuse plugin UI for worker lane pages
* fix: use ServerAddress.ToGrpcAddress() for filer connections in lifecycle handler
ClusterContext addresses use ServerAddress format (host:httpPort.grpcPort).
Convert to the actual gRPC address via ToGrpcAddress() before dialing,
and add a Ping verification after connecting.
Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"
* fix: resolve ServerAddress gRPC port in iceberg and lifecycle filer connections
ClusterContext addresses use ServerAddress format (host:httpPort.grpcPort).
Both the iceberg and lifecycle handlers now detect the compound format
and extract the gRPC port via ToGrpcAddress() before dialing. Plain
host:port addresses (e.g. from tests) are passed through unchanged.
Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"
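The compound-address handling described in these two fixes can be sketched as follows. This is an illustration of the behavior, not the actual `ServerAddress.ToGrpcAddress()` implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toGrpcAddress converts the compound host:httpPort.grpcPort form to
// host:grpcPort, and passes plain host:port addresses through unchanged
// (e.g. addresses supplied directly by tests).
func toGrpcAddress(addr string) string {
	host, portPart, ok := strings.Cut(addr, ":")
	if !ok {
		return addr // no port at all; leave untouched
	}
	_, grpcPort, compound := strings.Cut(portPart, ".")
	if !compound {
		return addr // plain host:port
	}
	if _, err := strconv.Atoi(grpcPort); err != nil {
		return addr // not a numeric gRPC port; leave untouched
	}
	return host + ":" + grpcPort
}

func main() {
	fmt.Println(toGrpcAddress("filer1:8888.18888")) // filer1:18888
	fmt.Println(toGrpcAddress("filer1:18888"))      // filer1:18888
}
```

Dialing the compound form directly is what produced the "lookup tcp/8888.18888: unknown port" error, since the dialer treats everything after the colon as a single port name.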
* align URL
* Potential fix for code scanning alert no. 335: Incorrect conversion between integer types
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* fix: address PR review findings across scheduler lanes and lifecycle handler
- Fix variable shadowing: rename loop var `w` to `worker` in
GetPluginWorkersAPI to avoid shadowing the http.ResponseWriter param
- Fix stale GetSchedulerStatus: aggregate loop states across all lanes
instead of reading never-updated legacy schedulerLoopState
- Scope InProcessJobs to lane in GetLaneSchedulerStatus
- Fix AbortMPUDays=0 treated as unset: change <= 0 to < 0 so 0 disables
- Propagate listing errors in lifecycle bucket walk instead of swallowing
- Implement DeleteMarkerCleanup: scan for S3 delete marker entries and
remove them
- Implement AbortMPUDays: scan .uploads directory and remove stale
multipart uploads older than the configured threshold
- Fix success determination: mark job failed when result.errors > 0
even if no fatal error occurred
- Add regression test for jobTypeLaneMap to catch drift from handler
registrations
* fix: guard against nil result in lifecycle completion and trim filer addresses
- Guard result dereference in completion summary: use local vars
defaulting to 0 when result is nil to prevent panic
- Append trimmed filer addresses instead of originals so whitespace
is not passed to the gRPC dialer
* fix: propagate ctx cancellation from deleteExpiredObjects and add config logging
- deleteExpiredObjects now returns a third error value when the context
is canceled mid-batch; the caller stops processing further batches
and returns the cancellation error to the job completion handler
- readBoolConfig and readInt64Config now log unexpected ConfigValue
types at V(1) for debugging, consistent with readStringConfig
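The hedged bool parsing with logging on unexpected input can be sketched as below; the signature and the use of plain printing in place of glog's `V(1)` are illustrative, not the actual handler code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// readBoolConfig uses typed bool values directly, parses string values,
// and logs (here: prints) anything unrecognized before falling back to
// the default, mirroring the behavior described above.
func readBoolConfig(value interface{}, fallback bool) bool {
	switch v := value.(type) {
	case bool:
		return v
	case string:
		if b, err := strconv.ParseBool(strings.TrimSpace(v)); err == nil {
			return b
		}
		fmt.Printf("unrecognized bool string %q, using default %v\n", v, fallback)
		return fallback
	default:
		fmt.Printf("unexpected config value type %T, using default %v\n", value, fallback)
		return fallback
	}
}

func main() {
	fmt.Println(readBoolConfig(true, false))  // true
	fmt.Println(readBoolConfig("TRUE", false)) // true (parsed)
	fmt.Println(readBoolConfig("yes", false)) // logs, then false (fallback)
}
```

Note that `strconv.ParseBool` accepts only `1/t/T/TRUE/true/True` and `0/f/F/FALSE/false/False`, so values like "yes" hit the logged fallback path.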
* fix: propagate errors in lifecycle cleanup helpers and use correct delete marker key
- cleanupDeleteMarkers: return error on ctx cancellation and SeaweedList
failures instead of silently continuing
- abortIncompleteMPUs: log SeaweedList errors instead of discarding
- isDeleteMarker: use ExtDeleteMarkerKey ("Seaweed-X-Amz-Delete-Marker")
instead of ExtLatestVersionIsDeleteMarker which is for the parent entry
- batchSize cap: use math.MaxInt instead of math.MaxInt32
* fix: propagate ctx cancellation from abortIncompleteMPUs and log unrecognized bool strings
- abortIncompleteMPUs now returns (aborted, errors, ctxErr) matching
cleanupDeleteMarkers; caller stops on cancellation or listing failure
- readBoolConfig logs unrecognized string values before falling back
* fix: shared per-bucket budget across lifecycle phases and allow cleanup without expired objects
- Thread a shared remaining counter through TTL deletion, delete marker
cleanup, and MPU abort so the total operations per bucket never exceed
MaxDeletesPerBucket
- Remove early return when no TTL-expired objects found so delete marker
cleanup and MPU abort still run
- Add NOTE on cleanupDeleteMarkers about version-safety limitation
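Threading one shared budget through the three phases can be sketched as a single counter decremented in order; the phase names and helper are illustrative, matching the description above rather than the actual source:

```go
package main

import "fmt"

// runPhase performs up to `wanted` operations, capped by the shared
// remaining budget, and decrements the budget by what was actually done.
func runPhase(name string, wanted int, remaining *int) int {
	done := wanted
	if done > *remaining {
		done = *remaining
	}
	*remaining -= done
	fmt.Printf("%s: performed %d, budget left %d\n", name, done, *remaining)
	return done
}

func main() {
	remaining := 10 // MaxDeletesPerBucket for this run
	runPhase("ttl_delete", 7, &remaining)
	runPhase("delete_marker_cleanup", 5, &remaining) // capped at 3
	runPhase("abort_mpu", 2, &remaining)             // budget exhausted: 0
}
```

Because all three phases draw from the same counter, the per-bucket total can never exceed the configured cap, while the later phases still run even when TTL deletion found nothing.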
---------
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
package plugin
|
|
|
|
import (
|
|
"context"
|
|
"crypto/rand"
|
|
"encoding/hex"
|
|
"errors"
|
|
"fmt"
|
|
"io"
|
|
"sort"
|
|
"strings"
|
|
"sync"
|
|
"time"
|
|
|
|
"github.com/seaweedfs/seaweedfs/weed/glog"
|
|
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
|
|
"google.golang.org/protobuf/proto"
|
|
"google.golang.org/protobuf/types/known/timestamppb"
|
|
)
|
|
|
|
const (
|
|
defaultOutgoingBuffer = 128
|
|
defaultSendTimeout = 5 * time.Second
|
|
defaultHeartbeatInterval = 30
|
|
defaultReconnectDelay = 5
|
|
defaultPendingSchemaBuffer = 1
|
|
adminScriptJobType = "admin_script"
|
|
)
|
|
|
|
type Options struct {
|
|
DataDir string
|
|
OutgoingBufferSize int
|
|
SendTimeout time.Duration
|
|
SchedulerTick time.Duration
|
|
ClusterContextProvider func(context.Context) (*plugin_pb.ClusterContext, error)
|
|
LockManager LockManager
|
|
// ConfigDefaultsProvider is an optional callback invoked when a job type's
|
|
// config is being bootstrapped from its descriptor defaults. It can enrich
|
|
// or replace the default config before it is persisted. If nil, descriptor
|
|
// defaults are used as-is.
|
|
ConfigDefaultsProvider func(config *plugin_pb.PersistedJobTypeConfig) *plugin_pb.PersistedJobTypeConfig
|
|
}
|
|
|
|
// JobTypeInfo contains metadata about a plugin job type.
|
|
type JobTypeInfo struct {
|
|
JobType string `json:"job_type"`
|
|
DisplayName string `json:"display_name"`
|
|
Weight int32 `json:"weight"`
|
|
}
|
|
|
|
type Plugin struct {
|
|
plugin_pb.UnimplementedPluginControlServiceServer
|
|
|
|
store *ConfigStore
|
|
registry *Registry
|
|
|
|
outgoingBuffer int
|
|
sendTimeout time.Duration
|
|
|
|
schedulerTick time.Duration
|
|
clusterContextProvider func(context.Context) (*plugin_pb.ClusterContext, error)
|
|
configDefaultsProvider func(config *plugin_pb.PersistedJobTypeConfig) *plugin_pb.PersistedJobTypeConfig
|
|
lockManager LockManager
|
|
|
|
schedulerMu sync.Mutex
|
|
nextDetectionAt map[string]time.Time
|
|
detectionInFlight map[string]bool
|
|
|
|
detectorLeaseMu sync.Mutex
|
|
detectorLeases map[string]string
|
|
|
|
lanes map[SchedulerLane]*schedulerLaneState
|
|
|
|
schedulerExecMu sync.Mutex
|
|
schedulerExecReservations map[string]int
|
|
adminScriptRunMu sync.RWMutex
|
|
schedulerDetectionMu sync.Mutex
|
|
schedulerDetection map[string]*schedulerDetectionInfo
|
|
schedulerRunMu sync.Mutex
|
|
schedulerRun map[string]*schedulerRunInfo
|
|
schedulerLoopMu sync.Mutex
|
|
schedulerLoopState schedulerLoopState
|
|
schedulerWakeCh chan struct{}
|
|
|
|
dedupeMu sync.Mutex
|
|
recentDedupeByType map[string]map[string]time.Time
|
|
|
|
sessionsMu sync.RWMutex
|
|
sessions map[string]*streamSession
|
|
|
|
pendingSchemaMu sync.Mutex
|
|
pendingSchema map[string]chan *plugin_pb.ConfigSchemaResponse
|
|
|
|
pendingDetectionMu sync.Mutex
|
|
pendingDetection map[string]*pendingDetectionState
|
|
|
|
pendingExecutionMu sync.Mutex
|
|
pendingExecution map[string]chan *plugin_pb.JobCompleted
|
|
|
|
jobsMu sync.RWMutex
|
|
jobs map[string]*TrackedJob
|
|
// serialize stale job cleanup to avoid duplicate expirations
|
|
staleJobsMu sync.Mutex
|
|
|
|
jobDetailsMu sync.Mutex
|
|
|
|
activitiesMu sync.RWMutex
|
|
activities []JobActivity
|
|
|
|
dirtyJobs bool
|
|
dirtyActivities bool
|
|
persistTicker *time.Ticker
|
|
|
|
ctx context.Context
|
|
ctxCancel context.CancelFunc
|
|
|
|
shutdownCh chan struct{}
|
|
shutdownOnce sync.Once
|
|
wg sync.WaitGroup
|
|
}
|
|
|
|
type streamSession struct {
|
|
workerID string
|
|
outgoing chan *plugin_pb.AdminToWorkerMessage
|
|
closeOnce sync.Once
|
|
}
|
|
|
|
type pendingDetectionState struct {
|
|
proposals []*plugin_pb.JobProposal
|
|
complete chan *plugin_pb.DetectionComplete
|
|
jobType string
|
|
workerID string
|
|
}
|
|
|
|
// DetectionReport captures one detection run including request metadata.
|
|
type DetectionReport struct {
|
|
RequestID string
|
|
JobType string
|
|
WorkerID string
|
|
Proposals []*plugin_pb.JobProposal
|
|
Complete *plugin_pb.DetectionComplete
|
|
}
|
|
|
|
func New(options Options) (*Plugin, error) {
|
|
store, err := NewConfigStore(options.DataDir)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
bufferSize := options.OutgoingBufferSize
|
|
if bufferSize <= 0 {
|
|
bufferSize = defaultOutgoingBuffer
|
|
}
|
|
sendTimeout := options.SendTimeout
|
|
if sendTimeout <= 0 {
|
|
sendTimeout = defaultSendTimeout
|
|
}
|
|
schedulerTick := options.SchedulerTick
|
|
if schedulerTick <= 0 {
|
|
schedulerTick = defaultSchedulerTick
|
|
}
|
|
|
|
lanes := make(map[SchedulerLane]*schedulerLaneState, len(AllLanes()))
|
|
for _, lane := range AllLanes() {
|
|
lanes[lane] = newLaneState(lane)
|
|
}
|
|
|
|
plugin := &Plugin{
|
|
store: store,
|
|
registry: NewRegistry(),
|
|
outgoingBuffer: bufferSize,
|
|
sendTimeout: sendTimeout,
|
|
schedulerTick: schedulerTick,
|
|
clusterContextProvider: options.ClusterContextProvider,
|
|
configDefaultsProvider: options.ConfigDefaultsProvider,
|
|
lockManager: options.LockManager,
|
|
lanes: lanes,
|
|
sessions: make(map[string]*streamSession),
|
|
pendingSchema: make(map[string]chan *plugin_pb.ConfigSchemaResponse),
|
|
pendingDetection: make(map[string]*pendingDetectionState),
|
|
pendingExecution: make(map[string]chan *plugin_pb.JobCompleted),
|
|
nextDetectionAt: make(map[string]time.Time),
|
|
detectionInFlight: make(map[string]bool),
|
|
detectorLeases: make(map[string]string),
|
|
schedulerExecReservations: make(map[string]int),
|
|
schedulerDetection: make(map[string]*schedulerDetectionInfo),
|
|
schedulerRun: make(map[string]*schedulerRunInfo),
|
|
recentDedupeByType: make(map[string]map[string]time.Time),
|
|
jobs: make(map[string]*TrackedJob),
|
|
activities: make([]JobActivity, 0, 256),
|
|
persistTicker: time.NewTicker(2 * time.Second),
|
|
schedulerWakeCh: make(chan struct{}, 1),
|
|
shutdownCh: make(chan struct{}),
|
|
}
|
|
plugin.ctx, plugin.ctxCancel = context.WithCancel(context.Background())
|
|
|
|
if err := plugin.loadPersistedMonitorState(); err != nil {
|
|
glog.Warningf("Plugin failed to load persisted monitoring state: %v", err)
|
|
}
|
|
|
|
if plugin.clusterContextProvider != nil {
|
|
for _, ls := range plugin.lanes {
|
|
plugin.wg.Add(1)
|
|
go plugin.laneSchedulerLoop(ls)
|
|
}
|
|
}
|
|
plugin.wg.Add(1)
|
|
go plugin.persistenceLoop()
|
|
|
|
return plugin, nil
|
|
}
|
|
|
|
func (r *Plugin) Shutdown() {
|
|
if r.ctxCancel != nil {
|
|
r.ctxCancel()
|
|
}
|
|
if r.persistTicker != nil {
|
|
r.persistTicker.Stop()
|
|
}
|
|
|
|
r.shutdownOnce.Do(func() { close(r.shutdownCh) })
|
|
|
|
r.sessionsMu.Lock()
|
|
for workerID, session := range r.sessions {
|
|
session.close()
|
|
delete(r.sessions, workerID)
|
|
}
|
|
r.sessionsMu.Unlock()
|
|
|
|
r.pendingSchemaMu.Lock()
|
|
for requestID, ch := range r.pendingSchema {
|
|
close(ch)
|
|
delete(r.pendingSchema, requestID)
|
|
}
|
|
r.pendingSchemaMu.Unlock()
|
|
|
|
r.pendingDetectionMu.Lock()
|
|
for requestID, state := range r.pendingDetection {
|
|
close(state.complete)
|
|
delete(r.pendingDetection, requestID)
|
|
}
|
|
r.pendingDetectionMu.Unlock()
|
|
|
|
r.pendingExecutionMu.Lock()
|
|
for requestID, ch := range r.pendingExecution {
|
|
close(ch)
|
|
delete(r.pendingExecution, requestID)
|
|
}
|
|
r.pendingExecutionMu.Unlock()
|
|
|
|
r.wg.Wait()
|
|
}
|
|
|
|
func (r *Plugin) WorkerStream(stream plugin_pb.PluginControlService_WorkerStreamServer) error {
|
|
first, err := stream.Recv()
|
|
if err != nil {
|
|
if errors.Is(err, io.EOF) {
|
|
return nil
|
|
}
|
|
return fmt.Errorf("receive worker hello: %w", err)
|
|
}
|
|
|
|
hello := first.GetHello()
|
|
if hello == nil {
|
|
return fmt.Errorf("first message must be hello")
|
|
}
|
|
if strings.TrimSpace(hello.WorkerId) == "" {
|
|
return fmt.Errorf("worker_id is required")
|
|
}
|
|
|
|
workerID := hello.WorkerId
|
|
r.registry.UpsertFromHello(hello)
|
|
|
|
session := &streamSession{
|
|
workerID: workerID,
|
|
outgoing: make(chan *plugin_pb.AdminToWorkerMessage, r.outgoingBuffer),
|
|
}
|
|
r.putSession(session)
|
|
defer r.cleanupSession(workerID)
|
|
|
|
glog.V(0).Infof("Plugin worker connected: %s (%s)", workerID, hello.Address)
|
|
|
|
sendErrCh := make(chan error, 1)
|
|
r.wg.Add(1)
|
|
go func() {
|
|
defer r.wg.Done()
|
|
sendErrCh <- r.sendLoop(stream.Context(), stream, session)
|
|
}()
|
|
|
|
if err := r.sendAdminHello(workerID); err != nil {
|
|
glog.Warningf("failed to send plugin admin hello to %s: %v", workerID, err)
|
|
}
|
|
go r.prefetchDescriptorsFromHello(hello)
|
|
|
|
recvErrCh := make(chan error, 1)
|
|
r.wg.Add(1)
|
|
go func() {
|
|
defer r.wg.Done()
|
|
for {
|
|
message, recvErr := stream.Recv()
|
|
if recvErr != nil {
|
|
recvErrCh <- recvErr
|
|
return
|
|
}
|
|
r.handleWorkerMessage(workerID, message)
|
|
}
|
|
}()
|
|
|
|
for {
|
|
select {
|
|
case <-r.shutdownCh:
|
|
return nil
|
|
case err := <-sendErrCh:
|
|
if err != nil && !errors.Is(err, context.Canceled) {
|
|
return err
|
|
}
|
|
return nil
|
|
case recvErr := <-recvErrCh:
|
|
if errors.Is(recvErr, io.EOF) || errors.Is(recvErr, context.Canceled) {
|
|
return nil
|
|
}
|
|
return fmt.Errorf("receive plugin message from %s: %w", workerID, recvErr)
|
|
}
|
|
}
|
|
}
|
|
|
|
func (r *Plugin) RequestConfigSchema(ctx context.Context, jobType string, forceRefresh bool) (*plugin_pb.JobTypeDescriptor, error) {
|
|
if !forceRefresh {
|
|
descriptor, err := r.store.LoadDescriptor(jobType)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
if descriptor != nil {
|
|
return descriptor, nil
|
|
}
|
|
}
|
|
|
|
provider, err := r.registry.PickSchemaProvider(jobType)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
requestID, err := newRequestID("schema")
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
responseCh := make(chan *plugin_pb.ConfigSchemaResponse, defaultPendingSchemaBuffer)
|
|
r.pendingSchemaMu.Lock()
|
|
r.pendingSchema[requestID] = responseCh
|
|
r.pendingSchemaMu.Unlock()
|
|
defer func() {
|
|
r.pendingSchemaMu.Lock()
|
|
delete(r.pendingSchema, requestID)
|
|
r.pendingSchemaMu.Unlock()
|
|
}()
|
|
|
|
requestMessage := &plugin_pb.AdminToWorkerMessage{
|
|
RequestId: requestID,
|
|
SentAt: timestamppb.Now(),
|
|
Body: &plugin_pb.AdminToWorkerMessage_RequestConfigSchema{
|
|
RequestConfigSchema: &plugin_pb.RequestConfigSchema{
|
|
JobType: jobType,
|
|
ForceRefresh: forceRefresh,
|
|
},
|
|
},
|
|
}
|
|
|
|
if err := r.sendToWorker(provider.WorkerID, requestMessage); err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
select {
|
|
case <-ctx.Done():
|
|
return nil, ctx.Err()
|
|
case response, ok := <-responseCh:
|
|
if !ok {
|
|
return nil, fmt.Errorf("schema request %s interrupted", requestID)
|
|
}
|
|
if response == nil {
|
|
return nil, fmt.Errorf("schema request %s returned empty response", requestID)
|
|
}
|
|
if !response.Success {
|
|
return nil, fmt.Errorf("schema request failed for %s: %s", jobType, response.ErrorMessage)
|
|
}
|
|
if response.GetJobTypeDescriptor() == nil {
|
|
return nil, fmt.Errorf("schema request for %s returned no descriptor", jobType)
|
|
}
|
|
return response.GetJobTypeDescriptor(), nil
|
|
}
|
|
}
|
|
|
|
func (r *Plugin) LoadJobTypeConfig(jobType string) (*plugin_pb.PersistedJobTypeConfig, error) {
|
|
return r.store.LoadJobTypeConfig(jobType)
|
|
}
|
|
|
|
func (r *Plugin) SaveJobTypeConfig(config *plugin_pb.PersistedJobTypeConfig) error {
|
|
if err := r.store.SaveJobTypeConfig(config); err != nil {
|
|
return err
|
|
}
|
|
r.wakeScheduler()
|
|
return nil
|
|
}
|
|
|
|
func (r *Plugin) LoadDescriptor(jobType string) (*plugin_pb.JobTypeDescriptor, error) {
|
|
return r.store.LoadDescriptor(jobType)
|
|
}
|
|
|
|
func (r *Plugin) LoadRunHistory(jobType string) (*JobTypeRunHistory, error) {
|
|
return r.store.LoadRunHistory(jobType)
|
|
}
|
|
|
|
func (r *Plugin) IsConfigured() bool {
|
|
return r.store.IsConfigured()
|
|
}
|
|
|
|
func (r *Plugin) BaseDir() string {
|
|
return r.store.BaseDir()
|
|
}
|
|
|
|
func (r *Plugin) acquireAdminLock(reason string) (func(), error) {
|
|
if r == nil || r.lockManager == nil {
|
|
return func() {}, nil
|
|
}
|
|
return r.lockManager.Acquire(reason)
|
|
}
|
|
|
|
// RunDetectionWithReport requests one detector worker and returns proposals with request metadata.
|
|
func (r *Plugin) RunDetectionWithReport(
|
|
ctx context.Context,
|
|
jobType string,
|
|
clusterContext *plugin_pb.ClusterContext,
|
|
maxResults int32,
|
|
) (*DetectionReport, error) {
|
|
releaseGate := r.acquireDetectionExecutionGate(jobType, false)
|
|
defer releaseGate()
|
|
|
|
detector, err := r.pickDetector(jobType)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
requestID, err := newRequestID("detect")
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
adminRuntime, adminConfigValues, workerConfigValues, err := r.loadJobTypeConfigPayload(jobType)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
lastCompletedRun := r.loadLastSuccessfulRun(jobType)
|
|
if strings.EqualFold(strings.TrimSpace(jobType), adminScriptJobType) {
|
|
lastCompletedRun = r.loadLastCompletedRun(jobType)
|
|
}
|
|
|
|
state := &pendingDetectionState{
|
|
complete: make(chan *plugin_pb.DetectionComplete, 1),
|
|
jobType: jobType,
|
|
workerID: detector.WorkerID,
|
|
}
|
|
r.pendingDetectionMu.Lock()
|
|
r.pendingDetection[requestID] = state
|
|
r.pendingDetectionMu.Unlock()
|
|
defer func() {
|
|
r.pendingDetectionMu.Lock()
|
|
delete(r.pendingDetection, requestID)
|
|
r.pendingDetectionMu.Unlock()
|
|
}()
|
|
|
|
r.appendActivity(JobActivity{
|
|
JobType: jobType,
|
|
RequestID: requestID,
|
|
WorkerID: detector.WorkerID,
|
|
Source: "detector",
|
|
Stage: "requested",
|
|
Message: "detection requested",
|
|
OccurredAt: timeToPtr(time.Now().UTC()),
|
|
Details: map[string]interface{}{
|
|
"max_results": maxResults,
|
|
},
|
|
})
|
|
|
|
message := &plugin_pb.AdminToWorkerMessage{
|
|
RequestId: requestID,
|
|
SentAt: timestamppb.Now(),
|
|
Body: &plugin_pb.AdminToWorkerMessage_RunDetectionRequest{
|
|
RunDetectionRequest: &plugin_pb.RunDetectionRequest{
|
|
RequestId: requestID,
|
|
JobType: jobType,
|
|
DetectionSequence: time.Now().UnixNano(),
|
|
AdminRuntime: adminRuntime,
|
|
AdminConfigValues: adminConfigValues,
|
|
WorkerConfigValues: workerConfigValues,
|
|
ClusterContext: clusterContext,
|
|
LastSuccessfulRun: lastCompletedRun,
|
|
MaxResults: maxResults,
|
|
},
|
|
},
|
|
}
|
|
|
|
if err := r.sendToWorker(detector.WorkerID, message); err != nil {
|
|
r.clearDetectorLease(jobType, detector.WorkerID)
|
|
r.appendActivity(JobActivity{
|
|
JobType: jobType,
|
|
RequestID: requestID,
|
|
WorkerID: detector.WorkerID,
|
|
Source: "detector",
|
|
Stage: "failed_to_send",
|
|
Message: err.Error(),
|
|
OccurredAt: timeToPtr(time.Now().UTC()),
|
|
})
|
|
return nil, err
|
|
}
|
|
|
|
select {
|
|
case <-ctx.Done():
|
|
r.sendCancel(detector.WorkerID, requestID, plugin_pb.WorkKind_WORK_KIND_DETECTION, ctx.Err())
|
|
r.appendActivity(JobActivity{
|
|
JobType: jobType,
|
|
RequestID: requestID,
|
|
WorkerID: detector.WorkerID,
|
|
Source: "detector",
|
|
Stage: "canceled",
|
|
Message: "detection canceled",
|
|
OccurredAt: timeToPtr(time.Now().UTC()),
|
|
})
|
|
return &DetectionReport{
|
|
RequestID: requestID,
|
|
JobType: jobType,
|
|
WorkerID: detector.WorkerID,
|
|
}, ctx.Err()
|
|
case complete, ok := <-state.complete:
|
|
if !ok {
|
|
return &DetectionReport{
|
|
RequestID: requestID,
|
|
JobType: jobType,
|
|
WorkerID: detector.WorkerID,
|
|
}, fmt.Errorf("detection request %s interrupted", requestID)
|
|
}
|
|
proposals := cloneJobProposals(state.proposals)
|
|
report := &DetectionReport{
|
|
RequestID: requestID,
|
|
JobType: jobType,
|
|
WorkerID: detector.WorkerID,
|
|
Proposals: proposals,
|
|
Complete: complete,
|
|
}
|
|
if complete == nil {
|
|
return report, fmt.Errorf("detection request %s returned no completion state", requestID)
|
|
}
|
|
if !complete.Success {
|
|
return report, fmt.Errorf("detection failed for %s: %s", jobType, complete.ErrorMessage)
|
|
}
|
|
return report, nil
|
|
}
|
|
}
|
|
|
|
// RunDetection requests one detector worker to produce job proposals for a job type.
|
|
func (r *Plugin) RunDetection(
|
|
ctx context.Context,
|
|
jobType string,
|
|
clusterContext *plugin_pb.ClusterContext,
|
|
maxResults int32,
|
|
) ([]*plugin_pb.JobProposal, error) {
|
|
report, err := r.RunDetectionWithReport(ctx, jobType, clusterContext, maxResults)
|
|
if report == nil {
|
|
return nil, err
|
|
}
|
|
return report.Proposals, err
|
|
}
|
|
|
|
// ExecuteJob sends one job to a capable executor worker and waits for completion.
|
|
func (r *Plugin) ExecuteJob(
|
|
ctx context.Context,
|
|
job *plugin_pb.JobSpec,
|
|
clusterContext *plugin_pb.ClusterContext,
|
|
attempt int32,
|
|
) (*plugin_pb.JobCompleted, error) {
|
|
if job == nil {
|
|
return nil, fmt.Errorf("job is nil")
|
|
}
|
|
jobType := strings.TrimSpace(job.JobType)
|
|
if jobType == "" {
|
|
return nil, fmt.Errorf("job_type is required")
|
|
}
|
|
releaseGate := r.acquireDetectionExecutionGate(jobType, true)
|
|
defer releaseGate()
|
|
|
|
executor, err := r.registry.PickExecutor(jobType)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
return r.executeJobWithExecutor(ctx, executor, job, clusterContext, attempt)
|
|
}
|
|
|
|
func (r *Plugin) acquireDetectionExecutionGate(jobType string, execution bool) func() {
|
|
normalizedJobType := strings.ToLower(strings.TrimSpace(jobType))
|
|
if execution && normalizedJobType == adminScriptJobType {
|
|
r.adminScriptRunMu.Lock()
|
|
return func() {
|
|
r.adminScriptRunMu.Unlock()
|
|
}
|
|
}
|
|
if normalizedJobType != adminScriptJobType {
|
|
r.adminScriptRunMu.RLock()
|
|
return func() {
|
|
r.adminScriptRunMu.RUnlock()
|
|
}
|
|
}
|
|
return func() {}
|
|
}
|
|
|
|
func (r *Plugin) executeJobWithExecutor(
|
|
ctx context.Context,
|
|
executor *WorkerSession,
|
|
job *plugin_pb.JobSpec,
|
|
clusterContext *plugin_pb.ClusterContext,
|
|
attempt int32,
|
|
) (*plugin_pb.JobCompleted, error) {
|
|
if executor == nil {
|
|
return nil, fmt.Errorf("executor is nil")
|
|
}
|
|
if job == nil {
|
|
return nil, fmt.Errorf("job is nil")
|
|
}
|
|
if strings.TrimSpace(job.JobType) == "" {
|
|
return nil, fmt.Errorf("job_type is required")
|
|
}
|
|
|
|
if strings.TrimSpace(job.JobId) == "" {
|
|
var err error
|
|
job.JobId, err = newRequestID("job")
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
}
|
|
|
|
requestID, err := newRequestID("exec")
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
adminRuntime, adminConfigValues, workerConfigValues, err := r.loadJobTypeConfigPayload(job.JobType)
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
completedCh := make(chan *plugin_pb.JobCompleted, 1)
|
|
r.pendingExecutionMu.Lock()
|
|
r.pendingExecution[requestID] = completedCh
|
|
r.pendingExecutionMu.Unlock()
|
|
defer func() {
|
|
r.pendingExecutionMu.Lock()
|
|
delete(r.pendingExecution, requestID)
|
|
r.pendingExecutionMu.Unlock()
|
|
}()
|
|
|
|
r.trackExecutionStart(requestID, executor.WorkerID, job, attempt)
|
|
|
|
message := &plugin_pb.AdminToWorkerMessage{
|
|
RequestId: requestID,
|
|
SentAt: timestamppb.Now(),
|
|
Body: &plugin_pb.AdminToWorkerMessage_ExecuteJobRequest{
|
|
ExecuteJobRequest: &plugin_pb.ExecuteJobRequest{
|
|
RequestId: requestID,
|
|
Job: job,
|
|
AdminRuntime: adminRuntime,
|
|
AdminConfigValues: adminConfigValues,
|
|
WorkerConfigValues: workerConfigValues,
|
|
ClusterContext: clusterContext,
|
|
Attempt: attempt,
|
|
},
|
|
},
|
|
}
|
|
|
|
if err := r.sendToWorker(executor.WorkerID, message); err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
select {
|
|
case <-ctx.Done():
|
|
r.sendCancel(executor.WorkerID, requestID, plugin_pb.WorkKind_WORK_KIND_EXECUTION, ctx.Err())
|
|
return nil, ctx.Err()
|
|
case completed, ok := <-completedCh:
|
|
if !ok {
|
|
return nil, fmt.Errorf("execution request %s interrupted", requestID)
|
|
}
|
|
if completed == nil {
|
|
return nil, fmt.Errorf("execution request %s returned empty completion", requestID)
|
|
}
|
|
if !completed.Success {
|
|
return completed, fmt.Errorf("job %s failed: %s", job.JobId, completed.ErrorMessage)
|
|
}
|
|
return completed, nil
|
|
}
|
|
}
|
|
|
|
// HasCapableWorker checks if any non-stale worker has a capability for the given job type.
|
|
func (r *Plugin) HasCapableWorker(jobType string) bool {
|
|
return r.registry.HasCapableWorker(jobType)
|
|
}
|
|
|
|
func (r *Plugin) ListWorkers() []*WorkerSession {
|
|
return r.registry.List()
|
|
}
|
|
|
|
func (r *Plugin) ListKnownJobTypes() ([]JobTypeInfo, error) {
|
|
registryJobTypes := r.registry.JobTypes()
|
|
storedJobTypes, err := r.store.ListJobTypes()
|
|
if err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
jobTypeSet := make(map[string]struct{}, len(registryJobTypes)+len(storedJobTypes))
|
|
for _, jobType := range registryJobTypes {
|
|
jobTypeSet[jobType] = struct{}{}
|
|
}
|
|
for _, jobType := range storedJobTypes {
|
|
jobTypeSet[jobType] = struct{}{}
|
|
}
|
|
|
|
jobTypeList := make([]string, 0, len(jobTypeSet))
|
|
for jobType := range jobTypeSet {
|
|
jobTypeList = append(jobTypeList, jobType)
|
|
}
|
|
sort.Strings(jobTypeList)
|
|
|
|
result := make([]JobTypeInfo, 0, len(jobTypeList))
|
|
workers := r.registry.List()
|
|
|
|
// Pre-calculate the best capability for each job type from available workers.
|
|
// Prefer capabilities with non-empty DisplayName, then higher Weight.
|
|
jobTypeToCap := make(map[string]*plugin_pb.JobTypeCapability)
|
|
for _, worker := range workers {
|
|
for jobType, cap := range worker.Capabilities {
|
|
if cap == nil {
|
|
continue
|
|
}
|
|
existing, exists := jobTypeToCap[jobType]
|
|
if !exists || existing == nil {
|
|
jobTypeToCap[jobType] = cap
|
|
continue
|
|
}
|
|
// Preserve existing if it has DisplayName but cap doesn't.
|
|
if existing.DisplayName != "" && cap.DisplayName == "" {
|
|
continue
|
|
}
|
|
// Prefer capabilities with a non-empty DisplayName.
|
|
if existing.DisplayName == "" && cap.DisplayName != "" {
|
|
jobTypeToCap[jobType] = cap
|
|
continue
|
|
}
|
|
// If DisplayName statuses are equal, prefer higher Weight.
|
|
if cap.Weight > existing.Weight {
|
|
jobTypeToCap[jobType] = cap
|
|
}
|
|
}
|
|
}
|
|
|
|
for _, jobType := range jobTypeList {
|
|
info := JobTypeInfo{JobType: jobType}
|
|
|
|
// Get display name and weight from pre-calculated capabilities
|
|
if cap, ok := jobTypeToCap[jobType]; ok && cap != nil {
|
|
if cap.DisplayName != "" {
|
|
info.DisplayName = cap.DisplayName
|
|
}
|
|
info.Weight = cap.Weight
|
|
}
|
|
|
|
// Default display name to job type if not set
|
|
if info.DisplayName == "" {
|
|
info.DisplayName = jobType
|
|
}
|
|
|
|
result = append(result, info)
|
|
}
|
|
|
|
// Sort by weight (descending) then by job type (ascending)
|
|
sort.Slice(result, func(i, j int) bool {
|
|
if result[i].Weight != result[j].Weight {
|
|
return result[i].Weight > result[j].Weight // higher weight first
|
|
}
|
|
return result[i].JobType < result[j].JobType // alphabetical as tiebreaker
|
|
})
|
|
|
|
return result, nil
|
|
}

// FilterProposalsWithActiveJobs drops proposals that are already assigned/running.
func (r *Plugin) FilterProposalsWithActiveJobs(jobType string, proposals []*plugin_pb.JobProposal) ([]*plugin_pb.JobProposal, int) {
	return r.filterProposalsWithActiveJobs(jobType, proposals)
}

func (r *Plugin) PickDetectorWorker(jobType string) (*WorkerSession, error) {
	return r.pickDetector(jobType)
}

func (r *Plugin) PickExecutorWorker(jobType string) (*WorkerSession, error) {
	return r.registry.PickExecutor(jobType)
}

func (r *Plugin) pickDetector(jobType string) (*WorkerSession, error) {
	leasedWorkerID := r.getDetectorLease(jobType)
	if leasedWorkerID != "" {
		if worker, ok := r.registry.Get(leasedWorkerID); ok {
			if capability := worker.Capabilities[jobType]; capability != nil && capability.CanDetect {
				return worker, nil
			}
		}
		r.clearDetectorLease(jobType, leasedWorkerID)
	}

	detector, err := r.registry.PickDetector(jobType)
	if err != nil {
		return nil, err
	}

	r.setDetectorLease(jobType, detector.WorkerID)
	return detector, nil
}

func (r *Plugin) getDetectorLease(jobType string) string {
	r.detectorLeaseMu.Lock()
	defer r.detectorLeaseMu.Unlock()
	return r.detectorLeases[jobType]
}

func (r *Plugin) setDetectorLease(jobType string, workerID string) {
	if jobType == "" || workerID == "" {
		return
	}
	r.detectorLeaseMu.Lock()
	defer r.detectorLeaseMu.Unlock()
	r.detectorLeases[jobType] = workerID
}

func (r *Plugin) clearDetectorLease(jobType string, workerID string) {
	r.detectorLeaseMu.Lock()
	defer r.detectorLeaseMu.Unlock()

	current := r.detectorLeases[jobType]
	if current == "" {
		return
	}
	// Only clear when the caller still owns the lease; an empty workerID force-clears.
	if workerID != "" && current != workerID {
		return
	}
	delete(r.detectorLeases, jobType)
}

func (r *Plugin) sendCancel(workerID, targetID string, kind plugin_pb.WorkKind, cause error) {
	if strings.TrimSpace(workerID) == "" || strings.TrimSpace(targetID) == "" {
		return
	}

	requestID, err := newRequestID("cancel")
	if err != nil {
		requestID = ""
	}
	reason := "request canceled"
	if cause != nil {
		reason = cause.Error()
	}

	message := &plugin_pb.AdminToWorkerMessage{
		RequestId: requestID,
		SentAt:    timestamppb.Now(),
		Body: &plugin_pb.AdminToWorkerMessage_CancelRequest{
			CancelRequest: &plugin_pb.CancelRequest{
				TargetId:   targetID,
				TargetKind: kind,
				Reason:     reason,
			},
		},
	}
	if err := r.sendToWorker(workerID, message); err != nil {
		glog.V(1).Infof("Plugin failed to send cancel request to worker=%s target=%s: %v", workerID, targetID, err)
	}
}

func (r *Plugin) sendAdminHello(workerID string) error {
	msg := &plugin_pb.AdminToWorkerMessage{
		RequestId: "",
		SentAt:    timestamppb.Now(),
		Body: &plugin_pb.AdminToWorkerMessage_Hello{
			Hello: &plugin_pb.AdminHello{
				Accepted:                 true,
				Message:                  "plugin connected",
				HeartbeatIntervalSeconds: defaultHeartbeatInterval,
				ReconnectDelaySeconds:    defaultReconnectDelay,
			},
		},
	}
	return r.sendToWorker(workerID, msg)
}

func (r *Plugin) sendLoop(
	ctx context.Context,
	stream plugin_pb.PluginControlService_WorkerStreamServer,
	session *streamSession,
) error {
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-r.shutdownCh:
			return nil
		case msg, ok := <-session.outgoing:
			if !ok {
				return nil
			}
			if err := stream.Send(msg); err != nil {
				return err
			}
		}
	}
}

func (r *Plugin) sendToWorker(workerID string, message *plugin_pb.AdminToWorkerMessage) error {
	r.sessionsMu.RLock()
	session, ok := r.sessions[workerID]
	r.sessionsMu.RUnlock()
	if !ok {
		return fmt.Errorf("worker %s is not connected", workerID)
	}

	select {
	case <-r.shutdownCh:
		return fmt.Errorf("plugin is shutting down")
	case session.outgoing <- message:
		return nil
	case <-time.After(r.sendTimeout):
		return fmt.Errorf("timed out sending message to worker %s", workerID)
	}
}

func (r *Plugin) handleWorkerMessage(workerID string, message *plugin_pb.WorkerToAdminMessage) {
	if message == nil {
		return
	}

	switch body := message.Body.(type) {
	case *plugin_pb.WorkerToAdminMessage_Hello:
		r.registry.UpsertFromHello(body.Hello)
		r.wakeScheduler()
	case *plugin_pb.WorkerToAdminMessage_Heartbeat:
		r.registry.UpdateHeartbeat(workerID, body.Heartbeat)
	case *plugin_pb.WorkerToAdminMessage_ConfigSchemaResponse:
		r.handleConfigSchemaResponse(body.ConfigSchemaResponse)
	case *plugin_pb.WorkerToAdminMessage_DetectionProposals:
		r.handleDetectionProposals(workerID, body.DetectionProposals)
	case *plugin_pb.WorkerToAdminMessage_DetectionComplete:
		r.handleDetectionComplete(workerID, body.DetectionComplete)
	case *plugin_pb.WorkerToAdminMessage_JobProgressUpdate:
		r.handleJobProgressUpdate(workerID, body.JobProgressUpdate)
	case *plugin_pb.WorkerToAdminMessage_JobCompleted:
		r.handleJobCompleted(body.JobCompleted)
	case *plugin_pb.WorkerToAdminMessage_Acknowledge:
		if !body.Acknowledge.Accepted {
			glog.Warningf("Plugin worker %s rejected request %s: %s", workerID, body.Acknowledge.RequestId, body.Acknowledge.Message)
		}
	default:
		// Keep the transport open even if the admin does not yet consume all message variants.
	}
}

func (r *Plugin) handleConfigSchemaResponse(response *plugin_pb.ConfigSchemaResponse) {
	if response == nil {
		return
	}

	if response.Success && response.GetJobTypeDescriptor() != nil {
		jobType := response.JobType
		if jobType == "" {
			jobType = response.GetJobTypeDescriptor().JobType
		}
		if jobType != "" {
			if err := r.store.SaveDescriptor(jobType, response.GetJobTypeDescriptor()); err != nil {
				glog.Warningf("Plugin failed to persist descriptor for %s: %v", jobType, err)
			}
			if err := r.ensureJobTypeConfigFromDescriptor(jobType, response.GetJobTypeDescriptor()); err != nil {
				glog.Warningf("Plugin failed to bootstrap config for %s: %v", jobType, err)
			}
		}
	}

	r.safeSendSchemaResponse(response.RequestId, response)
}

func (r *Plugin) safeSendSchemaResponse(requestID string, response *plugin_pb.ConfigSchemaResponse) {
	r.pendingSchemaMu.Lock()
	ch := r.pendingSchema[requestID]
	r.pendingSchemaMu.Unlock()
	safeSendCh(ch, response, r.shutdownCh)
}

// safeSendCh sends val on ch, giving up if the plugin is shutting down.
// A nil channel is a no-op; the deferred recover absorbs the panic from
// sending on a channel that was closed after the waiter gave up.
func safeSendCh[T any](ch chan T, val T, shutdownCh <-chan struct{}) {
	if ch == nil {
		return
	}
	defer func() {
		recover()
	}()
	select {
	case ch <- val:
	case <-shutdownCh:
	}
}
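The behavior of this send pattern is worth pinning down, since sending on a closed channel normally panics. A self-contained sketch with the same semantics; the `safeSend` name and the `delivered` return (added here so the outcomes are observable) are illustrative:

```go
package main

import "fmt"

// safeSend mirrors safeSendCh's semantics: a nil channel is a no-op,
// shutdown wins over a blocked send, and the panic from sending on a
// channel closed by an impatient receiver is swallowed by recover.
func safeSend[T any](ch chan T, val T, shutdown <-chan struct{}) (delivered bool) {
	if ch == nil {
		return false
	}
	defer func() {
		if recover() != nil {
			delivered = false // send on closed channel panicked
		}
	}()
	select {
	case ch <- val:
		return true
	case <-shutdown:
		return false
	}
}

func main() {
	shutdown := make(chan struct{})

	fmt.Println(safeSend[int](nil, 1, shutdown)) // nil channel: no-op

	closed := make(chan int)
	close(closed)
	fmt.Println(safeSend(closed, 1, shutdown)) // panic recovered

	buffered := make(chan int, 1)
	fmt.Println(safeSend(buffered, 1, shutdown)) // delivered
}
```

This lets the waiter own the channel's lifetime (close on timeout) without the responder needing to coordinate.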

func (r *Plugin) ensureJobTypeConfigFromDescriptor(jobType string, descriptor *plugin_pb.JobTypeDescriptor) error {
	if descriptor == nil || strings.TrimSpace(jobType) == "" {
		return nil
	}

	workerDefaults := CloneConfigValueMap(descriptor.WorkerDefaultValues)
	if len(workerDefaults) == 0 && descriptor.WorkerConfigForm != nil {
		workerDefaults = CloneConfigValueMap(descriptor.WorkerConfigForm.DefaultValues)
	}

	adminDefaults := map[string]*plugin_pb.ConfigValue{}
	if descriptor.AdminConfigForm != nil {
		adminDefaults = CloneConfigValueMap(descriptor.AdminConfigForm.DefaultValues)
	}

	adminRuntime := &plugin_pb.AdminRuntimeConfig{}
	if descriptor.AdminRuntimeDefaults != nil {
		defaults := descriptor.AdminRuntimeDefaults
		adminRuntime = &plugin_pb.AdminRuntimeConfig{
			Enabled:                       defaults.Enabled,
			DetectionIntervalSeconds:      defaults.DetectionIntervalSeconds,
			DetectionTimeoutSeconds:       defaults.DetectionTimeoutSeconds,
			MaxJobsPerDetection:           defaults.MaxJobsPerDetection,
			GlobalExecutionConcurrency:    defaults.GlobalExecutionConcurrency,
			PerWorkerExecutionConcurrency: defaults.PerWorkerExecutionConcurrency,
			RetryLimit:                    defaults.RetryLimit,
			RetryBackoffSeconds:           defaults.RetryBackoffSeconds,
			JobTypeMaxRuntimeSeconds:      defaults.JobTypeMaxRuntimeSeconds,
		}
	}

	cfg := &plugin_pb.PersistedJobTypeConfig{
		JobType:            jobType,
		DescriptorVersion:  descriptor.DescriptorVersion,
		AdminConfigValues:  adminDefaults,
		WorkerConfigValues: workerDefaults,
		AdminRuntime:       adminRuntime,
		UpdatedAt:          timestamppb.Now(),
		UpdatedBy:          "plugin",
	}

	// Check existence first to avoid calling configDefaultsProvider unnecessarily
	// (e.g., it may make a blocking gRPC call to fetch master config).
	existing, err := r.store.LoadJobTypeConfig(jobType)
	if err != nil {
		return err
	}
	if existing != nil {
		return nil
	}

	if r.configDefaultsProvider != nil {
		cfg = r.configDefaultsProvider(cfg)
	}

	_, err = r.store.SaveJobTypeConfigIfNotExists(cfg)
	return err
}

func (r *Plugin) handleDetectionProposals(workerID string, message *plugin_pb.DetectionProposals) {
	if message == nil || message.RequestId == "" {
		return
	}

	r.pendingDetectionMu.Lock()
	state := r.pendingDetection[message.RequestId]
	if state != nil {
		state.proposals = append(state.proposals, cloneJobProposals(message.Proposals)...)
	}
	r.pendingDetectionMu.Unlock()
	if state == nil {
		return
	}

	resolvedWorkerID := strings.TrimSpace(workerID)
	if resolvedWorkerID == "" {
		resolvedWorkerID = state.workerID
	}
	resolvedJobType := strings.TrimSpace(message.JobType)
	if resolvedJobType == "" {
		resolvedJobType = state.jobType
	}
	if resolvedJobType == "" {
		resolvedJobType = "unknown"
	}

	r.appendActivity(JobActivity{
		JobType:    resolvedJobType,
		RequestID:  message.RequestId,
		WorkerID:   resolvedWorkerID,
		Source:     "detector",
		Stage:      "proposals_batch",
		Message:    fmt.Sprintf("received %d proposal(s)", len(message.Proposals)),
		OccurredAt: timeToPtr(time.Now().UTC()),
		Details: map[string]interface{}{
			"batch_size": len(message.Proposals),
			"has_more":   message.HasMore,
		},
	})

	for _, proposal := range message.Proposals {
		if proposal == nil {
			continue
		}
		details := map[string]interface{}{
			"proposal_id": proposal.ProposalId,
			"dedupe_key":  proposal.DedupeKey,
			"priority":    proposal.Priority.String(),
			"summary":     proposal.Summary,
			"detail":      proposal.Detail,
			"labels":      proposal.Labels,
		}
		if params := configValueMapToPlain(proposal.Parameters); len(params) > 0 {
			details["parameters"] = params
		}

		messageText := strings.TrimSpace(proposal.Summary)
		if messageText == "" {
			if id := strings.TrimSpace(proposal.ProposalId); id != "" {
				messageText = fmt.Sprintf("proposal %s", id)
			} else {
				messageText = "proposal generated"
			}
		}

		r.appendActivity(JobActivity{
			JobType:    resolvedJobType,
			RequestID:  message.RequestId,
			WorkerID:   resolvedWorkerID,
			Source:     "detector",
			Stage:      "proposal",
			Message:    messageText,
			OccurredAt: timeToPtr(time.Now().UTC()),
			Details:    details,
		})
	}
}

func (r *Plugin) handleDetectionComplete(workerID string, message *plugin_pb.DetectionComplete) {
	if message == nil {
		return
	}
	if !message.Success {
		glog.Warningf("Plugin detection failed job_type=%s: %s", message.JobType, message.ErrorMessage)
	}
	if message.RequestId == "" {
		return
	}

	r.pendingDetectionMu.Lock()
	state := r.pendingDetection[message.RequestId]
	r.pendingDetectionMu.Unlock()
	if state == nil {
		return
	}

	resolvedWorkerID := strings.TrimSpace(workerID)
	if resolvedWorkerID == "" {
		resolvedWorkerID = state.workerID
	}
	resolvedJobType := strings.TrimSpace(message.JobType)
	if resolvedJobType == "" {
		resolvedJobType = state.jobType
	}
	if resolvedJobType == "" {
		resolvedJobType = "unknown"
	}

	stage := "completed"
	messageText := "detection completed"
	if !message.Success {
		stage = "failed"
		messageText = strings.TrimSpace(message.ErrorMessage)
		if messageText == "" {
			messageText = "detection failed"
		}
	}
	r.appendActivity(JobActivity{
		JobType:    resolvedJobType,
		RequestID:  message.RequestId,
		WorkerID:   resolvedWorkerID,
		Source:     "detector",
		Stage:      stage,
		Message:    messageText,
		OccurredAt: timeToPtr(time.Now().UTC()),
		Details: map[string]interface{}{
			"success":         message.Success,
			"total_proposals": message.TotalProposals,
		},
	})

	r.safeSendDetectionComplete(message.RequestId, message)
}

func (r *Plugin) safeSendDetectionComplete(requestID string, message *plugin_pb.DetectionComplete) {
	r.pendingDetectionMu.Lock()
	state, found := r.pendingDetection[requestID]
	var ch chan *plugin_pb.DetectionComplete
	if found && state != nil {
		ch = state.complete
	}
	r.pendingDetectionMu.Unlock()

	if ch != nil {
		safeSendCh(ch, message, r.shutdownCh)
	}
}

func (r *Plugin) handleJobCompleted(completed *plugin_pb.JobCompleted) {
	if completed == nil || completed.JobType == "" {
		return
	}

	if completed.RequestId != "" {
		r.safeSendJobCompleted(completed.RequestId, completed)
	}

	tracked := r.trackExecutionCompletion(completed)
	workerID := ""
	if tracked != nil && tracked.WorkerID != "" {
		workerID = tracked.WorkerID
	}

	r.trackWorkerActivities(completed.JobType, completed.JobId, completed.RequestId, workerID, completed.Activities)

	record := &JobRunRecord{
		RunID:       completed.RequestId,
		JobID:       completed.JobId,
		JobType:     completed.JobType,
		Outcome:     RunOutcomeError,
		CompletedAt: timeToPtr(time.Now().UTC()),
	}
	if completed.CompletedAt != nil {
		record.CompletedAt = timeToPtr(completed.CompletedAt.AsTime().UTC())
	}
	if completed.Success {
		record.Outcome = RunOutcomeSuccess
		record.Message = "completed"
		if completed.Result != nil && completed.Result.Summary != "" {
			record.Message = completed.Result.Summary
		}
	} else {
		record.Message = completed.ErrorMessage
	}

	if tracked != nil {
		if workerID != "" {
			record.WorkerID = workerID
		}
		if tracked.CreatedAt != nil && record.CompletedAt != nil && record.CompletedAt.After(*tracked.CreatedAt) {
			record.DurationMs = int64(record.CompletedAt.Sub(*tracked.CreatedAt) / time.Millisecond)
		}
	}

	if err := r.store.AppendRunRecord(completed.JobType, record); err != nil {
		glog.Warningf("Plugin failed to append run record for %s: %v", completed.JobType, err)
	}
}

func (r *Plugin) safeSendJobCompleted(requestID string, completed *plugin_pb.JobCompleted) {
	r.pendingExecutionMu.Lock()
	ch := r.pendingExecution[requestID]
	r.pendingExecutionMu.Unlock()
	safeSendCh(ch, completed, r.shutdownCh)
}

func (r *Plugin) putSession(session *streamSession) {
	r.sessionsMu.Lock()
	defer r.sessionsMu.Unlock()

	if old, exists := r.sessions[session.workerID]; exists {
		old.close()
	}
	r.sessions[session.workerID] = session
}

func (r *Plugin) cleanupSession(workerID string) {
	r.registry.Remove(workerID)

	r.sessionsMu.Lock()
	session, exists := r.sessions[workerID]
	if exists {
		delete(r.sessions, workerID)
		session.close()
	}
	r.sessionsMu.Unlock()

	glog.V(0).Infof("Plugin worker disconnected: %s", workerID)
}

func newRequestID(prefix string) (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate request id: %w", err)
	}
	if prefix == "" {
		prefix = "req"
	}
	return fmt.Sprintf("%s-%d-%s", prefix, time.Now().UnixNano(), hex.EncodeToString(buf)), nil
}

func (r *Plugin) loadJobTypeConfigPayload(jobType string) (
	*plugin_pb.AdminRuntimeConfig,
	map[string]*plugin_pb.ConfigValue,
	map[string]*plugin_pb.ConfigValue,
	error,
) {
	config, err := r.store.LoadJobTypeConfig(jobType)
	if err != nil {
		return nil, nil, nil, err
	}

	if config == nil {
		return &plugin_pb.AdminRuntimeConfig{}, map[string]*plugin_pb.ConfigValue{}, map[string]*plugin_pb.ConfigValue{}, nil
	}

	adminRuntime := config.AdminRuntime
	if adminRuntime == nil {
		adminRuntime = &plugin_pb.AdminRuntimeConfig{}
	}
	return adminRuntime, CloneConfigValueMap(config.AdminConfigValues), CloneConfigValueMap(config.WorkerConfigValues), nil
}

func cloneJobProposals(in []*plugin_pb.JobProposal) []*plugin_pb.JobProposal {
	if len(in) == 0 {
		return nil
	}
	out := make([]*plugin_pb.JobProposal, 0, len(in))
	for _, proposal := range in {
		if proposal == nil {
			continue
		}
		out = append(out, proto.Clone(proposal).(*plugin_pb.JobProposal))
	}
	return out
}

func (r *Plugin) loadLastSuccessfulRun(jobType string) *timestamppb.Timestamp {
	history, err := r.store.LoadRunHistory(jobType)
	if err != nil {
		glog.Warningf("Plugin failed to load run history for %s: %v", jobType, err)
		return nil
	}
	if history == nil || len(history.SuccessfulRuns) == 0 {
		return nil
	}

	var latest time.Time
	for i := range history.SuccessfulRuns {
		completedAt := history.SuccessfulRuns[i].CompletedAt
		if completedAt == nil || completedAt.IsZero() {
			continue
		}
		if latest.IsZero() || completedAt.After(latest) {
			latest = *completedAt
		}
	}
	if latest.IsZero() {
		return nil
	}
	return timestamppb.New(latest.UTC())
}

func (r *Plugin) loadLastCompletedRun(jobType string) *timestamppb.Timestamp {
	history, err := r.store.LoadRunHistory(jobType)
	if err != nil {
		glog.Warningf("Plugin failed to load run history for %s: %v", jobType, err)
		return nil
	}
	if history == nil {
		return nil
	}

	var latest time.Time
	for i := range history.SuccessfulRuns {
		completedAt := history.SuccessfulRuns[i].CompletedAt
		if completedAt == nil || completedAt.IsZero() {
			continue
		}
		if latest.IsZero() || completedAt.After(latest) {
			latest = *completedAt
		}
	}
	for i := range history.ErrorRuns {
		completedAt := history.ErrorRuns[i].CompletedAt
		if completedAt == nil || completedAt.IsZero() {
			continue
		}
		if latest.IsZero() || completedAt.After(latest) {
			latest = *completedAt
		}
	}
	if latest.IsZero() {
		return nil
	}
	return timestamppb.New(latest.UTC())
}

func CloneConfigValueMap(in map[string]*plugin_pb.ConfigValue) map[string]*plugin_pb.ConfigValue {
	if len(in) == 0 {
		return map[string]*plugin_pb.ConfigValue{}
	}
	out := make(map[string]*plugin_pb.ConfigValue, len(in))
	for key, value := range in {
		if value == nil {
			continue
		}
		out[key] = proto.Clone(value).(*plugin_pb.ConfigValue)
	}
	return out
}

func (s *streamSession) close() {
	s.closeOnce.Do(func() {
		close(s.outgoing)
	})
}

// WorkerConnectForTest simulates a worker connecting (test helper).
func (r *Plugin) WorkerConnectForTest(hello *plugin_pb.WorkerHello) {
	r.registry.UpsertFromHello(hello)
}

// WorkerDisconnectForTest simulates a worker disconnecting (test helper).
func (r *Plugin) WorkerDisconnectForTest(workerID string) {
	r.registry.Remove(workerID)
}