* Fix: Add -admin.grpc flag to worker for explicit gRPC port configuration
* Fix(helm): Add adminGrpcServer to worker configuration
* Refactor: Support host:port.grpcPort address format, revert -admin.grpc flag
* Helm: Conditionally append grpcPort to worker admin address
* weed/admin: fix "send on closed channel" panic in worker gRPC server
Make unregisterWorker connection-aware to prevent closing channels
belonging to newer connections.
* weed/worker: improve gRPC client stability and logging
- Fix goroutine leak in reconnection logic
- Refactor reconnection loop to exit on success and prevent busy-waiting
- Add session identification and enhanced logging to client handlers
- Use constant for internal reset action and remove unused variables
* weed/worker: fix worker state initialization and add lifecycle logs
- Revert workerState to use running boolean correctly
- Prevent handleStart failing by checking running state instead of startTime
- Add more detailed logs for worker startup events
* feat: Add probes to worker service
* feat: Add probes to worker service
* Merge branch 'master' into pr/7896
* refactor
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* feat(worker): add metrics HTTP server and debug profiling support
- Add -metricsPort flag to enable Prometheus metrics endpoint
- Add -metricsIp flag to configure metrics server bind address
- Implement /metrics endpoint for Prometheus-compatible metrics
- Implement /health endpoint for Kubernetes readiness/liveness probes
- Add -debug flag to enable pprof debugging server
- Add -debug.port flag to configure debug server port
- Fix stats package import naming conflict by using alias
- Update usage examples to show new flags
Fixes#7843
* feat(helm): add worker metrics and health check support
- Update worker readiness probe to use httpGet on /health endpoint
- Update worker liveness probe to use httpGet on /health endpoint
- Add metricsPort flag to worker command in deployment template
- Support both httpGet and tcpSocket probe types for backward compatibility
- Update values.yaml with health check configuration
This enables Kubernetes pod lifecycle management for worker components through
proper health checks on the new metrics HTTP endpoint.
* feat(mini): align all services to share single debug and metrics servers
- Disable S3's separate debug server in mini mode (port 6060 now shared by all)
- Add metrics server startup to embedded worker for health monitoring
- All services now share the single metrics port (9327) and single debug port (6060)
- Consistent pattern with master, filer, volume, webdav services
* fix(worker): fix variable shadowing in health check handler
- Rename http.ResponseWriter parameter from 'w' to 'rw' to avoid shadowing
the outer 'w *worker.Worker' parameter
- Prevents potential bugs if future code tries to use worker state in handler
- Improves code clarity and follows Go best practices
* fix(worker): remove unused worker parameter in metrics server
- Change 'w *worker.Worker' parameter to '_' as it's not used
- Clarifies intent that parameter is intentionally unused
- Follows Go best practices and improves code clarity
* fix(helm): fix trailing backslash syntax errors in worker command
- Fix conditional backslash placement to prevent shell syntax errors
- Only add backslash when metricsPort OR extraArgs are present
- Prevents worker pod startup failures due to malformed command arguments
- Ensures proper shell command parsing regardless of configuration state
* refactor(worker): use standard stats.StartMetricsServer for consistency
- Replace custom metrics server implementation with stats.StartMetricsServer
to match pattern used in master, volume, s3, filer_sync components
- Simplifies code and improves maintainability
- Uses glog.Fatal for errors (consistent with other SeaweedFS components)
- Remove unused net/http and prometheus/promhttp imports
- Automatically provides /metrics and /health endpoints via standard implementation
* initial design
* added simulation as tests
* reorganized the codebase to move the simulation framework and tests into their own dedicated package
* integration test. ec worker task
* remove "enhanced" reference
* start master, volume servers, filer
Current Status
✅ Master: Healthy and running (port 9333)
✅ Filer: Healthy and running (port 8888)
✅ Volume Servers: All 6 servers running (ports 8080-8085)
🔄 Admin/Workers: Will start when dependencies are ready
* generate write load
* tasks are assigned
* admin start wtih grpc port. worker has its own working directory
* Update .gitignore
* working worker and admin. Task detection is not working yet.
* compiles, detection uses volumeSizeLimitMB from master
* compiles
* worker retries connecting to admin
* build and restart
* rendering pending tasks
* skip task ID column
* sticky worker id
* test canScheduleTaskNow
* worker reconnect to admin
* clean up logs
* worker register itself first
* worker can run ec work and report status
but:
1. one volume should not be repeatedly worked on.
2. ec shards needs to be distributed and source data should be deleted.
* move ec task logic
* listing ec shards
* local copy, ec. Need to distribute.
* ec is mostly working now
* distribution of ec shards needs improvement
* need configuration to enable ec
* show ec volumes
* interval field UI component
* rename
* integration test with vauuming
* garbage percentage threshold
* fix warning
* display ec shard sizes
* fix ec volumes list
* Update ui.go
* show default values
* ensure correct default value
* MaintenanceConfig use ConfigField
* use schema defined defaults
* config
* reduce duplication
* refactor to use BaseUIProvider
* each task register its schema
* checkECEncodingCandidate use ecDetector
* use vacuumDetector
* use volumeSizeLimitMB
* remove
remove
* remove unused
* refactor
* use new framework
* remove v2 reference
* refactor
* left menu can scroll now
* The maintenance manager was not being initialized when no data directory was configured for persistent storage.
* saving config
* Update task_config_schema_templ.go
* enable/disable tasks
* protobuf encoded task configurations
* fix system settings
* use ui component
* remove logs
* interface{} Reduction
* reduce interface{}
* reduce interface{}
* avoid from/to map
* reduce interface{}
* refactor
* keep it DRY
* added logging
* debug messages
* debug level
* debug
* show the log caller line
* use configured task policy
* log level
* handle admin heartbeat response
* Update worker.go
* fix EC rack and dc count
* Report task status to admin server
* fix task logging, simplify interface checking, use erasure_coding constants
* factor in empty volume server during task planning
* volume.list adds disk id
* track disk id also
* fix locking scheduled and manual scanning
* add active topology
* simplify task detector
* ec task completed, but shards are not showing up
* implement ec in ec_typed.go
* adjust log level
* dedup
* implementing ec copying shards and only ecx files
* use disk id when distributing ec shards
🎯 Planning: ActiveTopology creates DestinationPlan with specific TargetDisk
📦 Task Creation: maintenance_integration.go creates ECDestination with DiskId
🚀 Task Execution: EC task passes DiskId in VolumeEcShardsCopyRequest
💾 Volume Server: Receives disk_id and stores shards on specific disk (vs.store.Locations[req.DiskId])
📂 File System: EC shards and metadata land in the exact disk directory planned
* Delete original volume from all locations
* clean up existing shard locations
* local encoding and distributing
* Update docker/admin_integration/EC-TESTING-README.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* check volume id range
* simplify
* fix tests
* fix types
* clean up logs and tests
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* add ui for maintenance
* valid config loading. fix workers page.
* refactor
* grpc between admin and workers
* add a long-running bidirectional grpc call between admin and worker
* use the grpc call to heartbeat
* use the grpc call to communicate
* worker can remove the http client
* admin uses http port + 10000 as its default grpc port
* one task one package
* handles connection failures gracefully with exponential backoff
* grpc with insecure tls
* grpc with optional tls
* fix detecting tls
* change time config from nano seconds to seconds
* add tasks with 3 interfaces
* compiles reducing hard coded
* remove a couple of tasks
* remove hard coded references
* reduce hard coded values
* remove hard coded values
* remove hard coded from templ
* refactor maintenance package
* fix import cycle
* simplify
* simplify
* auto register
* auto register factory
* auto register task types
* self register types
* refactor
* simplify
* remove one task
* register ui
* lazy init executor factories
* use registered task types
* DefaultWorkerConfig remove hard coded task types
* remove more hard coded
* implement get maintenance task
* dynamic task configuration
* "System Settings" should only have system level settings
* adjust menu for tasks
* ensure menu not collapsed
* render job configuration well
* use templ for ui of task configuration
* fix ordering
* fix bugs
* saving duration in seconds
* use value and unit for duration
* Delete WORKER_REFACTORING_PLAN.md
* Delete maintenance.json
* Delete custom_worker_example.go
* remove address from workers
* remove old code from ec task
* remove creating collection button
* reconnect with exponential backoff
* worker use security.toml
* start admin server with tls info from security.toml
* fix "weed admin" cli description