* Add iceberg_maintenance plugin worker handler (Phase 1)
Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.
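A minimal sketch of that ordering, with illustrative operation names and wiring (not the handler's exact identifiers):
```go
package iceberg

import (
	"context"
	"fmt"
)

// Illustrative order only; the real handler may use different operation names.
var maintenanceOrder = []string{
	"expire_snapshots",  // drop expired snapshots so their files become unreferenced
	"remove_orphans",    // then delete files no surviving snapshot references
	"rewrite_manifests", // finally consolidate manifests for the remaining snapshots
}

func runMaintenance(ctx context.Context, table string,
	run func(ctx context.Context, table, op string) error) error {
	for _, op := range maintenanceOrder {
		if err := run(ctx, table, op); err != nil {
			return fmt.Errorf("%s on %s: %w", op, table, err)
		}
	}
	return nil
}
```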
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add data file compaction to iceberg maintenance handler (Phase 2)
Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata
Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.
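A rough sketch of the bin-packing step, assuming a simplified data-file shape (type and field names are illustrative):
```go
package iceberg

// dataFile is a simplified stand-in for a manifest data-file entry.
type dataFile struct {
	Path      string
	Partition string
	SizeBytes int64
}

// planCompactionBins groups small files by partition and greedily packs each
// group into bins of roughly targetFileSizeBytes. Bins with fewer than
// minInputFiles entries are not worth rewriting and are skipped.
func planCompactionBins(files []dataFile, targetFileSizeBytes int64, minInputFiles int) [][]dataFile {
	byPartition := map[string][]dataFile{}
	for _, f := range files {
		if f.SizeBytes < targetFileSizeBytes { // only small files are candidates
			byPartition[f.Partition] = append(byPartition[f.Partition], f)
		}
	}
	var bins [][]dataFile
	for _, group := range byPartition {
		var bin []dataFile
		var binSize int64
		for _, f := range group {
			if len(bin) > 0 && binSize+f.SizeBytes > targetFileSizeBytes {
				if len(bin) >= minInputFiles {
					bins = append(bins, bin)
				}
				bin, binSize = nil, 0
			}
			bin = append(bin, f)
			binSize += f.SizeBytes
		}
		if len(bin) >= minInputFiles {
			bins = append(bins, bin)
		}
	}
	return bins
}
```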
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix memory exhaustion in mergeParquetFiles by processing files sequentially
Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.
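A sketch of the sequential shape, using hypothetical row reader/writer interfaces rather than the actual parquet-go types:
```go
import (
	"fmt"
	"io"
)

// rowSource and rowSink are hypothetical stand-ins for the parquet-go
// reader and writer used by the real handler.
type rowSource interface {
	io.Closer
	ReadRow() (any, error) // returns io.EOF when exhausted
}
type rowSink interface {
	WriteRow(row any) error
}

// mergeSequentially streams one source at a time into the sink, so peak
// memory stays near a single open input plus the output buffer.
func mergeSequentially(open func(path string) (rowSource, error), sources []string, out rowSink) error {
	for _, src := range sources {
		rows, err := open(src)
		if err != nil {
			return fmt.Errorf("open %s: %w", src, err)
		}
		err = copyRows(rows, out)
		rows.Close() // release this file's data before opening the next
		if err != nil {
			return fmt.Errorf("merge %s: %w", src, err)
		}
	}
	return nil
}

func copyRows(rows rowSource, out rowSink) error {
	for {
		row, err := rows.ReadRow()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if err := out.WriteRow(row); err != nil {
			return err
		}
	}
}
```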
* Validate bucket/namespace/table names against path traversal
Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.
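The check is essentially this (function and parameter names are assumptions):
```go
import (
	"fmt"
	"strings"
)

// validateName rejects job parameters that could escape the intended
// directory when joined into a filer path.
func validateName(kind, name string) error {
	if name == "" || strings.Contains(name, "..") || strings.ContainsAny(name, `/\`) {
		return fmt.Errorf("invalid %s name %q", kind, name)
	}
	return nil
}
```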
* Add filer address failover in iceberg maintenance handler
Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.
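Roughly this pattern (the helper name is hypothetical):
```go
import (
	"context"
	"fmt"
)

// tryEachFiler runs op against each filer address in order and returns the
// first success, or the last error if every address fails.
func tryEachFiler(ctx context.Context, filers []string, op func(ctx context.Context, filer string) error) error {
	var lastErr error
	for _, filer := range filers {
		if err := ctx.Err(); err != nil {
			return err
		}
		if lastErr = op(ctx, filer); lastErr == nil {
			return nil
		}
	}
	if lastErr == nil {
		lastErr = fmt.Errorf("no filer addresses available")
	}
	return lastErr
}
```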
* Add separate MinManifestsToRewrite config for manifest rewrite threshold
The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.
* Fix risky mtime fallback in orphan removal that could delete new files
When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.
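The safer check, as a sketch:
```go
import "time"

// eligibleForOrphanDeletion treats entries without attributes as never
// eligible, instead of defaulting their mtime to the Unix epoch.
func eligibleForOrphanDeletion(hasAttributes bool, mtimeUnix int64, safetyWindow time.Duration) bool {
	if !hasAttributes {
		return false // cannot establish mtime; never delete on a guess
	}
	return time.Since(time.Unix(mtimeUnix, 0)) >= safetyWindow
}
```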
* Fix undefined function references in iceberg_maintenance_handler.go
Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.
* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage
The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.
Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.
* Fix MinManifestsToRewrite clamping to match UI minimum of 2
The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.
* Sort entries by size descending in splitOversizedBin for better packing
Entries were processed in insertion order, which is non-deterministic
because the bins are built from map iteration. Sorting largest-first
before the splitting loop improves bin-packing efficiency by filling
bins more evenly.
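Roughly:
```go
// Deterministic, largest-first ordering before the greedy split
// (entries is the bin's slice of data-file entries).
sort.Slice(entries, func(i, j int) bool {
	return entries[i].SizeBytes > entries[j].SizeBytes
})
```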
* Add context cancellation check to drainReader loop
The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.
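The added check is the usual non-blocking select at the top of the loop:
```go
for {
	select {
	case <-ctx.Done():
		return ctx.Err() // abort a long-running merge promptly
	default:
	}
	// ... read the next row and write it to the output ...
}
```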
* Fix splitOversizedBin to always respect targetSize limit
The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.
* Add integration tests for iceberg table maintenance plugin worker
Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.
Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.
* Fix splitOversizedBin dropping files and add source path to drainReader errors
The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.
Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.
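A sketch of the merge, reusing the simplified dataFile shape from the Phase 2 sketch above:
```go
// mergeRuntBins folds any bin with fewer than minFiles entries into its
// smaller adjacent neighbor until no runt bins remain (or only one bin is
// left), so no eligible file is dropped.
func mergeRuntBins(bins [][]dataFile, minFiles int) [][]dataFile {
	size := func(bin []dataFile) int64 {
		var s int64
		for _, f := range bin {
			s += f.SizeBytes
		}
		return s
	}
	for len(bins) > 1 {
		runt := -1
		for i, bin := range bins {
			if len(bin) < minFiles {
				runt = i
				break
			}
		}
		if runt < 0 {
			break // every bin satisfies minFiles
		}
		target := runt - 1 // default: fold into the left neighbor
		if runt == 0 || (runt+1 < len(bins) && size(bins[runt+1]) < size(bins[runt-1])) {
			target = runt + 1 // fold right when it is smaller, or there is no left
		}
		bins[target] = append(bins[target], bins[runt]...)
		bins = append(bins[:runt], bins[runt+1:]...)
	}
	return bins
}
```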
* Harden integration test error handling
- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
CreateEntryResponse.Error in addition to the RPC error
* go fmt
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: auto-disable master vacuum when plugin vacuum worker is active
When a vacuum-capable plugin worker connects to the admin server, the
admin server calls DisableVacuum on the master to prevent the automatic
scheduled vacuum from conflicting with the plugin worker's vacuum. When
the worker disconnects, EnableVacuum is called to restore the default
behavior. A safety net in the topology refresh loop re-enables vacuum
if the admin server disconnects without cleanup.
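The monitor loop is roughly this shape (names and wiring are illustrative, not the admin server's exact API):
```go
import (
	"context"
	"time"
)

// monitorVacuumWorker keeps master vacuum disabled while a vacuum-capable
// plugin worker is connected, and re-enables it when the worker goes away.
func monitorVacuumWorker(ctx context.Context, interval time.Duration,
	workerConnected func() bool,
	disableVacuum, enableVacuum func(ctx context.Context) error) {

	disabled := false
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		switch connected := workerConnected(); {
		case connected && !disabled:
			if disableVacuum(ctx) == nil {
				disabled = true
			}
		case !connected && disabled:
			if enableVacuum(ctx) == nil {
				disabled = false
			}
		}
	}
}
```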
* rename isAdminServerConnected to isAdminServerConnectedFunc
* add 5s timeout to DisableVacuum/EnableVacuum gRPC calls
Prevents the monitor goroutine from blocking indefinitely if the
master is unresponsive.
* track plugin ownership of vacuum disable to avoid overriding operator
- Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum
is called while admin server is connected (i.e., by plugin monitor)
- Safety net only re-enables vacuum when it was disabled by plugin,
not when an operator intentionally disabled it via shell command
- EnableVacuum clears the plugin flag
* extract syncVacuumState for testability, add fake toggler tests
Extract the single sync step into syncVacuumState() with a
vacuumToggler interface. Add TestSyncVacuumState with a fake
toggler that verifies disable/enable calls on state transitions.
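Sketch of the seam (the real signature differs, e.g. it later gains a retrying parameter):
```go
import "context"

// vacuumToggler abstracts the master RPCs so syncVacuumState can be unit
// tested against a fake.
type vacuumToggler interface {
	DisableVacuum(ctx context.Context) error
	EnableVacuum(ctx context.Context) error
}

// syncVacuumState performs one reconciliation step between the desired state
// (worker connected => vacuum disabled) and the last applied state.
func syncVacuumState(ctx context.Context, t vacuumToggler, workerConnected, disabled bool) (bool, error) {
	switch {
	case workerConnected && !disabled:
		if err := t.DisableVacuum(ctx); err != nil {
			return disabled, err
		}
		return true, nil
	case !workerConnected && disabled:
		if err := t.EnableVacuum(ctx); err != nil {
			return disabled, err
		}
		return false, nil
	default:
		return disabled, nil
	}
}
```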
* use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin
Both fields are written by gRPC handlers and read by the vacuum
goroutine, causing a data race. Use atomic.Bool with Store/Load
for thread-safe access.
* use explicit by_plugin field instead of connection heuristic
Add by_plugin bool to DisableVacuumRequest proto so the caller
declares intent explicitly. The admin server monitor sets it to
true; shell commands leave it false. This prevents an operator's
intentional disable from being auto-reversed by the safety net.
* use setter for admin server callback instead of function parameter
Move isAdminServerConnected from StartRefreshWritableVolumes
parameter to Topology.SetAdminServerConnectedFunc() setter.
Keeps the function signature stable and decouples the topology
layer from the admin server concept.
* suppress repeated log messages on persistent sync failures
Add retrying parameter to syncVacuumState so the initial
state transition is logged at V(0) but subsequent retries
of the same transition are silent until the call succeeds.
* clear plugin ownership flag on manual DisableVacuum
Prevents stale plugin flag from causing incorrect auto-enable
when an operator manually disables vacuum after a plugin had
previously disabled it.
* add by_plugin to EnableVacuumRequest for symmetric ownership tracking
Plugin-driven EnableVacuum now only re-enables if the plugin was
the one that disabled it. If an operator manually disabled vacuum
after the plugin, the plugin's EnableVacuum is a no-op. This
prevents the plugin monitor from overriding operator intent on
worker disconnect.
* use cancellable context for monitorVacuumWorker goroutine
Replace context.Background() with a cancellable context stored
as bgCancel on AdminServer. Shutdown() calls bgCancel() so
monitorVacuumWorker exits cleanly via ctx.Done().
* track operator and plugin vacuum disables independently
Replace single isDisableVacuum flag with two independent flags:
vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller
only flips its own flag. The effective disabled state is the OR
of both. This prevents a plugin connect/disconnect cycle from
overriding an operator's manual disable, and vice versa.
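Effectively:
```go
import "sync/atomic"

type vacuumState struct {
	disabledByOperator atomic.Bool // flipped only by shell commands
	disabledByPlugin   atomic.Bool // flipped only by the plugin worker monitor
}

// Disabled is the effective state: vacuum is off if either owner disabled it.
func (s *vacuumState) Disabled() bool {
	return s.disabledByOperator.Load() || s.disabledByPlugin.Load()
}
```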
* fix safety net to clear plugin flag, not operator flag
The safety net should call EnableVacuumByPlugin() to clear only
the plugin disable flag when the admin server disconnects. The
previous call to EnableVacuum() incorrectly cleared the operator
flag instead.
* master: return 503/Unavailable during topology warmup after leader change
After a master restart or leader change, the topology is empty until
volume servers reconnect and send heartbeats. During this warmup window
(3 heartbeat intervals = 15 seconds), volume lookups that fail now
return 503 Service Unavailable (HTTP) or gRPC Unavailable instead of
404 Not Found, signaling clients to retry with other masters.
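On the gRPC side the gate looks roughly like this (IsWarmingUp and the exact wiring are assumptions):
```go
// uses google.golang.org/grpc/codes and google.golang.org/grpc/status
if notFound && topo.IsWarmingUp() {
	return nil, status.Error(codes.Unavailable, "master topology is warming up, please retry")
}
```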
* master: skip warmup 503 on fresh start and single-master setups
- Check MaxVolumeId > 0 to distinguish restart from fresh start
(MaxVolumeId is Raft-persisted, so 0 means no prior data)
- Check peer count > 1 so single-master deployments aren't affected
(no point suggesting "retry with other masters" if there are none)
* master: address review feedback and block assigns during warmup
- Protect LastLeaderChangeTime with dedicated mutex (fix data race)
- Extract warmup multiplier as WarmupPulseMultiplier constant
- Derive Retry-After header from pulse config instead of hardcoding
- Only trigger warmup 503 for "not found" errors, not parse errors
- Return nil response (not partial) on gRPC Unavailable
- Add doc comments to IsWarmingUp, getter/setter, WarmupDuration
- Block volume assign requests (HTTP and gRPC) during warmup,
since the topology is incomplete and assignments would be unreliable
- Skip warmup behavior for single-master setups (no peers to retry)
* master: apply warmup to all setups, skip only on fresh start
Single-master restarts still have an empty topology until heartbeats
arrive, so warmup protection should apply there too. The only case
to skip is a fresh cluster start (MaxVolumeId == 0), which already
has no volumes to look up.
- Remove GetMasterCount() > 1 guard from all warmup checks
- Remove now-unused GetMasterCount helper
- Update error messages to "topology is still loading" (not
"retry with other masters" which doesn't apply to single-master)
* master: add client-side retry on Unavailable for lookup and assign
The server-side 503/Unavailable during warmup needs client cooperation.
Previously, LookupVolumeIds and Assign would immediately propagate the
error without retry.
Now both paths retry with exponential backoff (1s -> 1.5s -> ... up to
6s) when receiving Unavailable, respecting context cancellation. This
covers the warmup window where the master's topology is still loading
after a restart or leader change.
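The client-side retry shape, as a sketch (later commits in this series extract it into a shared helper):
```go
import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryOnUnavailable retries op with exponential backoff while the master
// reports codes.Unavailable, e.g. during its warmup window.
func retryOnUnavailable(ctx context.Context, op func() error) error {
	wait := time.Second
	const maxWait = 6 * time.Second
	for {
		err := op()
		if err == nil || status.Code(err) != codes.Unavailable {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
		if wait = wait * 3 / 2; wait > maxWait {
			wait = maxWait
		}
	}
}
```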
* master: seed warmup timestamp in legacy raft path at setup
The legacy raft path only set lastLeaderChangeTime inside the event
listener callback, which could fire after IsLeader() was already
observed as true in SetRaftServer. Seed the timestamp at setup time
(matching the hashicorp path) so IsWarmingUp() is active immediately.
* master: fix assign retry loop to cover full warmup window
The retry loop used waitTime <= maxWaitTime as a stop condition,
causing it to give up after ~13s while warmup lasts 15s. Now cap
each individual sleep at maxWaitTime but keep retrying until the
context is cancelled.
* master: preserve gRPC status in lookup retry and fix retry window
Return the raw gRPC error instead of wrapping with fmt.Errorf so
status.FromError() can extract the status code. Use proper gRPC
status check (codes.Unavailable) instead of string matching. Also
cap individual sleep at maxWaitTime while retrying until ctx is done.
* master: use gRPC status code instead of string matching in assign retry
Use status.FromError/codes.Unavailable instead of brittle
strings.Contains for detecting retriable gRPC errors in the
assign retry loop.
* master: use remaining warmup duration for Retry-After header
Set Retry-After to the remaining warmup time instead of the full
warmup duration, so clients don't wait longer than necessary.
* master: reset ret.Replicas before populating from assign response
Clear Replicas slice before appending to prevent duplicate entries
when the assign response is retried or when alternative requests
are attempted.
* master: add unit tests for warmup retry behavior
Test that Assign() and LookupVolumeIds() retry on codes.Unavailable
and stop promptly when the context is cancelled.
* master: record leader change time before initialization work
Move SetLastLeaderChangeTime() to fire immediately when the leader
change event is received, before DoBarrier(), EnsureTopologyId(),
and updatePeers(), so the warmup clock starts at the true moment
of leadership transition.
* master: use topology warmup duration in volume growth wait loop
Replace hardcoded constants.VolumePulsePeriod * 2 with
topo.IsWarmingUp() and topo.WarmupDuration() so the growth wait
stays in sync with the configured warmup window. Remove unused
constants import.
* master: resolve master before creating RPC timeout context
Move GetMaster() call before context.WithTimeout() so master
resolution blocking doesn't consume the gRPC call timeout.
* master: use NotFound flag instead of string matching for volume lookup
Add a NotFound field to LookupResult and set it in findVolumeLocation
when a volume is genuinely missing. Update HTTP and gRPC warmup
checks to use this flag instead of strings.Contains on the error
message.
* master: bound assign retry loop to 30s for deadline-free contexts
Without a context deadline, the Unavailable retry loop could spin
forever. Add a maxRetryDuration of 30s so the loop gives up even
when no context deadline is set.
* master: strengthen assign retry cancellation test
Verify the retry loop actually retried (callCount > 1) and that
the returned error is context.DeadlineExceeded, not just any error.
* master: extract shared retry-with-backoff utility
Add util.RetryWithBackoff for context-aware, bounded retry with
exponential backoff. Refactor both Assign() and LookupVolumeIds()
to use it instead of duplicating the retry/sleep/backoff logic.
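A plausible shape for the helper after the follow-up fixes below; the actual signature in util is an assumption:
```go
package util

import (
	"context"
	"time"
)

// RetryWithBackoff retries operation with exponential backoff until it
// succeeds, reports a non-retryable error, the context ends, or maxDuration
// elapses.
func RetryWithBackoff(ctx context.Context, maxDuration, initialWait, maxWait time.Duration,
	operation func() (retryable bool, err error)) error {

	deadline := time.Now().Add(maxDuration)
	wait := initialWait
	var lastErr error
	for {
		if err := ctx.Err(); err != nil {
			return err // the cancellation reason always wins
		}
		if time.Now().After(deadline) {
			return lastErr
		}
		retryable, err := operation()
		if err == nil || !retryable {
			return err
		}
		lastErr = err
		sleep := wait
		if remaining := time.Until(deadline); sleep > remaining {
			sleep = remaining // never overshoot the retry budget
		}
		timer := time.NewTimer(sleep)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		if wait = wait * 3 / 2; wait > maxWait {
			wait = maxWait
		}
	}
}
```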
* master: cap waitTime in RetryWithBackoff to prevent unbounded growth
Cap the backoff waitTime at maxWaitTime so it doesn't grow
indefinitely in long-running retry scenarios.
* master: only return Unavailable during warmup when all lookups failed
For batched LookupVolume requests, return partial results when some
volumes are found. Only return codes.Unavailable when no volumes
were successfully resolved, so clients benefit from partial results
instead of retrying unnecessarily.
* master: set retriable error message in 503 response body
When returning 503 during warmup, replace the "not found" error
in the JSON body with "service warming up, please retry" so
clients don't treat it as a permanent error.
* master: guard empty master address in LookupVolumeIds
If GetMaster() returns empty (no master found or ctx cancelled),
return an appropriate error instead of dialing an empty address.
Returns ctx.Err() if context is done, otherwise codes.Unavailable
to trigger retry.
* master: add comprehensive tests for RetryWithBackoff
Test success after retries, non-retryable error handling, context
cancellation, and maxDuration cap with context.Background().
* master: enforce hard maxDuration bound in RetryWithBackoff
Use a deadline instead of elapsed-time check so the last sleep is
capped to remaining time. This prevents the total retry duration
from overshooting maxDuration by up to one full backoff interval.
* master: respect fresh-start bypass in RemainingWarmupDuration
Check IsWarmingUp() first (which returns false when MaxVolumeId==0)
so RemainingWarmupDuration returns 0 on fresh clusters.
* master: round up Retry-After seconds to avoid underestimating
Use math.Ceil so fractional remaining seconds (e.g. 1.9s) round
up to the next integer (2) instead of flooring down (1).
* master: tighten batch lookup warmup to all-NotFound only
Only return codes.Unavailable when every requested volume ID was
a transient not-found. Mixed cases with non-NotFound errors now
return the response with per-volume error details preserved.
* master: reduce retry log noise and fix timer leak
Lower per-attempt retry log from V(0) to V(1) to reduce noise
during warmup. Replace time.After with time.NewTimer to avoid
lingering timers when context is cancelled.
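The difference in isolation:
```go
// Before: time.After's timer keeps running until waitTime elapses even if
// we return on ctx.Done(), so cancelled retries leave timers behind.
// After: an explicit timer can be stopped on the cancellation path.
timer := time.NewTimer(waitTime)
select {
case <-ctx.Done():
	timer.Stop()
	return ctx.Err()
case <-timer.C:
}
```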
* master: add per-attempt timeout for assign RPC
Use a 10s per-attempt timeout so a single slow RPC can't consume
the entire 30s retry budget when ctx has no deadline.
* master: share single 30s retry deadline across assign request entries
The Assign() function iterates over primary and fallback requests,
previously giving each its own 30s RetryWithBackoff budget. With a
primary + fallback, the total could reach 60s. Compute one deadline
up front and pass the remaining budget to each RetryWithBackoff call
so the entire Assign() call stays within a single 30s cap.
* master: strengthen context-cancel test with DeadlineExceeded and retry assertions
Assert errors.Is(err, context.DeadlineExceeded) to verify the error
is specifically from the context deadline, and check callCount > 1
to prove retries actually occurred before cancellation. Mirrors the
pattern used in TestAssignStopsOnContextCancel.
* master: bound GetMaster with per-attempt timeout in LookupVolumeIds
GetMaster() calls WaitUntilConnected() which can block indefinitely
if no master is available. Previously it used the outer ctx, so a
slow master resolution could consume the entire RetryWithBackoff
budget in a single attempt. Move the per-attempt timeoutCtx creation
before the GetMaster call so both master resolution and the gRPC
LookupVolume RPC share one grpcTimeout-bounded attempt.
* master: use deadline-aware context for assign retry budget
The shared 30s deadline only limited RetryWithBackoff's internal
wall-clock tracking, but per-attempt contexts were still derived
from the original ctx and could run for up to 10s even when the
budget was nearly exhausted. Create a deadlineCtx from the computed
deadline and derive both RetryWithBackoff and per-attempt timeouts
from it so all operations honor the shared 30s cap.
* master: skip warmup gate for empty lookup requests
When VolumeOrFileIds is empty, notFoundCount == len(req.VolumeOrFileIds)
evaluates to 0 == 0, which is true, so empty lookup batches during warmup
returned codes.Unavailable and were retried endlessly. Add a
len(req.VolumeOrFileIds) > 0 guard so empty requests pass through.
* master: validate request fields before warmup gate in Assign
Move Replication and Ttl parsing before the IsWarmingUp() check so
invalid inputs get a proper validation error instead of being masked
by codes.Unavailable during warmup. Pure syntactic validation does
not depend on topology state and should run first.
* master: check deadline and context before starting retry attempt
RetryWithBackoff only checked the deadline and context after an
attempt completed or during the sleep select. If the deadline
expired or context was canceled during sleep, the next iteration
would still call operation() before detecting it. Add pre-operation
checks so no new attempt starts after the budget is exhausted.
* master: always return ctx.Err() on context cancellation in RetryWithBackoff
When ctx.Err() is non-nil, the pre-operation check was returning
lastErr instead of ctx.Err(). This broke callers checking
errors.Is(err, context.DeadlineExceeded) and contradicted the
documented contract. Always return ctx.Err() so the cancellation
reason is properly surfaced.
* master: handle warmup errors in StreamAssign without killing the stream
StreamAssign was returning codes.Unavailable errors from Assign
directly, which terminates the gRPC stream and breaks pooled
connections. Instead, return transient errors as in-band error
responses so the stream survives warmup periods.
Also reset assignClient in doAssign on Send/Recv failures so a
broken stream doesn't leave the proxy permanently dead.
* master: wait for warmup before slot search in findAndGrow
findEmptySlotsForOneVolume was called before the warmup wait loop,
selecting slots from an incomplete topology. Move the warmup wait
before slot search so volume placement uses the fully warmed-up
topology with all servers registered.
* master: add Retry-After header to /dir/assign warmup response
The /dir/lookup handler already sets Retry-After during warmup but
/dir/assign did not, leaving HTTP clients without guidance on when
to retry. Add the same header using RemainingWarmupDuration().
* master: only seed warmup timestamp on leader at startup
SetLastLeaderChangeTime was called unconditionally for both leader
and follower nodes. Followers don't need warmup state, and the
leader change event listener handles real elections. Move the seed
into the IsLeader() block so only the startup leader gets warmup
initialized.
* master: preserve codes.Unavailable for StreamAssign warmup errors in doAssign
StreamAssign returns transient warmup errors as in-band
AssignResponse.Error messages. doAssign was converting these to plain
fmt.Errorf, losing the codes.Unavailable classification needed for
the caller's retry logic. Detect warmup error messages and wrap them
as status.Error(codes.Unavailable) so RetryWithBackoff can retry.
* Fix master leader election startup issue
Fixes #error-log-leader-not-selected-yet
* Fix master leader election startup issue
This change improves server address comparison using the 'Equals' method and handles recursion in topology leader lookup, resolving the 'leader not selected yet' error during master startup.
* Merge user improvements: use MaybeLeader for non-blocking checks
* not useful test
* Address code review: optimize Equals, fix deadlock in IsLeader, safe access in Leader
* Prevent split-brain: Persistent ClusterID and Join Validation
- Persist ClusterId in Raft store to survive restarts.
- Validate ClusterId on Raft command application (piggybacked on MaxVolumeId).
- Prevent masters with conflicting ClusterIds from joining/operating together.
- Update Telemetry to report the persistent ClusterId.
* Refine ClusterID validation based on feedback
- Improved error message in cluster_commands.go.
- Added ClusterId mismatch check in RaftServer.Recovery.
* Handle Raft errors and support Hashicorp Raft for ClusterId
- Check for errors when persisting ClusterId in legacy Raft.
- Implement ClusterId generation and persistence for Hashicorp Raft leader changes.
- Ensure consistent error logging.
* Refactor ClusterId validation
- Centralize ClusterId mismatch check in Topology.SetClusterId.
- Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId.
* Fix goroutine leak and add timeout
- Handle channel closure in Hashicorp Raft leader listener.
- Add timeout to Raft Apply call to prevent blocking.
* Fix deadlock in legacy Raft listener
- Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock).
* Rename ClusterId to SystemId
- Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry).
- Regenerated telemetry.pb.go with new field.
* Rename SystemId to TopologyId
- The rename to SystemId was an intermediate step.
- Final name is TopologyId for the persistent cluster identifier.
- Updated protobuf, topology, raft server, master server, and telemetry.
* Optimize Hashicorp Raft listener
- Integrated TopologyId generation into existing monitorLeaderLoop.
- Removed extra goroutine in master_server.go.
* Fix optimistic TopologyId update
- Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go.
- State is now solely updated via the Raft state machine Apply/Restore methods after consensus.
* Add explicit log for recovered TopologyId
- Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup.
* Add Raft barrier to prevent TopologyId race condition
- Implement ensureTopologyId helper method
- Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId
- Ensures persisted TopologyId is recovered before generating new one
- Prevents race where generation happens during log replay
* Serialize TopologyId generation with mutex
- Add topologyIdGenLock mutex to MasterServer struct
- Wrap ensureTopologyId method with lock to prevent concurrent generation
- Fixes race where event listener and manual leadership check both generate IDs
- Second caller waits for first to complete and sees the generated ID
* Add TopologyId recovery logging to Apply method
- Change log level from V(1) to V(0) for visibility
- Log 'Recovered TopologyId' when applying from Raft log
- Ensures recovery is visible whether from snapshot or log replay
- Matches Recovery() method logging for consistency
* Fix Raft barrier timing issue
- Add 100ms delay after barrier command to ensure log application completes
- Add debug logging to track barrier execution and TopologyId state
- Return early if barrier command fails
- Prevents TopologyId generation before old logs are fully applied
* ensure leader
* address comments
* address comments
* redundant
* clean up
* double check
* refactoring
* comment
`topology.Leader()` was using a backoff that typically
resulted in at least a 5s delay when initially starting
a master and raft server. This changes the backoff
algorithm to use exponential backoff starting with 100ms
and waiting up to 20s for leader selection.
Related to #4307
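The new wait is roughly this (a sketch, not the exact topology code):
```go
import (
	"fmt"
	"time"
)

// waitForLeader polls for an elected leader with exponential backoff,
// starting at 100ms and giving up after about 20s.
func waitForLeader(currentLeader func() (string, bool)) (string, error) {
	wait := 100 * time.Millisecond
	deadline := time.Now().Add(20 * time.Second)
	for time.Now().Before(deadline) {
		if leader, ok := currentLeader(); ok {
			return leader, nil
		}
		time.Sleep(wait)
		if wait *= 2; wait > 2*time.Second {
			wait = 2 * time.Second
		}
	}
	return "", fmt.Errorf("leader not selected yet")
}
```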
1. For a few seconds during master server startup there will be two leaders.
2. The raft server will elect a leader even when there is only one master, so there is no need to hard-code the server as the leader.