Add iceberg_maintenance plugin worker handler (Phase 1) (#8501)

* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MaxSnapshotsToKeep

Use int64(wouldKeep) instead of int(config.MaxSnapshotsToKeep) to
avoid potential truncation on 32-bit platforms (CodeQL high severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MinInputFiles

Use int64(len(manifests)) instead of int(config.MinInputFiles) to
avoid potential truncation on 32-bit platforms (CodeQL high severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MaxCommitRetries

Clamp MaxCommitRetries to [1,20] range and keep as int64 throughout
the retry loop to avoid truncation on 32-bit platforms (CodeQL high
severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Sort snapshots explicitly by timestamp in expireSnapshots

The previous logic relied on implicit ordering of the snapshot list.
Now explicitly sorts snapshots by timestamp descending (most recent
first) and uses a simpler keep-count loop: keep the first
MaxSnapshotsToKeep newest snapshots plus the current snapshot
unconditionally, then expire the rest that exceed the retention window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle errors properly in listFilerEntries

Previously all errors from ListEntries and Recv were silently swallowed.
Now: treat "not found" errors as empty directory, propagate other
ListEntries errors, and check for io.EOF explicitly on Recv instead of
breaking on any error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix overly broad HasSuffix check in orphan detection

The bare strings.HasSuffix(ref, entry.Name) could match files with
similar suffixes (e.g. "123.avro" matching "snap-123.avro"). Replaced
with exact relPath match and a "/"-prefixed suffix check to avoid
false positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Replace fmt.Sscanf with strconv.Atoi in extractMetadataVersion

strconv.Atoi is more explicit and less fragile than fmt.Sscanf for
parsing a simple integer from a trimmed string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Recursively traverse directories for orphan file detection

The orphan cleanup only listed a single directory level under data/
and metadata/, skipping IsDirectory entries. Partitioned Iceberg
tables store data files in nested partition directories (e.g.
data/region=us-east/file.parquet) which were never evaluated.

Add walkFilerEntries helper that recursively descends into
subdirectories, and use it in removeOrphans so all nested files
are considered for orphan checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix manifest path drift from double time.Now() calls

rewriteManifests called time.Now().UnixMilli() twice: once for the
path embedded in WriteManifest and once for the filename passed to
saveFilerFile. These timestamps would differ, causing the manifest's
internal path reference to not match the actual saved filename.

Compute the filename once and reuse it for both WriteManifest and
saveFilerFile so they always reference the same path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add TestManifestRewritePathConsistency test

Verifies that WriteManifest returns a ManifestFile whose FilePath()
matches the path passed in, and that path.Base() of that path matches
the filename used for saveFilerFile. This validates the single-
timestamp pattern used in rewriteManifests produces consistent paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make parseOperations return error on unknown operations

Previously parseOperations silently dropped unknown operation names
and could return an empty list. Now validates inputs against the
canonical set and returns a clear error if any unknown operation is
specified. Updated Execute to surface the error instead of proceeding
with an empty operation list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use gRPC status codes instead of string matching in listFilerEntries

Replace brittle strings.Contains(err.Error(), "not found") check with
status.Code(err) == codes.NotFound for proper gRPC error handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add stale-plan guard in commit closures for expireSnapshots and rewriteManifests

Both operations plan outside the commit mutation using a snapshot ID
captured from the initial metadata read. If the table head advances
concurrently, the mutation would create a snapshot parented to the
wrong head or remove snapshots based on a stale view.

Add a guard inside each mutation closure that verifies
currentMeta.CurrentSnapshot().SnapshotID still matches the planned
snapshot ID. If it differs, return errStalePlan which propagates
immediately (not retried, since the plan itself is invalid).

Also fix rewriteManifests to derive SequenceNumber from the fresh
metadata (cs.SequenceNumber) instead of the captured currentSnap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add compare-and-swap to updateTableMetadataXattr

updateTableMetadataXattr previously re-read the entry but did not
verify the metadataVersion matched what commitWithRetry had loaded.
A concurrent update could be silently clobbered.

Now accepts expectedVersion parameter and compares it against the
stored metadataVersion before writing. Returns errMetadataVersionConflict
on mismatch, which commitWithRetry treats as retryable (deletes the
staged metadata file and retries with fresh state).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Export shared plugin worker helpers for use by sub-packages

Export ShouldSkipDetectionByInterval, BuildExecutorActivity, and
BuildDetectorActivity so the iceberg sub-package can reuse them
without duplicating logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Refactor iceberg maintenance handler into weed/plugin/worker/iceberg package

Split the 1432-line iceberg_maintenance_handler.go into focused files
in a new iceberg sub-package: handler.go, config.go, detection.go,
operations.go, filer_io.go, and compact.go (Phase 2 data compaction).

Key changes:
- Rename types to drop stutter (IcebergMaintenanceHandler → Handler, etc.)
- Fix loadFileByIcebergPath to preserve nested directory paths via
  normalizeIcebergPath instead of path.Base which dropped subdirectories
- Check SendProgress errors instead of discarding them
- Add stale-plan guard to compactDataFiles commitWithRetry closure
- Add "compact" operation to parseOperations canonical order
- Duplicate readStringConfig/readInt64Config helpers (~20 lines)
- Update worker_runtime.go to import new iceberg sub-package

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove iceberg_maintenance from default plugin worker job types

Iceberg maintenance is not yet ready to be enabled by default.
Workers can still opt in by explicitly listing iceberg_maintenance
in their job types configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Clamp config values to safe minimums in ParseConfig

Prevents misconfiguration by enforcing minimum values using the
default constants for all config fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Harden filer I/O: path helpers, strict CAS guard, path traversal prevention

- Use path.Dir/path.Base instead of strings.SplitN in loadCurrentMetadata
- Make CAS guard error on missing or unparseable metadataVersion
- Add path.Clean and traversal validation in loadFileByIcebergPath

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix compact: single snapshot ID, oversized bin splitting, ensureFilerDir

- Use single newSnapID for all manifest entries in a compaction run
- Add splitOversizedBin to break bins exceeding targetSize
- Make ensureFilerDir only create on NotFound, propagate other errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add wildcard filters, scan limit, and context cancellation to table scanning

- Use wildcard matchers (*, ?) for bucket/namespace/table filters
- Add limit parameter to scanTablesForMaintenance for early termination
- Add ctx.Done() checks in bucket and namespace scan loops
- Update filter UI descriptions and placeholders for wildcard support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove dead detection interval check and validate namespace parameter

- Remove ineffective ShouldSkipDetectionByInterval call with hardcoded 0
- Add namespace to required parameter validation in Execute

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Improve operations: exponential backoff, orphan matching, full file cleanup

- Use exponential backoff (50ms, 100ms, 200ms, ...) in commitWithRetry
- Use normalizeIcebergPath for orphan matching instead of fragile suffix check
- Add collectSnapshotFiles to traverse manifest lists → manifests → data files
- Delete all unreferenced files after expiring snapshots, not just manifest lists
- Refactor removeOrphans to reuse collectSnapshotFiles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* iceberg: fix ensureFilerDir to handle filer_pb.ErrNotFound sentinel

filer_pb.LookupEntry converts gRPC NotFound errors to filer_pb.ErrNotFound
(a plain sentinel), so status.Code() never returns codes.NotFound for that
error. This caused ensureFilerDir to return an error instead of creating
the directory when it didn't exist.

* iceberg: clean up orphaned artifacts when compaction commit fails

Track all files written during compaction (merged data files, manifest,
manifest list) and delete them if the commit or any subsequent write step
fails, preventing orphaned files from accumulating in the filer.

* iceberg: derive tablePath from namespace/tableName when empty

An empty table_path parameter would be passed to maintenance operations
unchecked. Default it to path.Join(namespace, tableName) when not provided.

* iceberg: make collectSnapshotFiles return error on read/parse failure

Previously, errors reading manifests were logged and skipped, returning a
partial reference set. This could cause incorrect delete decisions during
snapshot expiration or orphan cleanup. Now the function returns an error
and all callers abort when reference data is incomplete.

* iceberg: include active metadata file in removeOrphans referenced set

The metadataFileName returned by loadCurrentMetadata was discarded, so
the active metadata file could be incorrectly treated as an orphan and
deleted. Capture it and add it to the referencedFiles map.

* iceberg: only retry commitWithRetry on metadata version conflicts

Previously all errors from updateTableMetadataXattr triggered retries.
Now only errMetadataVersionConflict causes retry; other errors (permissions,
transport, malformed xattr) fail immediately.

* iceberg: respect req.Limit in fakeFilerServer.ListEntries mock

The mock ListEntries ignored the Limit field, so tests couldn't exercise
pagination. Now it stops streaming once Limit entries have been sent.

* iceberg: validate parquet schema compatibility before merging files

mergeParquetFiles now compares each source file's schema against the
first file's schema and aborts with a clear error if they differ, instead
of blindly writing rows that could panic or produce corrupt output.

* iceberg: normalize empty JobType to canonical jobType in Execute events

When request.Job.JobType is empty, status events and completion messages
were emitted with a blank job type. Derive a canonical value early and
use it consistently in all outbound events.

* iceberg: log warning on unexpected config value types in read helpers

readStringConfig and readInt64Config now log a V(1) warning when they
encounter an unhandled ConfigValue kind, aiding debugging of unexpected
config types that silently fall back to defaults.

* worker: add iceberg_maintenance to default plugin worker job types

Workers using the default job types list didn't advertise the
iceberg_maintenance handler despite the handler and canonical name
being registered. Add it so workers pick up the handler by default.

* iceberg: use defer and detached context for compaction artifact cleanup

The cleanup closure used the job context which could already be canceled,
and was not called on ctx.Done() early exits. Switch to a deferred
cleanup with a detached context (30s timeout) so artifact deletion
completes on all exit paths including context cancellation.

* iceberg: use proportional jitter in commitWithRetry backoff

Fixed 25ms max jitter becomes insignificant at higher retry attempts.
Use 0-20% of the current backoff value instead so jitter scales with
the exponential delay.

* iceberg: add malformed filename cases to extractMetadataVersion test

Cover edge cases like "invalid.metadata.json", "metadata.json", "",
and "v.metadata.json" to ensure the function returns 0 for unparseable
inputs.

* iceberg: fail compaction on manifest read errors and skip delete manifests

Previously, unreadable manifests were silently skipped during compaction,
which could drop live files from the entry set. Now manifest read/parse
errors are returned as fatal errors.

Also abort compaction when delete manifests exist since the compactor
does not apply deletes — carrying them through unchanged could produce
incorrect results.

* iceberg: use table-relative path for active metadata file in orphan scan

metadataFileName was stored as a basename (e.g. "v1.metadata.json") but
the orphan scanner matches against table-relative paths like
"metadata/v1.metadata.json". Prefix with "metadata/" so the active
metadata file is correctly recognized as referenced.

* iceberg: fix MetadataBuilderFromBase location to use metadata file path

The second argument to MetadataBuilderFromBase records the previous
metadata file in the metadata log. Using meta.Location() (the table
root) was incorrect — it must be the actual metadata file path so
old metadata files can be tracked and eventually cleaned up.

* iceberg: update metadataLocation and versionToken in xattr on commit

updateTableMetadataXattr was only updating metadataVersion,
modifiedAt, and fullMetadata but not metadataLocation or
versionToken. This left catalog state inconsistent after
maintenance commits — the metadataLocation still pointed to the
old metadata file and the versionToken was stale.

Add a newMetadataLocation parameter and regenerate the
versionToken on every commit, matching the S3 Tables handler
behavior.

* iceberg: group manifest entries by partition spec in rewriteManifests

rewriteManifests was writing all entries into a single manifest
using the table's current partition spec. For spec-evolved tables
where manifests reference different partition specs, this produces
an invalid manifest.

Group entries by the source manifest's PartitionSpecID and write
one merged manifest per spec, looking up each spec from the
table's PartitionSpecs list.

* iceberg: remove dead code loop for non-data manifests in compaction

The early abort guard at the top of compactDataFiles already ensures
no delete manifests are present. The loop that copied non-data
manifests into allManifests was unreachable dead code.

* iceberg: use JSON encoding in partitionKey for unambiguous grouping

partitionKey used fmt.Sprintf("%d=%v") joined by commas, which
produces ambiguous keys when partition values contain commas or '='.
Use json.Marshal for values and NUL byte as separator to eliminate
collisions.

* iceberg: precompute normalized reference set in removeOrphans

The orphan check was O(files × refs) because it normalized each
reference path inside the per-file loop. Precompute the normalized
set once for O(1) lookups per candidate file.

* iceberg: add artifact cleanup to rewriteManifests on commit failure

rewriteManifests writes merged manifests and a manifest list to
the filer before committing but did not clean them up on failure.
Add the same deferred cleanup pattern used by compactDataFiles:
track written artifacts and delete them if the commit does not
succeed.

* iceberg: pass isDeleteData=true in deleteFilerFile

deleteFilerFile called DoRemove with isDeleteData=false, which only
removed filer metadata and left chunk data behind on volume servers.
All other data-file deletion callers in the codebase pass true.

* iceberg: clean up test: remove unused snapID, simplify TestDetectWithFakeFiler

Remove unused snapID variable and eliminate the unnecessary second
fake filer + entry copy in TestDetectWithFakeFiler by capturing
the client from the first startFakeFiler call.

* fix: update TestWorkerDefaultJobTypes to expect 5 job types

The test expected 4 default job types but iceberg_maintenance was
added as a 5th default in a previous commit.

* iceberg: document client-side CAS TOCTOU limitation in updateTableMetadataXattr

Add a note explaining the race window where two workers can both
pass the version check and race at UpdateEntry. The proper fix
requires server-side precondition support on UpdateEntryRequest.

* iceberg: remove unused sender variable in TestFullExecuteFlow

* iceberg: abort compaction when multiple partition specs are present

The compactor writes all entries into a single manifest using the
current partition spec, which is invalid for spec-evolved tables.
Detect multiple PartitionSpecIDs and skip compaction until
per-spec compaction is implemented.

* iceberg: validate tablePath to prevent directory traversal

Sanitize the table_path parameter with path.Clean and verify it
matches the expected namespace/tableName prefix to prevent path
traversal attacks via crafted job parameters.

* iceberg: cap retry backoff at 5s and make it context-aware

The exponential backoff could grow unbounded and blocked on
time.Sleep ignoring context cancellation. Cap at 5s and use
a timer with select on ctx.Done so retries respect cancellation.

* iceberg: write manifest list with new snapshot identity in rewriteManifests

The manifest list was written with the old snapshot's ID and sequence
number, but the new snapshot created afterwards used a different
identity. Compute newSnapshotID and newSeqNum before writing
manifests and the manifest list so all artifacts are consistent.

* ec: also remove .vif file in removeEcVolumeFiles

removeEcVolumeFiles cleaned up .ecx, .ecj, and shard files but
not the .vif volume info file, leaving it orphaned. The .vif file
lives in the data directory alongside shard files.

The directory handling for index vs data files was already correct:
.ecx/.ecj are removed from IdxDirectory and shard files from
Directory, matching how NewEcVolume loads them.

Revert "ec: also remove .vif file in removeEcVolumeFiles"

This reverts commit acc82449e12a00115268a5652aef0d6c46d9f2dd.

* iceberg: skip orphan entries with nil Attributes instead of defaulting to epoch

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
making unknown-age entries appear ancient and eligible for deletion.
Skip these entries instead to avoid deleting files whose age cannot
be determined.

* iceberg: use unique metadata filenames to prevent concurrent write clobbering

Add timestamp nonce to metadata filenames (e.g. v3-1709766000.metadata.json)
so concurrent writers stage to distinct files. Update extractMetadataVersion
to strip the nonce suffix, and loadCurrentMetadata to read the actual filename
from the metadataLocation xattr field.

* iceberg: defer artifact tracking until data file builder succeeds

Move the writtenArtifacts append to after NewDataFileBuilder succeeds,
so a failed builder doesn't leave a stale entry for an already-deleted
file in the cleanup list.

* iceberg: use detached context for metadata file cleanup

Use context.WithTimeout(context.Background(), 10s) when deleting staged
metadata files after CAS failure, so cleanup runs even if the original
request context is canceled.

* test: update default job types count to include iceberg_maintenance

* iceberg: use parquet.EqualNodes for structural schema comparison

Replace String()-based schema comparison with parquet.EqualNodes which
correctly compares types, repetition levels, and logical types.

* iceberg: add nonce-suffixed filename cases to TestExtractMetadataVersion

* test: assert iceberg_maintenance is present in default job types

* iceberg: validate operations config early in Detect

Call parseOperations in Detect so typos in the operations config fail
fast before emitting proposals, matching the validation already done
in Execute.

* iceberg: detect chunked files in loadFileByIcebergPath

Return an explicit error when a file has chunks but no inline content,
rather than silently returning empty data. Data files uploaded via S3
are stored as chunks, so compaction would otherwise produce corrupt
merged files.

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

This commit is contained in:

Chris Lu

2026-03-07 12:45:33 -08:00

committed by

GitHub

parent d89eb8267f

commit 72c2c7ef8b

16 changed files with 3976 additions and 47 deletions

									
										8

weed/command/plugin_worker_test.go
									
												View File
												
				@@ -6,6 +6,7 @@ import (

					"net/http/httptest"

					"os"

					"path/filepath"

					"slices"

					"strings"

					"testing"

				@@ -138,8 +139,11 @@ func TestPluginWorkerDefaultJobTypes(t *testing.T) {

					if err != nil {

						t.Fatalf("parsePluginWorkerJobTypes(default setting) err = %v", err)

					}

					if len(jobTypes) != 4 {

						t.Fatalf("expected default job types to include 4 handlers, got %v", jobTypes)

					if len(jobTypes) != 5 {

						t.Fatalf("expected default job types to include 5 handlers, got %v", jobTypes)

					}

					if !slices.Contains(jobTypes, "iceberg_maintenance") {

						t.Fatalf("expected iceberg_maintenance in default job types, got %v", jobTypes)

					}

				}

									
										7

weed/command/worker_runtime.go
									
												View File
												
				@@ -15,6 +15,7 @@ import (

					"github.com/prometheus/client_golang/prometheus/promhttp"

					"github.com/seaweedfs/seaweedfs/weed/glog"

					pluginworker "github.com/seaweedfs/seaweedfs/weed/plugin/worker"

					icebergworker "github.com/seaweedfs/seaweedfs/weed/plugin/worker/iceberg"

					"github.com/seaweedfs/seaweedfs/weed/security"

					statsCollect "github.com/seaweedfs/seaweedfs/weed/stats"

					"github.com/seaweedfs/seaweedfs/weed/util"

				@@ -23,7 +24,7 @@ import (

					"google.golang.org/grpc"

				)

				const defaultPluginWorkerJobTypes = "vacuum,volume_balance,erasure_coding,admin_script"

				const defaultPluginWorkerJobTypes = "vacuum,volume_balance,erasure_coding,admin_script,iceberg_maintenance"

				type pluginWorkerRunOptions struct {

					AdminServer string

				@@ -158,6 +159,8 @@ func buildPluginWorkerHandler(jobType string, dialOption grpc.DialOption, maxExe

						return pluginworker.NewErasureCodingHandler(dialOption, workingDir), nil

					case "admin_script":

						return pluginworker.NewAdminScriptHandler(dialOption), nil

					case "iceberg_maintenance":

						return icebergworker.NewHandler(dialOption), nil

					default:

						return nil, fmt.Errorf("unsupported plugin job type %q", canonicalJobType)

					}

				@@ -224,6 +227,8 @@ func canonicalPluginWorkerJobType(jobType string) (string, error) {

						return "erasure_coding", nil

					case "admin_script", "admin-script", "admin.script", "script", "admin":

						return "admin_script", nil

					case "iceberg_maintenance", "iceberg-maintenance", "iceberg.maintenance", "iceberg":

						return "iceberg_maintenance", nil

					default:

						return "", fmt.Errorf("unsupported plugin job type %q", jobType)

					}

									
										4

weed/command/worker_test.go
									
												View File
												
				@@ -7,7 +7,7 @@ func TestWorkerDefaultJobTypes(t *testing.T) {

					if err != nil {

						t.Fatalf("parsePluginWorkerJobTypes(default worker flag) err = %v", err)

					}

					if len(jobTypes) != 4 {

						t.Fatalf("expected default worker job types to include 4 handlers, got %v", jobTypes)

					if len(jobTypes) != 5 {

						t.Fatalf("expected default worker job types to include 5 handlers, got %v", jobTypes)

					}

				}

Add iceberg_maintenance plugin worker handler (Phase 1) (#8501)

8 weed/command/plugin_worker_test.go Unescape Escape View File

7 weed/command/worker_runtime.go Unescape Escape View File

4 weed/command/worker_test.go Unescape Escape View File

8

weed/command/plugin_worker_test.go

View File

7

weed/command/worker_runtime.go

View File

4

weed/command/worker_test.go

View File