seaweedFS

Author	SHA1	Message	Date
Chris Lu	8cde3d4486	Add data file compaction to iceberg maintenance (Phase 2) (#8503)	2026-03-15 11:27:42 -07:00
* Add iceberg_maintenance plugin worker handler (Phase 1)
  Implement automated Iceberg table maintenance as a new plugin worker job type. The handler scans S3 table buckets for tables needing maintenance and executes operations in the correct Iceberg order: expire snapshots, remove orphan files, and rewrite manifests.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add data file compaction to iceberg maintenance handler (Phase 2)
  Implement bin-packing compaction for small Parquet data files:
  - Enumerate data files from manifests, group by partition
  - Merge small files using parquet-go (read rows, write merged output)
  - Create new manifest with ADDED/DELETED/EXISTING entries
  - Commit new snapshot with compaction metadata
  Add 'compact' operation to maintenance order (runs before expire_snapshots), configurable via target_file_size_bytes and min_input_files thresholds.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix memory exhaustion in mergeParquetFiles by processing files sequentially
  Previously all source Parquet files were loaded into memory simultaneously, risking OOM when a compaction bin contained many small files. Now each file is loaded, its rows are streamed into the output writer, and its data is released before the next file is loaded, keeping peak memory proportional to one input file plus the output buffer.
* Validate bucket/namespace/table names against path traversal
  Reject names containing '..', '/', or '\' in Execute to prevent directory traversal via crafted job parameters.
* Add filer address failover in iceberg maintenance handler
  Try each filer address from cluster context in order instead of only using the first one. This improves resilience when the primary filer is temporarily unreachable.
* Add separate MinManifestsToRewrite config for manifest rewrite threshold
  The rewrite_manifests operation was reusing MinInputFiles (meant for compaction bin file counts) as its manifest count threshold. Add a dedicated MinManifestsToRewrite field with its own config UI section and default value (5) so the two thresholds can be tuned independently.
* Fix risky mtime fallback in orphan removal that could delete new files
  When entry.Attributes is nil, mtime defaulted to Unix epoch (1970), which would always be older than the safety threshold, causing the file to be treated as eligible for deletion. Skip entries with nil Attributes instead, matching the safer logic in operations.go.
* Fix undefined function references in iceberg_maintenance_handler.go
  Use the exported function names (ShouldSkipDetectionByInterval, BuildDetectorActivity, BuildExecutorActivity) matching their definitions in vacuum_handler.go.
* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage
  The IcebergMaintenanceHandler and its compaction code in the parent pluginworker package duplicated the logic already present in the iceberg/ subpackage (which self-registers via init()). The old code lacked stale-plan guards, proper path normalization, CAS-based xattr updates, and error-returning parseOperations. Since the registry pattern (default "all") makes the old handler unreachable, remove it entirely. All functionality is provided by iceberg.Handler with the reviewed improvements.
* Fix MinManifestsToRewrite clamping to match UI minimum of 2
  The clamp reset values below 2 to the default of 5, contradicting the UI's advertised MinValue of 2. Clamp to 2 instead.
* Sort entries by size descending in splitOversizedBin for better packing
  Entries were processed in insertion order which is non-deterministic from map iteration. Sorting largest-first before the splitting loop improves bin packing efficiency by filling bins more evenly.
* Add context cancellation check to drainReader loop
  The row-streaming loop in drainReader did not check ctx between iterations, making long compaction merges uncancellable. Check ctx.Done() at the top of each iteration.
* Fix splitOversizedBin to always respect targetSize limit
  The minFiles check in the split condition allowed bins to grow past targetSize when they had fewer than minFiles entries, defeating the OOM protection. Now bins always split at targetSize, and a trailing runt with fewer than minFiles entries is merged into the previous bin.
* Add integration tests for iceberg table maintenance plugin worker
  Tests start a real weed mini cluster, create S3 buckets and Iceberg table metadata via filer gRPC, then exercise the iceberg.Handler operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against the live filer. A full maintenance cycle test runs all operations in sequence and verifies metadata consistency. Also adds exported method wrappers (testing_api.go) so the integration test package can call the unexported handler methods.
* Fix splitOversizedBin dropping files and add source path to drainReader errors
  The runt-merge step could leave leading bins with fewer than minFiles entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop the first 80-byte file). Replace the filter-based approach with an iterative merge that folds any sub-minFiles bin into its smallest neighbor, preserving all eligible files. Also add the source file path to drainReader error messages so callers can identify which Parquet file caused a read/write failure.
* Harden integration test error handling
  - s3put: fail immediately on HTTP 4xx/5xx instead of logging and continuing
  - lookupEntry: distinguish NotFound (return nil) from unexpected RPC errors (fail the test)
  - writeOrphan and orphan creation in FullMaintenanceCycle: check CreateEntryResponse.Error in addition to the RPC error
* go fmt
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Chris Lu	f3c5ba3cd6	feat(filer): add lazy directory listing for remote mounts (#8615)	2026-03-13 09:36:54 -07:00
* feat(filer): add lazy directory listing for remote mounts
  Directory listings on remote mounts previously only queried the local filer store. With lazy mounts the listing was empty; with eager mounts it went stale over time. Add on-demand directory listing that fetches from remote and caches results with a 5-minute TTL:
  - Add `ListDirectory` to `RemoteStorageClient` interface (delimiter-based, single-level listing, separate from recursive `Traverse`)
  - Implement in S3, GCS, and Azure backends using each platform's hierarchical listing API
  - Add `maybeLazyListFromRemote` to filer: before each directory listing, check if the directory is under a remote mount with an expired cache, fetch from remote, persist entries to the local store, then let existing listing logic run on the populated store
  - Use singleflight to deduplicate concurrent requests for the same directory
  - Skip local-only entries (no RemoteEntry) to avoid overwriting unsynced uploads
  - Errors are logged and swallowed (availability over consistency)
* refactor: extract xattr key to constant xattrRemoteListingSyncedAt
* feat: make listing cache TTL configurable per mount via listing_cache_ttl_seconds
  Add listing_cache_ttl_seconds field to RemoteStorageLocation protobuf. When 0 (default), lazy directory listing is disabled for that mount. When >0, enables on-demand directory listing with the specified TTL. Expose as -listingCacheTTL flag on remote.mount command.
* refactor: address review feedback for lazy directory listing
  - Add context.Context to ListDirectory interface and all implementations
  - Capture startTime before remote call for accurate TTL tracking
  - Simplify S3 ListDirectory using ListObjectsV2PagesWithContext
  - Make maybeLazyListFromRemote return void (errors always swallowed)
  - Remove redundant trailing-slash path manipulation in caller
  - Update tests to match new signatures
* When an existing entry has Remote != nil, we should merge remote metadata into it rather than replacing it.
* fix(gcs): wrap ListDirectory iterator error with context
  The raw iterator error was returned without bucket/path context, making it harder to debug. Wrap it consistently with the S3 pattern.
* fix(s3): guard against nil pointer dereference in Traverse and ListDirectory
  Some S3-compatible backends may return nil for LastModified, Size, or ETag fields. Check for nil before dereferencing to prevent panics.
* fix(filer): remove blanket 2-minute timeout from lazy listing context
  Individual SDK operations (S3, GCS, Azure) already have per-request timeouts and retry policies. The blanket timeout could cut off large directory listings mid-operation even though individual pages were succeeding.
* fix(filer): preserve trace context in lazy listing with WithoutCancel
  Use context.WithoutCancel(ctx) instead of context.Background() so trace/span values from the incoming request are retained for distributed tracing, while still decoupling cancellation.
* fix(filer): use Store.FindEntry for internal lookups, add Uid/Gid to files, fix updateDirectoryListingSyncedAt
  - Use f.Store.FindEntry instead of f.FindEntry for staleness check and child lookups to avoid unnecessary lazy-fetch overhead
  - Set OS_UID/OS_GID on new file entries for consistency with directories
  - In updateDirectoryListingSyncedAt, use Store.UpdateEntry for existing directories instead of CreateEntry to avoid deleteChunksIfNotNew and NotifyUpdateEvent side effects
* fix(filer): distinguish not-found from store errors in lazy listing
  Previously, any error from Store.FindEntry was treated as "not found," which could cause entry recreation/overwrite on transient DB failures. Now check for filer_pb.ErrNotFound explicitly and skip entries or bail out on real store errors.
* refactor(filer): use errors.Is for ErrNotFound comparisons
Peter Dodd	146a090754	filer: propagate lazy metadata deletes to remote mounts (#8522)	2026-03-12 13:16:28 -07:00
* filer: propagate lazy metadata deletes to remote mounts
  Delete operations now call the remote backend for mounted remote-only entries before removing filer metadata, keeping remote state aligned and preserving retry semantics on remote failures.
  Made-with: Cursor
* filer: harden remote delete metadata recovery
  Persist remote-delete metadata pendings so local entry removal can be retried after failures, and return explicit errors when remote client resolution fails to prevent silent local-only deletes.
  Made-with: Cursor
* filer: streamline remote delete client lookup and logging
  Avoid a redundant mount trie traversal by resolving the remote client directly from the matched mount location, and add parity logging for successful remote directory deletions.
  Made-with: Cursor
* filer: harden pending remote metadata deletion flow
  Retry pending-marker writes before local delete, fail closed when marking cannot be persisted, and start remote pending reconciliation only after the filer store is initialised to avoid nil store access.
  Made-with: Cursor
* filer: avoid lazy fetch in pending metadata reconciliation
  Use a local-only entry lookup during pending remote metadata reconciliation so cache misses do not trigger remote lazy fetches.
  Made-with: Cursor
* filer: serialise concurrent index read-modify-write in pending metadata deletion
  Add remoteMetadataDeletionIndexMu to Filer and acquire it for the full read→mutate→commit sequence in markRemoteMetadataDeletionPending and clearRemoteMetadataDeletionPending, preventing concurrent goroutines from overwriting each other's index updates.
  Made-with: Cursor
* filer: start remote deletion reconciliation loop in NewFiler
  Move the background goroutine for pending remote metadata deletion reconciliation from SetStore (where it was gated by sync.Once) to NewFiler alongside the existing loopProcessingDeletion goroutine.
  The sync.Once approach was problematic: it buried a goroutine launch as a side effect of a setter, was unrecoverable if the goroutine panicked, could race with store initialisation, and coupled its lifecycle to unrelated shutdown machinery. The existing nil-store guard in reconcilePendingRemoteMetadataDeletions handles the window before SetStore is called.
* filer: skip remote delete for replicated deletes from other filers
  When isFromOtherCluster is true the delete was already propagated to the remote backend by the originating filer. Repeating the remote delete on every replica doubles API calls, and a transient remote failure on the replica would block local metadata cleanup, leaving filers inconsistent.
* filer: skip pending marking for directory remote deletes
  Directory remote deletes are idempotent and do not need the pending/reconcile machinery that was designed for file deletes where the local metadata delete might fail after the remote object is already removed.
* filer: propagate remote deletes for children in recursive folder deletion
  doBatchDeleteFolderMetaAndData iterated child files but only called NotifyUpdateEvent and collected chunks; it never called maybeDeleteFromRemote for individual children. This left orphaned objects in the remote backend when a directory containing remote-only files was recursively deleted. Also fix isFromOtherCluster being hardcoded to false in the recursive call to doBatchDeleteFolderMetaAndData for subdirectories.
* filer: simplify pending remote deletion tracking to single index key
  Replace the double-bookkeeping scheme (individual KV entry per path + newline-delimited index key) with a single index key that stores paths directly. This removes the per-path KV writes/deletes, the base64 encoding round-trip, and the transaction overhead that was only needed to keep the two representations in sync.
* filer: address review feedback on remote deletion flow
  - Distinguish missing remote config from client initialization failure in maybeDeleteFromRemote error messages.
  - Use a detached context (30s timeout) for pending-mark and pending-clear KV writes so they survive request cancellation after the remote object has already been deleted.
  - Emit NotifyUpdateEvent in reconcilePendingRemoteMetadataDeletions after a successful retry deletion so downstream watchers and replicas learn about the eventual metadata removal.
* filer: remove background reconciliation for pending remote deletions
  The pending-mark/reconciliation machinery (KV index, mutex, background loop, detached contexts) handled the narrow case where the remote object was deleted but the subsequent local metadata delete failed. The client already receives the error and can retry; on retry the remote not-found is treated as success and the local delete proceeds normally. The added complexity (and new edge cases around NotifyUpdateEvent, multi-filer consistency during reconciliation, and context lifetime) is not justified for a transient store failure the caller already handles.
  Remove: loopProcessingRemoteMetadataDeletionPending, reconcilePendingRemoteMetadataDeletions, markRemoteMetadataDeletionPending, clearRemoteMetadataDeletionPending, listPendingRemoteMetadataDeletionPaths, encodePendingRemoteMetadataDeletionIndex, FindEntryLocal, and all associated constants, fields, and test infrastructure.
* filer: fix test stubs and add early exit on child remote delete error
  - Refactor stubFilerStore to release lock before invoking callbacks and propagate callback errors, preventing potential deadlocks in tests
  - Implement ListDirectoryPrefixedEntries with proper prefix filtering instead of delegating to the unfiltered ListDirectoryEntries
  - Add continue after setting err on child remote delete failure in doBatchDeleteFolderMetaAndData to skip further processing of the failed entry
* filer: propagate child remote delete error instead of silently continuing
  Replace `continue` with early `break` when maybeDeleteFromRemote fails for a child entry during recursive folder deletion. The previous `continue` skipped the error check at the end of the loop body, so a subsequent successful entry would overwrite err and the remote delete error was silently lost. Now the loop breaks, the existing error check returns the error, and NotifyUpdateEvent / chunk collection are correctly skipped for the failed entry.
* filer: delete remote file when entry has Remote pointer, not only when remote-only
  Replace IsInRemoteOnly() guard with entry.Remote == nil check in maybeDeleteFromRemote. IsInRemoteOnly() requires zero local chunks and RemoteSize > 0, which incorrectly skips remote deletion for cached files (local chunks exist) and zero-byte remote objects (RemoteSize 0). The correct condition is whether the entry has a remote backing object at all.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
Peter Dodd	16f2269a33	feat(filer): lazy metadata pulling (#8454)	2026-03-03 13:01:10 -08:00
* Add remote storage index for lazy metadata pull
  Introduces remoteStorageIndex, which maintains a map of filer directory to remote storage client/location, refreshed periodically from the filer's mount mappings. Provides lazyFetchFromRemote, ensureRemoteEntryInFiler, and isRemoteBacked on S3ApiServer as integration points for handler-level work in a follow-up PR. Nothing is wired into the server yet.
  Made-with: Cursor
* Add unit tests for remote storage index and wire field into S3ApiServer
  Adds tests covering isEmpty, findForPath (including longest-prefix resolution), and isRemoteBacked. Also removes a stray PR review annotation from the index file and adds the remoteStorageIdx field to S3ApiServer so the package compiles ahead of the wiring PR.
  Made-with: Cursor
* Address review comments on remote storage index
  - Use filer_pb.CreateEntry helper so resp.Error is checked, not just the RPC error
  - Extract keepPrev closure to remove duplicated error-handling in refresh loop
  - Add comment explaining availability-over-consistency trade-off on filer save failure
  Made-with: Cursor
* Move lazy metadata pull from S3 API to filer
  - Add maybeLazyFetchFromRemote in filer: on FindEntry miss, stat remote and CreateEntry when path is under a remote mount
  - Use singleflight for dedup; context guard prevents CreateEntry recursion
  - Availability-over-consistency: return in-memory entry if CreateEntry fails
  - Add longest-prefix test for nested mounts in remote_storage_test.go
  - Remove remoteStorageIndex, lazyFetchFromRemote, ensureRemoteEntryInFiler, doLazyFetch from s3api; filer now owns metadata operations
  - Add filer_lazy_remote_test.go with tests for hit, miss, not-found, CreateEntry failure, longest-prefix, and FindEntry integration
  Made-with: Cursor
* Address review: fix context guard test, add FindMountDirectory comment, remove dead code
  Made-with: Cursor
* Nitpicks: restore prev maker in registerStubMaker, instance-scope lazyFetchGroup, nil-check remoteEntry
  Made-with: Cursor
* Fix remotePath when mountDir is root: ensure relPath has leading slash
  Made-with: Cursor
* filer: decouple lazy-fetch persistence from caller context
  Use context.Background() inside the singleflight closure for CreateEntry so persistence is not cancelled when the winning request's context is cancelled. Fixes CreateEntry failing for all waiters when the first caller times out.
  Made-with: Cursor
* filer: remove redundant Mode bitwise OR with zero
  Made-with: Cursor
* filer: use bounded context for lazy-fetch persistence
  Replace context.Background() with context.WithTimeout(30s) and defer cancel() to prevent indefinite blocking and release resources.
  Made-with: Cursor
* filer: use checked type assertion for singleflight result
  Made-with: Cursor
* filer: rename persist context vars to avoid shadowing function parameter
  Made-with: Cursor
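
Commit 16f2269a33 mentions longest-prefix resolution for nested mounts and a fix that gives relPath a leading slash when mountDir is root. The sketch below illustrates both ideas using only the standard library. The names resolveMount and isUnder, and the plain []string mount list, are hypothetical and chosen for illustration; the filer's real lookup and its data structures may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveMount is a hypothetical helper, not SeaweedFS's actual API.
// It returns the longest mount directory containing path, plus the path
// relative to that mount, always with a leading slash.
func resolveMount(mounts []string, path string) (mountDir, relPath string, ok bool) {
	for _, m := range mounts {
		if !isUnder(m, path) {
			continue
		}
		// Prefer the most specific (longest) matching mount.
		if !ok || len(m) > len(mountDir) {
			mountDir, ok = m, true
		}
	}
	if !ok {
		return "", "", false
	}
	relPath = strings.TrimPrefix(path, mountDir)
	// With mountDir "/" (or path equal to mountDir) the trimmed remainder
	// has no leading slash, so add it back.
	if !strings.HasPrefix(relPath, "/") {
		relPath = "/" + relPath
	}
	return mountDir, relPath, true
}

// isUnder reports whether path equals dir or lies beneath it, matching on
// whole path segments so "/buckets/ab" is not treated as under "/buckets/a".
func isUnder(dir, path string) bool {
	if dir == "/" {
		return strings.HasPrefix(path, "/")
	}
	return path == dir || strings.HasPrefix(path, dir+"/")
}

func main() {
	mounts := []string{"/", "/buckets/a", "/buckets/a/nested"}
	for _, p := range []string{"/buckets/a/nested/file.txt", "/buckets/a/x", "/other/y"} {
		m, rel, ok := resolveMount(mounts, p)
		fmt.Printf("path=%s mount=%s rel=%s ok=%v\n", p, m, rel, ok)
	}
}
```

A linear scan is enough for a handful of mounts. Commit 146a090754 refers to a mount trie, which would be the natural structure if the number of mounts grows.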

4 Commits
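
Commit 8cde3d4486 describes rejecting bucket, namespace, and table names that contain '..', '/', or '\' to prevent directory traversal. A minimal sketch of such a check follows. The function name validateName, the error wording, and the extra empty-name check are assumptions made for illustration and are not taken from the handler's source.

```go
package main

import (
	"fmt"
	"strings"
)

// validateName is a hypothetical helper illustrating the check described in
// the commit message; the real function name and error text may differ.
// kind is a label such as "bucket", "namespace", or "table".
func validateName(kind, name string) error {
	// Assumption: empty names are rejected too (not stated in the commit).
	if name == "" {
		return fmt.Errorf("%s name is empty", kind)
	}
	// Reject parent-directory sequences and both path separators.
	if strings.Contains(name, "..") || strings.ContainsAny(name, `/\`) {
		return fmt.Errorf("invalid %s name %q: must not contain '..', '/', or '\\'", kind, name)
	}
	return nil
}

func main() {
	for _, n := range []string{"sales", "../etc", "a/b", `a\b`} {
		fmt.Printf("%q -> %v\n", n, validateName("table", n))
	}
}
```

Validating each name component before joining it into a filer path keeps the check independent of how the path is later normalized.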