Commit Graph

8438 Commits

Author SHA1 Message Date
Peter Dodd
0e570d6a8f feat(remote.mount): add -metadataStrategy flag to control metadata caching (#8568)
* feat(remote): add -noSync flag to skip upfront metadata pull on mount

Made-with: Cursor

* refactor(remote): split mount setup from metadata sync

Extract ensureMountDirectory for create/validate; call pullMetadata
directly when sync is needed. Caller controls sync step for -noSync.

Made-with: Cursor

* fix(remote): validate mount root when -noSync so bad bucket/creds fail fast

When -noSync is used, perform a cheap remote check (ListBuckets and
verify bucket exists) instead of skipping all remote I/O. Invalid
buckets or credentials now fail at mount time.

Made-with: Cursor

* test(remote): add TestRemoteMountNoSync for -noSync mount and persisted mapping

Made-with: Cursor

* test(remote): assert no upfront metadata after -noSync mount

After remote.mount -noSync, run fs.ls on the mount dir and assert empty
listing so the test fails if pullMetadata was invoked eagerly.

Made-with: Cursor

* fix(remote): propagate non-ErrNotFound lookup errors in ensureMountDirectory

Return lookupErr immediately for any LookupDirectoryEntry failure that
is not filer_pb.ErrNotFound, so only the not-found case creates the
entry and other lookup failures are reported to the caller.

Made-with: Cursor

* fix(remote): use errors.Is for ErrNotFound in ensureMountDirectory

Replace fragile strings.Contains(lookupErr.Error(), ...) with
errors.Is(lookupErr, filer_pb.ErrNotFound) before calling CreateEntry.

Made-with: Cursor

* fix(remote): use LookupEntry so ErrNotFound is recognised after gRPC

Raw gRPC LookupDirectoryEntry returns a status error, not the sentinel,
so errors.Is(lookupErr, filer_pb.ErrNotFound) was always false. Use
filer_pb.LookupEntry which normalises not-found to ErrNotFound so the
mount directory is created when missing.

Made-with: Cursor

* test(remote): ignore weed shell banner in TestRemoteMountNoSync fs.ls count

Exclude master/filer and prompt lines from entry count so the assertion
checks only actual fs.ls output for empty -noSync mount.

Made-with: Cursor

* fix(remote.mount): use 0755 for mount dir, document bucket-less early return

Made-with: Cursor

* feat(remote.mount): replace -noSync with -metadataStrategy=lazy|eager

- Add -metadataStrategy flag (eager default, lazy skips upfront metadata pull)
- Accept lazy/eager case-insensitively; reject invalid values with clear error
- Rename TestRemoteMountNoSync to TestRemoteMountMetadataStrategyLazy
- Add TestRemoteMountMetadataStrategyEager and TestRemoteMountMetadataStrategyInvalid

Made-with: Cursor

* fix(remote.mount): validate strategy and remote before creating mount directory

Move strategy validation and validateMountRoot (lazy path) before
ensureMountDirectory so that invalid strategies or bad bucket/credentials
fail without leaving orphaned directory entries in the filer.

* refactor(remote.mount): remove unused remote param from ensureMountDirectory

The remote *RemoteStorageLocation parameter was left over from the old
syncMetadata signature. Only remoteConf.Name is used inside the function.

* doc(remote.mount): add TODO for HeadBucket-style validation

validateMountRoot currently lists all buckets to verify one exists.
Note the need for a targeted BucketExists method in the interface.

* refactor(remote.mount): use MetadataStrategy type and constants

Replace raw string comparisons with a MetadataStrategy type and
MetadataStrategyEager/MetadataStrategyLazy constants for clarity
and compile-time safety.

* refactor(remote.mount): rename MetadataStrategy to MetadataCacheStrategy

More precisely describes the purpose: controlling how metadata is
cached from the remote, not metadata handling in general.

* fix(remote.mount): remove validateMountRoot from lazy path

Lazy mount's purpose is to skip remote I/O. Validating via ListBuckets
contradicts that, especially on accounts with many buckets. Invalid
buckets or credentials will surface on first lazy access instead.

* fix(test): handle shell exit 0 in TestRemoteMountMetadataStrategyInvalid

The weed shell process exits with code 0 even when individual commands
fail — errors appear in stdout. Check output instead of requiring a
non-nil error.

* test(remote.mount): remove metadataStrategy shell integration tests

These tests only verify string output from a shell process that always
exits 0 — they cannot meaningfully validate eager vs lazy behavior
without a real remote backend.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-12 15:21:07 -07:00
Peter Dodd
146a090754 filer: propagate lazy metadata deletes to remote mounts (#8522)
* filer: propagate lazy metadata deletes to remote mounts

Delete operations now call the remote backend for mounted remote-only entries before removing filer metadata, keeping remote state aligned and preserving retry semantics on remote failures.

Made-with: Cursor

* filer: harden remote delete metadata recovery

Persist remote-delete metadata pendings so local entry removal can be retried after failures, and return explicit errors when remote client resolution fails to prevent silent local-only deletes.

Made-with: Cursor

* filer: streamline remote delete client lookup and logging

Avoid a redundant mount trie traversal by resolving the remote client directly from the matched mount location, and add parity logging for successful remote directory deletions.

Made-with: Cursor

* filer: harden pending remote metadata deletion flow

Retry pending-marker writes before local delete, fail closed when marking cannot be persisted, and start remote pending reconciliation only after the filer store is initialised to avoid nil store access.

Made-with: Cursor

* filer: avoid lazy fetch in pending metadata reconciliation

Use a local-only entry lookup during pending remote metadata reconciliation so cache misses do not trigger remote lazy fetches.

Made-with: Cursor

* filer: serialise concurrent index read-modify-write in pending metadata deletion

Add remoteMetadataDeletionIndexMu to Filer and acquire it for the full
read→mutate→commit sequence in markRemoteMetadataDeletionPending and
clearRemoteMetadataDeletionPending, preventing concurrent goroutines
from overwriting each other's index updates.

Made-with: Cursor

* filer: start remote deletion reconciliation loop in NewFiler

Move the background goroutine for pending remote metadata deletion
reconciliation from SetStore (where it was gated by sync.Once) to
NewFiler alongside the existing loopProcessingDeletion goroutine.

The sync.Once approach was problematic: it buried a goroutine launch
as a side effect of a setter, was unrecoverable if the goroutine
panicked, could race with store initialisation, and coupled its
lifecycle to unrelated shutdown machinery. The existing nil-store
guard in reconcilePendingRemoteMetadataDeletions handles the window
before SetStore is called.

* filer: skip remote delete for replicated deletes from other filers

When isFromOtherCluster is true the delete was already propagated to
the remote backend by the originating filer. Repeating the remote
delete on every replica doubles API calls, and a transient remote
failure on the replica would block local metadata cleanup — leaving
filers inconsistent.

* filer: skip pending marking for directory remote deletes

Directory remote deletes are idempotent and do not need the
pending/reconcile machinery that was designed for file deletes where
the local metadata delete might fail after the remote object is
already removed.

* filer: propagate remote deletes for children in recursive folder deletion

doBatchDeleteFolderMetaAndData iterated child files but only called
NotifyUpdateEvent and collected chunks — it never called
maybeDeleteFromRemote for individual children. This left orphaned
objects in the remote backend when a directory containing remote-only
files was recursively deleted.

Also fix isFromOtherCluster being hardcoded to false in the recursive
call to doBatchDeleteFolderMetaAndData for subdirectories.

* filer: simplify pending remote deletion tracking to single index key

Replace the double-bookkeeping scheme (individual KV entry per path +
newline-delimited index key) with a single index key that stores paths
directly. This removes the per-path KV writes/deletes, the base64
encoding round-trip, and the transaction overhead that was only needed
to keep the two representations in sync.

* filer: address review feedback on remote deletion flow

- Distinguish missing remote config from client initialization failure
  in maybeDeleteFromRemote error messages.
- Use a detached context (30s timeout) for pending-mark and
  pending-clear KV writes so they survive request cancellation after
  the remote object has already been deleted.
- Emit NotifyUpdateEvent in reconcilePendingRemoteMetadataDeletions
  after a successful retry deletion so downstream watchers and replicas
  learn about the eventual metadata removal.

* filer: remove background reconciliation for pending remote deletions

The pending-mark/reconciliation machinery (KV index, mutex, background
loop, detached contexts) handled the narrow case where the remote
object was deleted but the subsequent local metadata delete failed.
The client already receives the error and can retry — on retry the
remote not-found is treated as success and the local delete proceeds
normally. The added complexity (and new edge cases around
NotifyUpdateEvent, multi-filer consistency during reconciliation, and
context lifetime) is not justified for a transient store failure the
caller already handles.

Remove: loopProcessingRemoteMetadataDeletionPending,
reconcilePendingRemoteMetadataDeletions, markRemoteMetadataDeletionPending,
clearRemoteMetadataDeletionPending, listPendingRemoteMetadataDeletionPaths,
encodePendingRemoteMetadataDeletionIndex, FindEntryLocal, and all
associated constants, fields, and test infrastructure.

* filer: fix test stubs and add early exit on child remote delete error

- Refactor stubFilerStore to release lock before invoking callbacks and
  propagate callback errors, preventing potential deadlocks in tests
- Implement ListDirectoryPrefixedEntries with proper prefix filtering
  instead of delegating to the unfiltered ListDirectoryEntries
- Add continue after setting err on child remote delete failure in
  doBatchDeleteFolderMetaAndData to skip further processing of the
  failed entry

* filer: propagate child remote delete error instead of silently continuing

Replace `continue` with early `break` when maybeDeleteFromRemote fails
for a child entry during recursive folder deletion. The previous
`continue` skipped the error check at the end of the loop body, so a
subsequent successful entry would overwrite err and the remote delete
error was silently lost. Now the loop breaks, the existing error check
returns the error, and NotifyUpdateEvent / chunk collection are
correctly skipped for the failed entry.

* filer: delete remote file when entry has Remote pointer, not only when remote-only

Replace IsInRemoteOnly() guard with entry.Remote == nil check in
maybeDeleteFromRemote. IsInRemoteOnly() requires zero local chunks and
RemoteSize > 0, which incorrectly skips remote deletion for cached
files (local chunks exist) and zero-byte remote objects (RemoteSize 0).
The correct condition is whether the entry has a remote backing object
at all.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-12 13:16:28 -07:00
Chris Lu
92a76fc1a2 fix(filer): limit concurrent proxy reads per volume server (#8608)
* fix(filer): limit concurrent proxy reads per volume server

Add a per-volume-server semaphore (default 16) to proxyToVolumeServer
to prevent replication bursts from overwhelming individual volume
servers with hundreds of concurrent connections, which causes them
to drop connections with "unexpected EOF".

Excess requests queue up and respect the client's context, returning
503 if the client disconnects while waiting.

Also log io.CopyBuffer errors that were previously silently discarded.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(filer): use non-blocking release for proxy semaphore

Prevents a goroutine from blocking forever if releaseProxySemaphore
is ever called without a matching acquire.

* test(filer): clean up proxySemaphores entries in all proxy tests

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-11 23:32:09 -07:00
Chris Lu
b665c329bc fix(replication): resume partial chunk reads on EOF instead of re-downloading (#8607)
* fix(replication): resume partial chunk reads on EOF instead of re-downloading

When replicating chunks and the source connection drops mid-transfer,
accumulate the bytes already received and retry with a Range header
to fetch only the remaining bytes. This avoids re-downloading
potentially large chunks from scratch on each retry, reducing load
on busy source servers and speeding up recovery.

* test(replication): add tests for downloadWithRange including gzip partial reads

Tests cover:
- No offset (no Range header sent)
- With offset (Range header verified)
- Content-Disposition filename extraction
- Partial read + resume: server drops connection mid-transfer, client
  resumes with Range from the offset of received bytes
- Gzip partial read + resume: first response is gzip-encoded (Go auto-
  decompresses), connection drops, resume request gets decompressed data
  (Go doesn't add Accept-Encoding when Range is set, so the server
  decompresses), combined bytes match original

* fix(replication): address PR review comments

- Consolidate downloadWithRange into DownloadFile with optional offset
  parameter (variadic), eliminating code duplication (DRY)
- Validate HTTP response status: require 206 + correct Content-Range
  when offset > 0, reject when server ignores Range header
- Use if/else for fullData assignment for clarity
- Add test for rejected Range (server returns 200 instead of 206)

* refactor(replication): remove unused ReplicationSource interface

The interface was never referenced and its signature didn't match
the actual FilerSource.ReadPart method.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-11 22:38:22 -07:00
Chris Lu
e4a77b8b16 feat(admin): support env var and security.toml for credentials (#8606)
* feat(security): add [admin] section to security.toml scaffold

Add admin credential fields (user, password, readonly.user,
readonly.password) to security.toml. Via viper's WEED_ env prefix and
AutomaticEnv(), these are automatically overridable as WEED_ADMIN_USER,
WEED_ADMIN_PASSWORD, etc.

Ref: https://github.com/seaweedfs/seaweedfs/discussions/8586

* feat(admin): support env var and security.toml fallbacks for credentials

Add applyViperFallback() to read admin credentials from security.toml /
WEED_* environment variables when CLI flags are not explicitly set.
This allows systems like NixOS to pass secrets via env vars instead of
CLI flags, which appear in process listings.

Precedence: CLI flag > env var / security.toml > default value.

Also change -adminUser default from "admin" to "" so that credentials
are fully opt-in.

Ref: https://github.com/seaweedfs/seaweedfs/discussions/8586

* feat(helm): use WEED_ env vars for admin credentials instead of CLI flags

Rename SEAWEEDFS_ADMIN_USER/PASSWORD to WEED_ADMIN_USER/PASSWORD so
viper picks them up natively. Remove -adminUser/-adminPassword shell
expansion from command args since the Go binary now reads these
directly via viper.

* docs(admin): document env var and security.toml credential support

Add environment variable mapping table, security.toml example, and
precedence rules to the admin README.

* style(security): use nested [admin.readonly] table in security.toml

Use a nested TOML table instead of dotted keys for the readonly
credentials. More idiomatic and easier to read; no change in how
Viper parses it.

* fix(admin): use util.GetViper() for env var support and fix README example

applyViperFallback() was using viper.GetString() directly, which
bypasses the WEED_ env prefix and AutomaticEnv setup that only
happens in util.GetViper(). Switch to util.GetViper().GetString()
so WEED_ADMIN_* environment variables are actually picked up.

Also fix the README example to include WEED_ADMIN_USER alongside
WEED_ADMIN_PASSWORD, since runAdmin() rejects an empty username
when a password is set.

* fix(admin): restore default adminUser to "admin"

Defaulting adminUser to "" broke the common flow of setting only
WEED_ADMIN_PASSWORD — runAdmin() rejects an empty username when a
password is set. Restore "admin" as the default so that setting
only the password works out of the box.

* docs(admin): align README security.toml example with scaffold format

Use nested [admin.readonly] table instead of flat dotted keys to
match the format in weed/command/scaffold/security.toml.

* docs(admin): remove README.md in favor of wiki page

Admin documentation lives at the wiki (Admin-UI.md). Remove the
in-repo README to avoid maintaining duplicate docs.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-11 17:40:24 -07:00
Copilot
013362d2d3 fix(shell): show planned size in fs.mergeVolumes log to clarify size limit check (#8553)
The log message was comparing against the planned size of the destination
volume (including volumes already planned to merge into it) but only
displaying the raw volume size, making the output confusing when the
displayed sizes clearly didn't add up to exceed the limit.
2026-03-11 13:56:13 -07:00
Chris Lu
8ac4caf930 fix(s3api): return no-encryption instead of error when bucket metadata is missing
When getEncryptionConfiguration encounters a not-found error (e.g.,
during bucket recreation after a partial delete), return
ErrNoSuchBucketEncryptionConfiguration instead of ErrInternalError.
This prevents uploads from failing with 500 errors during recovery.
2026-03-11 13:49:21 -07:00
Chris Lu
ab85f46529 fix(s3api): clear negative cache in autoCreateBucket when bucket exists
When autoCreateBucket finds the bucket already exists, remove it from
the negative cache so subsequent requests don't unnecessarily trigger
another auto-create attempt.
2026-03-11 13:49:15 -07:00
Chris Lu
5208c7c727 fix(s3api): improve PutBucketHandler comment for orphaned collection recovery
Clarify the comment and log message for the case where a collection
exists but the bucket directory is missing, explaining the root cause
(partial deletion) more precisely.
2026-03-11 13:49:11 -07:00
Chris Lu
12b360f499 fix(s3api): delete bucket directory before collection to prevent inconsistent state
Reorder DeleteBucketHandler to remove the bucket directory first, then
delete the collection. If collection deletion fails, the bucket is still
effectively deleted and can be recreated. Previously, if directory
deletion succeeded but collection deletion failed, the bucket was left
in an unrecoverable state.
2026-03-11 13:49:03 -07:00
Chris Lu
d1a631123f fix(s3api): allow bucket recreation when orphaned collection exists (#8605)
* fix(s3api): allow bucket recreation when orphaned collection exists (#8601)

When a bucket is deleted, its filer directory is removed but the
underlying collection/volumes may not be fully cleaned up yet. If the
bucket is immediately recreated, PutBucketHandler was returning
ErrBucketAlreadyExists due to the orphaned collection, blocking bucket
recreation and causing subsequent uploads to fail with InternalError.

Allow bucket creation to proceed when a collection exists without a
corresponding bucket directory, since this is a transient orphaned state
from a previous deletion.

* fix(s3api): handle concurrent bucket creation race in mkdir

On mkdir failure, re-check whether the bucket directory now exists
and return BucketAlreadyExists instead of InternalError when another
request created the bucket concurrently.
2026-03-11 13:42:43 -07:00
Chris Lu
b799650357 fix(shell): set LastLocalSyncTsNs in remote.copy.local so remote.uncache works (#8604)
remote.uncache checks LastLocalSyncTsNs to determine if a file has been
synced to remote. remote.copy.local was not setting this field, leaving
it at 0, which caused uncache to skip all files uploaded via
remote.copy.local.

Fixes #8602
2026-03-11 12:55:45 -07:00
Chris Lu
2ff4a07544 Reduce task logger glog noise and remove per-write fsync (#8603)
* Reduce task logger noise: stop duplicating every log entry to glog and stderr

Every task log entry was being tripled: written to the task log file,
forwarded to glog (which writes to /tmp by default with no rotation),
and echoed to stderr. This caused glog files to fill /tmp on long-running
workers.

- Remove INFO/DEBUG forwarding to glog (only ERROR/WARNING remain)
- Remove stderr echo of every log line
- Remove fsync on every single log write (unnecessary for log files)

* Fix glog call depth for correct source file attribution

The call stack is: caller → Error() → log() → writeLogEntry() →
glog.ErrorDepth(), so depth=4 is needed for glog to report the
original caller's file and line number.
2026-03-11 12:42:18 -07:00
Chris Lu
4a5243886a 4.17 2026-03-11 02:29:24 -07:00
Chris Lu
e1e4c9437a fix(s3api): ListObjects with trailing-slash prefix matches sibling directories (#8599)
fix(s3api): ListObjects with trailing-slash prefix returns wrong results

When ListObjectsV2 is called with a prefix ending in "/" (e.g., "foo/"),
normalizePrefixMarker strips the trailing slash and splits into
dir="parent" and prefix="foo". The filer then lists entries matching
prefix "foo", which returns both directory "foo" and "foo1000".

The prefixEndsOnDelimiter guard correctly identifies directory "foo" as
the target and recurses into it, but then resets the guard to false.
The loop continues and incorrectly recurses into "foo1000" as well,
causing the listing to return objects from unrelated directories.

Fix: after recursing into the exact directory targeted by the
trailing-slash prefix, return immediately from the listing loop.
There is no reason to process sibling entries since the original
prefix specifically targeted one directory.
2026-03-11 02:28:34 -07:00
Chris Lu
f950a941e3 Fix trust policy validation for specific AWS user principals (#8597)
* Add tests for AWS user principal in AssumeRole trust policies

Add test cases that verify trust policy validation when using specific
AWS user principals (e.g., "arn:aws:iam::000000000000:user/backend")
in the Principal field of trust policies for AssumeRole.

Covers single user, multiple users (array), wildcard, and plain string
principal formats. These tests demonstrate the bug reported in #8588
where specific user principals always fail validation.

* Populate RequestContext in ValidateTrustPolicyForPrincipal

ValidateTrustPolicyForPrincipal was creating an EvaluationContext with
a nil RequestContext. The policy engine's principal matching logic looks
up "aws:PrincipalArn" in RequestContext for non-wildcard principals,
so specific user ARNs like "arn:aws:iam::000000000000:user/backend"
always failed to match, while wildcard "*" worked because it
short-circuits before the lookup.

Populate RequestContext with both "principal" and "aws:PrincipalArn"
keys, consistent with how IsActionAllowed already does it.

Fixes #8588

* Remove GitHub discussion URL from source code comments

* Add specific error message assertions in trust policy tests
2026-03-10 21:19:40 -07:00
Chris Lu
ac579c1746 Fix plugin configuration tab layout overflow (#8596)
Fix plugin configuration tab layout overflow (#8587)

Remove h-100 from Job Scheduling Settings card, which caused it to
stretch to 100% of the row height and push the Next Run card below
the row boundary, overflowing into the Detection Results section.
2026-03-10 19:19:42 -07:00
Chris Lu
0a5c5ed4ce Persist S3 bucket counter metrics across idle periods (#8595)
* Stop deleting counter metrics during bucket TTL cleanup

Counter metrics (traffic bytes, request counts, object counts) are
monotonically increasing by design. Deleting them after 10 minutes of
bucket inactivity causes them to vanish from /metrics output and reset
to zero when traffic resumes, breaking Prometheus rate()/increase()
queries and making historical traffic reporting impossible.

Only delete gauges and histograms in the TTL cleanup loop, as these
represent current state and are safely re-populated on next activity.

Fixes https://github.com/seaweedfs/seaweedfs/issues/8521

* Clean up all bucket metrics on bucket deletion

Add DeleteBucketMetrics() to delete all metrics (including counters)
for a bucket when it is explicitly deleted. This prevents unbounded
label cardinality from accumulating for buckets that no longer exist.

Called from DeleteBucketHandler after successful bucket deletion.

* Reduce mutex scope in bucket metrics TTL sweep

Collect expired bucket names under the lock, then release before
calling DeletePartialMatch on Prometheus metrics. This prevents
RecordBucketActiveTime from blocking during the expensive cleanup.
2026-03-10 19:00:40 -07:00
Chris Lu
0a2dac1e56 Reduce mutex scope in bucket metrics TTL sweep
Collect expired bucket names under the lock, then release before
calling DeletePartialMatch on Prometheus metrics. This prevents
RecordBucketActiveTime from blocking during the expensive cleanup.
2026-03-10 18:43:35 -07:00
Chris Lu
737116e83c fix port probing 2026-03-10 18:30:19 -07:00
Chris Lu
47cad59c70 Remove misleading Workers sub-menu items from admin sidebar (#8594)
* Remove misleading Workers sub-menu items from admin sidebar

The sidebar sub-items (Job Detection, Job Queue, Job Execution,
Configuration) always navigated to the first job type's tabs
(typically EC Encoding) rather than showing cross-job-type views.
This was confusing as noted in #8590. Since the in-page tabs already
provide this navigation, remove the redundant sidebar sub-items and
keep only the top-level Workers link.

Fixes #8590

* Update layout_templ.go
2026-03-10 15:55:14 -07:00
Chris Lu
b17e2b411a Add dynamic timeouts to plugin worker vacuum gRPC calls (#8593)
* add dynamic timeouts to plugin worker vacuum gRPC calls

All vacuum gRPC calls used context.Background() with no deadline,
so the plugin scheduler's execution timeout could kill a job while
a large volume compact was still in progress. Use volume-size-scaled
timeouts matching the topology vacuum approach: 3 min/GB for compact,
1 min/GB for check, commit, and cleanup.

Fixes #8591

* scale scheduler execution timeout by volume size

The scheduler's per-job execution timeout (default 240s) would kill
vacuum jobs on large volumes before they finish. Three changes:

1. Vacuum detection now includes estimated_runtime_seconds in job
   proposals, computed as 5 min/GB of volume size.

2. The scheduler checks for estimated_runtime_seconds in job
   parameters and uses it as the execution timeout when larger than
   the default — a generic mechanism any handler can use.

3. Vacuum task gRPC calls now use the passed-in ctx as parent
   instead of context.Background(), so scheduler cancellation
   propagates to in-flight RPCs.

* extend job type runtime when proposals need more time

The JobTypeMaxRuntime (default 30 min) wraps both detection and
execution. Its context is the parent of all per-job execution
contexts, so even with per-job estimated_runtime_seconds, jobCtx
would cancel everything when it expires.

After detection, scan proposals for the maximum
estimated_runtime_seconds. If any proposal needs more time than
the remaining JobTypeMaxRuntime, create a new execution context
with enough headroom. This lets large vacuum jobs complete without
being killed by the job type deadline while still respecting the
configured limit for normal-sized jobs.

* log missing volume size metric, remove dead minimum runtime guard

Add a debug log in vacuumTimeout when t.volumeSize is 0 so
operators can investigate why metrics are missing for a volume.

Remove the unreachable estimatedRuntimeSeconds < 180 check in
buildVacuumProposal — volumeSizeGB always >= 1 (due to +1 floor),
so estimatedRuntimeSeconds is always >= 300.

* cap estimated runtime and fix status check context

- Cap maxEstimatedRuntime and per-job timeout overrides to 8 hours
  to prevent unbounded timeouts from bad metrics.
- Check execCtx.Err() instead of jobCtx.Err() for status reporting,
  since dispatch runs under execCtx which may have a longer deadline.
  A successful dispatch under execCtx was misreported as "timeout"
  when jobCtx had expired.
2026-03-10 13:48:42 -07:00
Chris Lu
4c88fbfd5e Fix nil pointer crash during concurrent vacuum compaction (#8592)
* check for nil needle map before compaction sync

When CommitCompact runs concurrently, it sets v.nm = nil under
dataFileAccessLock. CompactByIndex does not hold that lock, so
v.nm.Sync() can hit a nil pointer. Add an early nil check to
return an error instead of crashing.

Fixes #8591

* guard copyDataBasedOnIndexFile size check against nil needle map

The post-compaction size validation at line 538 accesses
v.nm.ContentSize() and v.nm.DeletedSize(). If CommitCompact has
concurrently set v.nm to nil, this causes a SIGSEGV. Skip the
validation when v.nm is nil since the actual data copy uses local
needle maps (oldNm/newNm) and is unaffected.

Fixes #8591

* use atomic.Bool for compaction flags to prevent concurrent vacuum races

The isCompacting and isCommitCompacting flags were plain bools
read and written from multiple goroutines without synchronization.
This allowed concurrent vacuums on the same volume to pass the
guard checks and run simultaneously, leading to the nil pointer
crash. Using atomic.Bool with CompareAndSwap ensures only one
compaction or commit can run per volume at a time.

Fixes #8591

* use go-version-file in CI workflows instead of hardcoded versions

Use go-version-file: 'go.mod' so CI automatically picks up the Go
version from go.mod, avoiding future version drift. Reordered
checkout before setup-go in go.yml and e2e.yml so go.mod is
available. Removed the now-unused GO_VERSION env vars.

* capture v.nm locally in CompactByIndex to close TOCTOU race

A bare nil check on v.nm followed by v.nm.Sync() has a race window
where CommitCompact can set v.nm = nil between the two. Snapshot
the pointer into a local variable so the nil check and Sync operate
on the same reference.

* add dynamic timeouts to plugin worker vacuum gRPC calls

All vacuum gRPC calls used context.Background() with no deadline,
so the plugin scheduler's execution timeout could kill a job while
a large volume compact was still in progress. Use volume-size-scaled
timeouts matching the topology vacuum approach: 3 min/GB for compact,
1 min/GB for check, commit, and cleanup.

Fixes #8591

* Revert "add dynamic timeouts to plugin worker vacuum gRPC calls"

This reverts commit 80951934c37416bc4f6c1472a5d3f8d204a637d9.

* unify compaction lifecycle into single atomic flag

Replace separate isCompacting and isCommitCompacting flags with a
single isCompactionInProgress atomic.Bool. This ensures CompactBy*,
CommitCompact, Close, and Destroy are mutually exclusive — only one
can run at a time per volume.

Key changes:
- All entry points use CompareAndSwap(false, true) to claim exclusive
  access. CompactByVolumeData and CompactByIndex now also guard v.nm
  and v.DataBackend with local captures.
- Close() waits for the flag outside dataFileAccessLock to avoid
  deadlocking with CommitCompact (which holds the flag while waiting
  for the lock). It claims the flag before acquiring the lock so no
  new compaction can start.
- Destroy() uses CAS instead of a racy Load check, preventing
  concurrent compaction from racing with volume teardown.
- unmountVolumeByCollection no longer deletes from the map;
  DeleteCollectionFromDiskLocation removes entries only after
  successful Destroy, preventing orphaned volumes on failure.

Fixes #8591
2026-03-10 13:31:45 -07:00
Chris Lu
d4d2e511ed for mini, default to bind all 2026-03-10 00:56:40 -07:00
Chris Lu
d89a78d9e3 reduce logs 2026-03-09 22:42:03 -07:00
Chris Lu
00000ec006 Update s3_buckets_templ.go 2026-03-09 22:41:07 -07:00
Chris Lu
1bd7a98a4a simplify plugin scheduler: remove configurable IdleSleepSeconds, use constant 61s
The SchedulerConfig struct and its persistence/API were unnecessary
indirection. Replace with a simple constant (reduced from 613s to 61s)
so the scheduler re-checks for detectable job types promptly after
going idle, improving the clean-install experience.
2026-03-09 22:41:03 -07:00
Chris Lu
8ad58e7002 4.16 2026-03-09 21:52:43 -07:00
Chris Lu
cf3693651c fix: add IdxFileSize check to pre-delete volume verification
The verification step checked DatFileSize and FileCount but not
IdxFileSize, leaving a gap in the copy validation before source
deletion.
2026-03-09 19:33:02 -07:00
Chris Lu
5f85bf5e8a Batch volume balance: run multiple moves per job (#8561)
* proto: add BalanceMoveSpec and batch fields to BalanceTaskParams

Add BalanceMoveSpec message for encoding individual volume moves,
and max_concurrent_moves + repeated moves fields to BalanceTaskParams
to support batching multiple volume moves in a single job.

* balance handler: add batch execution with concurrent volume moves

Refactor Execute() into executeSingleMove() (backward compatible) and
executeBatchMoves() which runs multiple volume moves concurrently using
a semaphore-bounded goroutine pool. When BalanceTaskParams.Moves is
populated, the batch path is taken; otherwise the single-move path.

Includes aggregate progress reporting across concurrent moves,
per-move error collection, and partial failure support.

* balance handler: add batch config fields to Descriptor and worker config

Add max_concurrent_moves and batch_size fields to the worker config
form and deriveBalanceWorkerConfig(). These control how many volume
moves run concurrently within a batch job and the maximum batch size.

* balance handler: group detection proposals into batch jobs

When batch_size > 1, the Detect method groups detection results into
batch proposals where each proposal encodes multiple BalanceMoveSpec
entries in BalanceTaskParams.Moves. Single-result batches fall back
to the existing single-move proposal format for backward compatibility.

* admin UI: add volume balance execution plan and batch badge

Add renderBalanceExecutionPlan() for rich rendering of volume balance
jobs in the job detail modal. Single-move jobs show source/target/volume
info; batch jobs show a moves table with all volume moves.

Add batch badge (e.g., "5 moves") next to job type in the execution
jobs table when the job has batch=true label.

* Update plugin_templ.go

* fix: detection algorithm uses greedy target instead of divergent topology scores

The detection loop tracked effective volume counts via an adjustments map,
but createBalanceTask independently called planBalanceDestination which used
the topology's LoadCount — a separate, unadjusted source of truth. This
divergence caused multiple moves to pile onto the same server.

Changes:
- Add resolveBalanceDestination to resolve the detection loop's greedy
  target (minServer) rather than independently picking a destination
- Add oscillation guard: stop when max-min <= 1 since no single move
  can improve the balance beyond that point
- Track unseeded destinations: if a target server wasn't in the initial
  serverVolumeCounts, add it so subsequent iterations include it
- Add TestDetection_UnseededDestinationDoesNotOverload

* fix: handler force_move propagation, partial failure, deterministic dedupe

- Propagate ForceMove from outer BalanceTaskParams to individual move
  TaskParams so batch moves respect the force_move flag
- Fix partial failure: mark job successful if at least one move
  succeeded (succeeded > 0 || failed == 0) to avoid re-running
  already-completed moves on retry
- Use SHA-256 hash for deterministic dedupe key fallback instead of
  time.Now().UnixNano() which is non-deterministic
- Remove unused successDetails variable
- Extract maxProposalStringLength constant to replace magic number 200

* admin UI: use template literals in balance execution plan rendering

* fix: integration test handles batch proposals from batched detection

With batch_size=20, all moves are grouped into a single proposal
containing BalanceParams.Moves instead of top-level Sources/Targets.
Update assertions to handle both batch and single-move proposal formats.

* fix: verify volume size on target before deleting source during balance

Add a pre-delete safety check that reads the volume file status on both
source and target, then compares .dat file size and file count. If they
don't match, the move is aborted — leaving the source intact rather than
risking irreversible data loss.

Also removes the redundant mountVolume call since VolumeCopy already
mounts the volume on the target server.

* fix: clamp maxConcurrent, serialize progress sends, validate config as int64

- Clamp maxConcurrentMoves to defaultMaxConcurrentMoves before creating
  the semaphore so a stale or malicious job cannot request unbounded
  concurrent volume moves
- Extend progressMu to cover sender.SendProgress calls since the
  underlying gRPC stream is not safe for concurrent writes
- Perform bounds checks on max_concurrent_moves and batch_size in int64
  space before casting to int, avoiding potential overflow on 32-bit

* fix: check disk capacity in resolveBalanceDestination

Skip disks where VolumeCount >= MaxVolumeCount so the detection loop
does not propose moves to a full disk that would fail at execution time.

* test: rename unseeded destination test to match actual behavior

The test exercises a server with 0 volumes that IS seeded from topology
(matching disk type), not an unseeded destination. Rename to
TestDetection_ZeroVolumeServerIncludedInBalance and fix comments.

* test: tighten integration test to assert exactly one batch proposal

With default batch_size=20, all moves should be grouped into a single
batch proposal. Assert len(proposals)==1 and require BalanceParams with
Moves, removing the legacy single-move else branch.

* fix: propagate ctx to RPCs and restore source writability on abort

- All helper methods (markVolumeReadonly, copyVolume, tailVolume,
  readVolumeFileStatus, deleteVolume) now accept a context parameter
  instead of using context.Background(), so Execute's ctx propagates
  cancellation and timeouts into every volume server RPC
- Add deferred cleanup that restores the source volume to writable if
  any step after markVolumeReadonly fails, preventing the source from
  being left permanently readonly on abort
- Add markVolumeWritable helper using VolumeMarkWritableRequest

* fix: deep-copy protobuf messages in test recording sender

Use proto.Clone in recordingExecutionSender to store immutable snapshots
of JobProgressUpdate and JobCompleted, preventing assertions from
observing mutations if the handler reuses message pointers.

* fix: add VolumeMarkWritable and ReadVolumeFileStatus to fake volume server

The balance task now calls ReadVolumeFileStatus for pre-delete
verification and VolumeMarkWritable to restore writability on abort.
Add both RPCs to the test fake, and drop the mountCalls assertion since
BalanceTask no longer calls VolumeMount directly (VolumeCopy handles it).

* fix: use maxConcurrentMovesLimit (50) for clamp, not defaultMaxConcurrentMoves

defaultMaxConcurrentMoves (5) is the fallback when the field is unset,
not an upper bound. Clamping to it silently overrides valid config
values like 10/20/50. Introduce maxConcurrentMovesLimit (50) matching
the descriptor's MaxValue and clamp to that instead.

* fix: cancel batch moves on progress stream failure

Derive a cancellable batchCtx from the caller's ctx. If
sender.SendProgress returns an error (client disconnect, context
cancelled), capture it, skip further sends, and cancel batchCtx so
in-flight moves abort via their propagated context rather than running
blind to completion.

* fix: bound cleanup timeout and validate batch move fields

- Use a 30-second timeout for the deferred markVolumeWritable cleanup
  instead of context.Background() which can block indefinitely if the
  volume server is unreachable
- Validate required fields (VolumeID, SourceNode, TargetNode) before
  appending moves to a batch proposal, skipping invalid entries
- Fall back to a single-move proposal when filtering leaves only one
  valid move in a batch

* fix: cancel task execution on SendProgress stream failure

All handler progress callbacks previously ignored SendProgress errors,
allowing tasks to continue executing after the client disconnected.
Now each handler creates a derived cancellable context and cancels it
on the first SendProgress error, stopping the in-flight task promptly.

Handlers fixed: erasure_coding, vacuum, volume_balance (single-move),
and admin_script (breaks command loop on send failure).

* fix: validate batch moves before scheduling in executeBatchMoves

Reject empty batches, enforce a hard upper bound (100 moves), and
filter out nil or incomplete move specs (missing source/target/volume)
before allocating progress tracking and launching goroutines.

* test: add batch balance execution integration test

Tests the batch move path with 3 volumes, max concurrency 2, using
fake volume servers. Verifies all moves complete with correct readonly,
copy, tail, and delete RPC counts.

* test: add MarkWritableCount and ReadFileStatusCount accessors

Expose the markWritableCalls and readFileStatusCalls counters on the
fake volume server, following the existing MarkReadonlyCount pattern.

* fix: oscillation guard uses global effective counts for heterogeneous capacity

The oscillation guard (max-min <= 1) previously used maxServer/minServer
which are determined by utilization ratio. With heterogeneous capacity,
maxServer by utilization can have fewer raw volumes than minServer,
producing a negative diff and incorrectly triggering the guard.

Now scans all servers' effective counts to find the true global max/min
volume counts, so the guard works correctly regardless of whether
utilization-based or raw-count balancing is used.

* fix: admin script handler breaks outer loop on SendProgress failure

The break on SendProgress error inside the shell.Commands scan only
exited the inner loop, letting the outer command loop continue
executing commands on a broken stream. Use a sendBroken flag to
propagate the break to the outer execCommands loop.
2026-03-09 19:30:08 -07:00
Chris Lu
b991acf634 fix: paginate bucket listing in Admin UI to show all buckets (#8585)
* fix: paginate bucket listing in Admin UI to show all buckets

The Admin UI's GetS3Buckets() had a hardcoded Limit of 1000 in the
ListEntries request, causing the Total Buckets count to cap at 1000
even when more buckets exist. This adds pagination to iterate through
all buckets by continuing from the last entry name when a full page
is returned.

Fixes seaweedfs/seaweedfs#8564

* feat: add server-side pagination and sorting to S3 buckets page

Add pagination controls, page size selector, and sortable column
headers to the Admin UI's Object Store buckets page, following the
same pattern used by the Cluster Volumes page. This ensures the UI
remains responsive with thousands of buckets.

- Add CurrentPage, TotalPages, PageSize, SortBy, SortOrder to S3BucketsData
- Accept page/pageSize/sortBy/sortOrder query params in ShowS3Buckets handler
- Sort buckets by name, owner, created, objects, logical/physical size
- Paginate results server-side (default 100 per page)
- Add pagination nav, page size dropdown, and sort indicators to template

* Update s3_buckets_templ.go

* Update object_store_users_templ.go

* fix: use errors.Is(err, io.EOF) instead of string comparison

Replace brittle err.Error() == "EOF" string comparison with idiomatic
errors.Is(err, io.EOF) for checking stream end in bucket listing.

* fix: address PR review findings for bucket pagination

- Clamp page to totalPages when page exceeds total, preventing empty
  results with misleading pagination state
- Fix sort comparator to use explicit ascending/descending comparisons
  with a name tie-breaker, satisfying strict weak ordering for sort.Slice
- Capture SnapshotTsNs from first ListEntries response and pass it to
  subsequent requests for consistent pagination across pages
- Replace non-focusable <th onclick> sort headers with <a> tags and
  reuse getSortIcon, matching the cluster_volumes accessibility pattern
- Change exportBucketList() to fetch all buckets from /api/s3/buckets
  instead of scraping DOM rows (which now only contain the current page)
2026-03-09 18:55:47 -07:00
Chris Lu
02d3e3195c Update object_store_users_templ.go 2026-03-09 18:34:58 -07:00
Chris Lu
470075dd90 admin/balance: fix Max Volumes display and balancer source selection (#8583)
* admin: fix Max Volumes column always showing 0

GetClusterVolumeServers() computed DiskCapacity from
diskInfo.MaxVolumeCount but never populated the MaxVolumes field
on the VolumeServer struct, causing the column to always display 0.

* balance: use utilization ratio for source server selection

The balancer selected the source server (to move volumes FROM) by raw
volume count. In clusters with heterogeneous MaxVolumeCount settings,
the server with the highest capacity naturally holds the most volumes
and was always picked as the source, even when it had the lowest
utilization ratio.

Change source selection and imbalance calculation to use utilization
ratio (effectiveCount / maxVolumeCount) so servers are compared by how
full they are relative to their capacity, not by absolute volume count.

This matches how destination scoring already works via
calculateBalanceScore().
2026-03-09 18:34:11 -07:00
Lars Lehtonen
f8b7357350 weed/server: fix dropped error (#8584)
* weed/server: fix dropped error

* Removed the redundant check.

---------

Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-09 18:04:12 -07:00
Chris Lu
6dab90472b admin: fix access key creation UX (#8579)
* admin: remove misleading "secret key only shown once" warning

The access key details modal already allows viewing both the access key
and secret key at any time, so the warning about the secret key only
being displayed once is incorrect and misleading.

* admin: allow specifying custom access key and secret key

Add optional access_key and secret_key fields to the create access key
API. When provided, the specified keys are used instead of generating
random ones. The UI now shows a form with optional fields when creating
a new key, with a note that leaving them blank auto-generates keys.

* admin: check access key uniqueness before creating

Access keys must be globally unique across all users since S3 auth
looks them up in a single global map. Add an explicit check using
GetUserByAccessKey before creating, so the user gets a clear error
("access key is already in use") rather than a generic store error.

* Update object_store_users_templ.go

* admin: address review feedback for access key creation

Handler:
- Use decodeJSONBody/newJSONMaxReader instead of raw json.Decode to
  enforce request size limits and handle malformed JSON properly
- Return 409 Conflict for duplicate access keys, 400 Bad Request for
  validation errors, instead of generic 500

Backend:
- Validate access key length (4-128 chars) and secret key length
  (8-128 chars) when user-provided

Frontend:
- Extract resetCreateKeyForm() helper to avoid duplicated cleanup logic
- Wire resetCreateKeyForm to accessKeysModal hidden.bs.modal event so
  form state is always cleared when modal is dismissed
- Change secret key input to type="password" with a visibility toggle

* admin: guard against nil request and handle GetUserByAccessKey errors

- Add nil check for the CreateAccessKeyRequest pointer before
  dereferencing, defaulting to an empty request (auto-generate both
  keys).
- Handle non-"not found" errors from GetUserByAccessKey explicitly
  instead of silently proceeding, so store errors (e.g. db connection
  failures) surface rather than being swallowed.

* Update object_store_users_templ.go

* admin: fix access key uniqueness check with gRPC store

GetUserByAccessKey returns a gRPC NotFound status error (not the
sentinel credential.ErrAccessKeyNotFound) when using the gRPC store,
causing the uniqueness check to fail with a spurious error.

Treat the lookup as best-effort: only reject when a user is found
(err == nil). Any error (not-found via any store, connectivity issues)
falls through to the store's own CreateAccessKey which enforces
uniqueness definitively.

* admin: fix error handling and input validation for access key creation

Backend:
- Remove access key value from the duplicate-key error message to avoid
  logging the caller-supplied identifier.

Handler:
- Handle empty POST body (io.EOF) as a valid request that auto-generates
  both keys, instead of rejecting it as malformed JSON.
- Return 404 for "not found" errors (e.g. non-existent user) instead of
  collapsing them into a 500.

Frontend:
- Add minlength/maxlength attributes matching backend constraints
  (access key 4-128, secret key 8-128).
- Call reportValidity() before submitting so invalid lengths are caught
  client-side without a round trip.

* admin: use sentinel errors and fix GetUserByAccessKey error handling

Backend (user_management.go):
- Define sentinel errors (ErrAccessKeyInUse, ErrUserNotFound,
  ErrInvalidInput) and wrap them in returned errors so callers can use
  errors.Is.
- Handle GetUserByAccessKey errors properly: check the sentinel
  credential.ErrAccessKeyNotFound first, then fall back to string
  matching for stores (gRPC) that return non-sentinel not-found errors.
  Surface unexpected errors instead of silently proceeding.

Handler (user_handlers.go):
- Replace fragile strings.Contains error matching with errors.Is
  against the new dash sentinels.

Frontend (object_store_users.templ):
- Add double-submit guard (isCreatingKey flag + button disabling) to
  prevent duplicate access key creation requests.
2026-03-09 14:03:41 -07:00
Chris Lu
f8d783f80e fix: ListObjectVersions interleave Version and DeleteMarker in sort order (#8567)
* fix: ListObjectVersions interleave Version and DeleteMarker in sort order

Go's default xml.Marshal serializes struct fields in definition order,
causing all <Version> elements to appear before all <DeleteMarker>
elements. The S3 API contract requires these elements to be interleaved
in the correct global sort order (by key ascending, then newest version
first within each key).

This broke clients that validate version list ordering within a single
key — an older Version would appear before a newer DeleteMarker for the
same object.

Fix: Replace the separate Versions/DeleteMarkers/CommonPrefixes arrays
with a single Entries []VersionListEntry slice. Each VersionListEntry
uses a per-element MarshalXML that outputs the correct XML tag name
(<Version>, <DeleteMarker>, or <CommonPrefixes>) based on which field
is populated. Since the entries are already in their correct sorted
order from buildSortedCombinedList, the XML output is automatically
interleaved correctly.

Also removes the unused ListObjectVersionsResult struct.

Note: The reporter also mentioned a cross-key timestamp ordering issue
when paginating with max-keys=1, but that is correct S3 behavior —
ListObjectVersions sorts by key name (ascending), not by timestamp.
Different keys having non-monotonic timestamps is expected.

* test: add CommonPrefixes XML marshaling coverage for ListObjectVersions

* fix: validate VersionListEntry has exactly one field set in MarshalXML

Return an error instead of silently emitting an empty <Version> element
when no field (or multiple fields) are populated. Also clean up the
misleading xml:"Version" struct tag on the Entries field.
2026-03-09 12:37:59 -07:00
Chris Lu
55bce53953 reduce logs 2026-03-09 12:14:25 -07:00
Chris Lu
992db11d2b iam: add IAM group management (#8560)
* iam: add Group message to protobuf schema

Add Group message (name, members, policy_names, disabled) and
add groups field to S3ApiConfiguration for IAM group management
support (issue #7742).

* iam: add group CRUD to CredentialStore interface and all backends

Add group management methods (CreateGroup, GetGroup, DeleteGroup,
ListGroups, UpdateGroup) to the CredentialStore interface with
implementations for memory, filer_etc, postgres, and grpc stores.
Wire group loading/saving into filer_etc LoadConfiguration and
SaveConfiguration.

* iam: add group IAM response types

Add XML response types for group management IAM actions:
CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser.

* iam: add group management handlers to embedded IAM API

Add CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, and ListGroupsForUser handlers with
dispatch in ExecuteAction.

* iam: add group management handlers to standalone IAM API

Add group handlers (CreateGroup, DeleteGroup, GetGroup, ListGroups,
AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser) and wire into DoActions
dispatch. Also add helper functions for user/policy side effects.

* iam: integrate group policies into authorization

Add groups and userGroups reverse index to IdentityAccessManagement.
Populate both maps during ReplaceS3ApiConfiguration and
MergeS3ApiConfiguration. Modify evaluateIAMPolicies to evaluate
policies from user's enabled groups in addition to user policies.
Update VerifyActionPermission to consider group policies when
checking hasAttachedPolicies.

* iam: add group side effects on user deletion and rename

When a user is deleted, remove them from all groups they belong to.
When a user is renamed, update group membership references. Applied
to both embedded and standalone IAM handlers.

* iam: watch /etc/iam/groups directory for config changes

Add groups directory to the filer subscription watcher so group
file changes trigger IAM configuration reloads.

* admin: add group management page to admin UI

Add groups page with CRUD operations, member management, policy
attachment, and enable/disable toggle. Register routes in admin
handlers and add Groups entry to sidebar navigation.

* test: add IAM group management integration tests

Add comprehensive integration tests for group CRUD, membership,
policy attachment, policy enforcement, disabled group behavior,
user deletion side effects, and multi-group membership. Add
"group" test type to CI matrix in s3-iam-tests workflow.

* iam: address PR review comments for group management

- Fix XSS vulnerability in groups.templ: replace innerHTML string
  concatenation with DOM APIs (createElement/textContent) for rendering
  member and policy lists
- Use userGroups reverse index in embedded IAM ListGroupsForUser for
  O(1) lookup instead of iterating all groups
- Add buildUserGroupsIndex helper in standalone IAM handlers; use it
  in ListGroupsForUser and removeUserFromAllGroups for efficient lookup
- Add note about gRPC store load-modify-save race condition limitation

* iam: add defensive copies, validation, and XSS fixes for group management

- Memory store: clone groups on store/retrieve to prevent mutation
- Admin dash: deep copy groups before mutation, validate user/policy exists
- HTTP handlers: translate credential errors to proper HTTP status codes,
  use *bool for Enabled field to distinguish missing vs false
- Groups templ: use data attributes + event delegation instead of inline
  onclick for XSS safety, prevent stale async responses

* iam: add explicit group methods to PropagatingCredentialStore

Add CreateGroup, GetGroup, DeleteGroup, ListGroups, and UpdateGroup
methods instead of relying on embedded interface fallthrough. Group
changes propagate via filer subscription so no RPC propagation needed.

* iam: detect postgres unique constraint violation and add groups index

Return ErrGroupAlreadyExists when INSERT hits SQLState 23505 instead of
a generic error. Add index on groups(disabled) for filtered queries.

* iam: add Marker field to group list response types

Add Marker string field to GetGroupResult, ListGroupsResult,
ListAttachedGroupPoliciesResult, and ListGroupsForUserResult to
match AWS IAM pagination response format.

* iam: check group attachment before policy deletion

Reject DeletePolicy if the policy is attached to any group, matching
AWS IAM behavior. Add PolicyArn to ListAttachedGroupPolicies response.

* iam: include group policies in IAM authorization

Merge policy names from user's enabled groups into the IAMIdentity
used for authorization, so group-attached policies are evaluated
alongside user-attached policies.

* iam: check for name collision before renaming user in UpdateUser

Scan identities and inline policies for newUserName before mutating,
returning EntityAlreadyExists if a collision is found. Reuse the
already-loaded policies instead of loading them again inside the loop.

* test: use t.Cleanup for bucket cleanup in group policy test

* iam: wrap ErrUserNotInGroup sentinel in RemoveGroupMember error

Wrap credential.ErrUserNotInGroup so errors.Is works in
groupErrorToHTTPStatus, returning proper 400 instead of 500.

* admin: regenerate groups_templ.go with XSS-safe data attributes

Regenerated from groups.templ which uses data-group-name attributes
instead of inline onclick with string interpolation.

* iam: add input validation and persist groups during migration

- Validate nil/empty group name in CreateGroup and UpdateGroup
- Save groups in migrateToMultiFile so they survive legacy migration

* admin: use groupErrorToHTTPStatus in GetGroupMembers and GetGroupPolicies

* iam: short-circuit UpdateUser when newUserName equals current name

* iam: require empty PolicyNames before group deletion

Reject DeleteGroup when group has attached policies, matching the
existing members check. Also fix GetGroup error handling in
DeletePolicy to only skip ErrGroupNotFound, not all errors.

* ci: add weed/pb/** to S3 IAM test trigger paths

* test: replace time.Sleep with require.Eventually for propagation waits

Use polling with timeout instead of fixed sleeps to reduce flakiness
in integration tests waiting for IAM policy propagation.

* fix: use credentialManager.GetPolicy for AttachGroupPolicy validation

Policies created via CreatePolicy through credentialManager are stored
in the credential store, not in s3cfg.Policies (which only has static
config policies). Change AttachGroupPolicy to use credentialManager.GetPolicy()
for policy existence validation.

* feat: add UpdateGroup handler to embedded IAM API

Add UpdateGroup action to enable/disable groups and rename groups
via the IAM API. This is a SeaweedFS extension (not in AWS SDK) used
by tests to toggle group disabled status.

* fix: authenticate raw IAM API calls in group tests

The embedded IAM endpoint rejects anonymous requests. Replace
callIAMAPI with callIAMAPIAuthenticated that uses JWT bearer token
authentication via the test framework.

* feat: add UpdateGroup handler to standalone IAM API

Mirror the embedded IAM UpdateGroup handler in the standalone IAM API
for parity.

* fix: add omitempty to Marker XML tags in group responses

Non-truncated responses should not emit an empty <Marker/> element.

* fix: distinguish backend errors from missing policies in AttachGroupPolicy

Return ServiceFailure for credential manager errors instead of masking
them as NoSuchEntity. Also switch ListGroupsForUser to use s3cfg.Groups
instead of in-memory reverse index to avoid stale data. Add duplicate
name check to UpdateGroup rename.

* fix: standalone IAM AttachGroupPolicy uses persisted policy store

Check managed policies from GetPolicies() instead of s3cfg.Policies
so dynamically created policies are found. Also add duplicate name
check to UpdateGroup rename.

* fix: rollback inline policies on UpdateUser PutPolicies failure

If PutPolicies fails after moving inline policies to the new username,
restore both the identity name and the inline policies map to their
original state to avoid a partial-write window.

* fix: correct test cleanup ordering for group tests

Replace scattered defers with single ordered t.Cleanup in each test
to ensure resources are torn down in reverse-creation order:
remove membership, detach policies, delete access keys, delete users,
delete groups, delete policies. Move bucket cleanup to parent test
scope and delete objects before bucket.

* fix: move identity nil check before map lookup and refine hasAttachedPolicies

Move the nil check on identity before accessing identity.Name to
prevent panic. Also refine hasAttachedPolicies to only consider groups
that are enabled and have actual policies attached, so membership in
a no-policy group doesn't incorrectly trigger IAM authorization.

* fix: fail group reload on unreadable or corrupt group files

Return errors instead of logging and continuing when group files
cannot be read or unmarshaled. This prevents silently applying a
partial IAM config with missing group memberships or policies.

* fix: use errors.Is for sql.ErrNoRows comparison in postgres group store

* docs: explain why group methods skip propagateChange

Group changes propagate to S3 servers via filer subscription
(watching /etc/iam/groups/) rather than gRPC RPCs, since there
are no group-specific RPCs in the S3 cache protocol.

* fix: remove unused policyNameFromArn and strings import

* fix: update service account ParentUser on user rename

When renaming a user via UpdateUser, also update ParentUser references
in service accounts to prevent them from becoming orphaned after the
next configuration reload.

* fix: wrap DetachGroupPolicy error with ErrPolicyNotAttached sentinel

Use credential.ErrPolicyNotAttached so groupErrorToHTTPStatus maps
it to 400 instead of falling back to 500.

* fix: use admin S3 client for bucket cleanup in enforcement test

The user S3 client may lack permissions by cleanup time since the
user is removed from the group in an earlier subtest. Use the admin
S3 client to ensure bucket and object cleanup always succeeds.

* fix: add nil guard for group param in propagating store log calls

Prevent potential nil dereference when logging group.Name in
CreateGroup and UpdateGroup of PropagatingCredentialStore.

* fix: validate Disabled field in UpdateGroup handlers

Reject values other than "true" or "false" with InvalidInputException
instead of silently treating them as false.

* fix: seed mergedGroups from existing groups in MergeS3ApiConfiguration

Previously the merge started with empty group maps, dropping any
static-file groups. Now seeds from existing iam.groups before
overlaying dynamic config, and builds the reverse index after
merging to avoid stale entries from overridden groups.

* fix: use errors.Is for filer_pb.ErrNotFound comparison in group loading

Replace direct equality (==) with errors.Is() to correctly match
wrapped errors, consistent with the rest of the codebase.

* fix: add ErrUserNotFound and ErrPolicyNotFound to groupErrorToHTTPStatus

Map these sentinel errors to 404 so AddGroupMember and
AttachGroupPolicy return proper HTTP status codes.

* fix: log cleanup errors in group integration tests

Replace fire-and-forget cleanup calls with error-checked versions
that log failures via t.Logf for debugging visibility.

* fix: prevent duplicate group test runs in CI matrix

The basic lane's -run "TestIAM" regex also matched TestIAMGroup*
tests, causing them to run in both the basic and group lanes.
Replace with explicit test function names.

* fix: add GIN index on groups.members JSONB for membership lookups

Without this index, ListGroupsForUser and membership queries
require full table scans on the groups table.

* fix: handle cross-directory moves in IAM config subscription

When a file is moved out of an IAM directory (e.g., /etc/iam/groups),
the dir variable was overwritten with NewParentPath, causing the
source directory change to be missed. Now also notifies handlers
about the source directory for cross-directory moves.

* fix: validate members/policies before deleting group in admin handler

AdminServer.DeleteGroup now checks for attached members and policies
before delegating to credentialManager, matching the IAM handler guards.

* fix: merge groups by name instead of blind append during filer load

Match the identity loader's merge behavior: find existing group
by name and replace, only append when no match exists. Prevents
duplicates when legacy and multi-file configs overlap.

* fix: check DeleteEntry response error when cleaning obsolete group files

Capture and log resp.Error from filer DeleteEntry calls during
group file cleanup, matching the pattern used in deleteGroupFile.

* fix: verify source user exists before no-op check in UpdateUser

Reorder UpdateUser to find the source identity first and return
NoSuchEntityException if not found, before checking if the rename
is a no-op. Previously a non-existent user renamed to itself
would incorrectly return success.

* fix: update service account parent refs on user rename in embedded IAM

The embedded IAM UpdateUser handler updated group membership but
not service account ParentUser fields, unlike the standalone handler.

* fix: replay source-side events for all handlers on cross-dir moves

Pass nil newEntry to bucket, IAM, and circuit-breaker handlers for
the source directory during cross-directory moves, so all watchers
can clear caches for the moved-away resource.

* fix: don't seed mergedGroups from existing iam.groups in merge

Groups are always dynamic (from filer), never static (from s3.config).
Seeding from iam.groups caused stale deleted groups to persist.
Now only uses config.Groups from the dynamic filer config.

* fix: add deferred user cleanup in TestIAMGroupUserDeletionSideEffect

Register t.Cleanup for the created user so it gets cleaned up
even if the test fails before the inline DeleteUser call.

* fix: assert UpdateGroup HTTP status in disabled group tests

Add require.Equal checks for 200 status after UpdateGroup calls
so the test fails immediately on API errors rather than relying
on the subsequent Eventually timeout.

* fix: trim whitespace from group name in filer store operations

Trim leading/trailing whitespace from group.Name before validation
in CreateGroup and UpdateGroup to prevent whitespace-only filenames.
Also merge groups by name during multi-file load to prevent duplicates.

* fix: add nil/empty group validation in gRPC store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics and invalid persistence.

* fix: add nil/empty group validation in postgres store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil member access and empty-name row inserts.

* fix: add name collision check in embedded IAM UpdateUser

The embedded IAM handler renamed users without checking if the
target name already existed, unlike the standalone handler.

* fix: add ErrGroupNotEmpty sentinel and map to HTTP 409

AdminServer.DeleteGroup now wraps conflict errors with
ErrGroupNotEmpty, and groupErrorToHTTPStatus maps it to
409 Conflict instead of 500.

* fix: use appropriate error message in GetGroupDetails based on status

Return "Group not found" only for 404, use "Failed to retrieve group"
for other error statuses instead of always saying "Group not found".

* fix: use backend-normalized group.Name in CreateGroup response

After credentialManager.CreateGroup may normalize the name (e.g.,
trim whitespace), use group.Name instead of the raw input for
the returned GroupData to ensure consistency.

* fix: add nil/empty group validation in memory store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil pointer dereference on map access.

* fix: reorder embedded IAM UpdateUser to verify source first

Find the source identity before checking for collisions, matching
the standalone handler's logic. Previously a non-existent user
renamed to an existing name would get EntityAlreadyExists instead
of NoSuchEntity.

* fix: handle same-directory renames in metadata subscription

Replay a delete event for the old entry name during same-directory
renames so handlers like onBucketMetadataChange can clean up stale
state for the old name.

* fix: abort GetGroups on non-ErrGroupNotFound errors

Only skip groups that return ErrGroupNotFound. Other errors (e.g.,
transient backend failures) now abort the handler and return the
error to the caller instead of silently producing partial results.

* fix: add aria-label and title to icon-only group action buttons

Add accessible labels to View and Delete buttons so screen readers
and tooltips provide meaningful context.

* fix: validate group name in saveGroup to prevent invalid filenames

Trim whitespace and reject empty names before writing group JSON
files, preventing creation of files like ".json".

* fix: add /etc/iam/groups to filer subscription watched directories

The groups directory was missing from the watched directories list,
so S3 servers in a cluster would not detect group changes made by
other servers via filer. The onIamConfigChange handler already had
code to handle group directory changes but it was never triggered.

* add direct gRPC propagation for group changes to S3 servers

Groups now have the same dual propagation as identities and policies:
direct gRPC push via propagateChange + async filer subscription.

- Add PutGroup/RemoveGroup proto messages and RPCs
- Add PutGroup/RemoveGroup in-memory cache methods on IAM
- Add PutGroup/RemoveGroup gRPC server handlers
- Update PropagatingCredentialStore to call propagateChange on group mutations

* reduce log verbosity for config load summary

Change ReplaceS3ApiConfiguration log from Infof to V(1).Infof
to avoid noisy output on every config reload.

* admin: show user groups in view and edit user modals

- Add Groups field to UserDetails and populate from credential manager
- Show groups as badges in user details view modal
- Add group management to edit user modal: display current groups,
  add to group via dropdown, remove from group via badge x button

* fix: remove duplicate showAlert that broke modal-alerts.js

admin.js defined showAlert(type, message) which overwrote the
modal-alerts.js version showAlert(message, type), causing broken
unstyled alert boxes. Remove the duplicate and swap all callers
in admin.js to use the correct (message, type) argument order.

* fix: unwrap groups API response in edit user modal

The /api/groups endpoint returns {"groups": [...]}, not a bare array.

* Update object_store_users_templ.go

* test: assert AccessDenied error code in group denial tests

Replace plain assert.Error checks with awserr.Error type assertion
and AccessDenied code verification, matching the pattern used in
other IAM integration tests.

* fix: propagate GetGroups errors in ShowGroups handler

getGroupsPageData was swallowing errors and returning an empty page
with 200 status. Now returns the error so ShowGroups can respond
with a proper error status.

* fix: reject AttachGroupPolicy when credential manager is nil

Previously skipped policy existence validation when credentialManager
was nil, allowing attachment of nonexistent policies. Now returns
a ServiceFailureException error.

* fix: preserve groups during partial MergeS3ApiConfiguration updates

UpsertIdentity calls MergeS3ApiConfiguration with a partial config
containing only the updated identity (nil Groups). This was wiping
all in-memory group state. Now only replaces groups when
config.Groups is non-nil (full config reload).

* fix: propagate errors from group lookup in GetObjectStoreUserDetails

ListGroups and GetGroup errors were silently ignored, potentially
showing incomplete group data in the UI.

* fix: use DOM APIs for group badge remove button to prevent XSS

Replace innerHTML with onclick string interpolation with DOM
createElement + addEventListener pattern. Also add aria-label
and title to the add-to-group button.

* fix: snapshot group policies under RLock to prevent concurrent map access

evaluateIAMPolicies was copying the map reference via groupMap :=
iam.groups under RLock then iterating after RUnlock, while PutGroup
mutates the map in-place. Now copies the needed policy names into
a slice while holding the lock.

* fix: add nil IAM check to PutGroup and RemoveGroup gRPC handlers

Match the nil guard pattern used by PutPolicy/DeletePolicy to
prevent nil pointer dereference when IAM is not initialized.
2026-03-09 11:54:32 -07:00
Chris Lu
78a3441b30 fix: volume balance detection returns multiple tasks per run (#8559)
* fix: volume balance detection now returns multiple tasks per run (#8551)

Previously, detectForDiskType() returned at most 1 balance task per disk
type, making the MaxJobsPerDetection setting ineffective. The detection
loop now iterates within each disk type, planning multiple moves until
the imbalance drops below threshold or maxResults is reached. Effective
volume counts are adjusted after each planned move so the algorithm
correctly re-evaluates which server is overloaded.

* fix: factor pending tasks into destination scoring and use UnixNano for task IDs

- Use UnixNano instead of Unix for task IDs to avoid collisions when
  multiple tasks are created within the same second
- Adjust calculateBalanceScore to include LoadCount (pending + assigned
  tasks) in the utilization estimate, so the destination picker avoids
  stacking multiple planned moves onto the same target disk

* test: add comprehensive balance detection tests for complex scenarios

Cover multi-server convergence, max-server shifting, destination
spreading, pre-existing pending task skipping, no-duplicate-volume
invariant, and parameterized convergence verification across different
cluster shapes and thresholds.

* fix: address PR review findings in balance detection

- hasMore flag: compute from len(results) >= maxResults so the scheduler
  knows more pages may exist, matching vacuum/EC handler pattern
- Exhausted server fallthrough: when no eligible volumes remain on the
  current maxServer (all have pending tasks) or destination planning
  fails, mark the server as exhausted and continue to the next
  overloaded server instead of stopping the entire detection loop
- Return canonical destination server ID directly from createBalanceTask
  instead of resolving via findServerIDByAddress, eliminating the
  fragile address→ID lookup for adjustment tracking
- Fix bestScore sentinel: use math.Inf(-1) instead of -1.0 so disks
  with negative scores (high pending load, same rack/DC) are still
  selected as the best available destination
- Add TestDetection_ExhaustedServerFallsThrough covering the scenario
  where the top server's volumes are all blocked by pre-existing tasks

* test: fix computeEffectiveCounts and add len guard in no-duplicate test

- computeEffectiveCounts now takes a servers slice to seed counts for all
  known servers (including empty ones) and uses an address→ID map from
  the topology spec instead of scanning metrics, so destination servers
  with zero initial volumes are tracked correctly
- TestDetection_NoDuplicateVolumesAcrossIterations now asserts len > 1
  before checking duplicates, so the test actually fails if Detection
  regresses to returning a single task

* fix: remove redundant HasAnyTask check in createBalanceTask

The HasAnyTask check in createBalanceTask duplicated the same check
already performed in detectForDiskType's volume selection loop.
Since detection runs single-threaded (MaxDetectionConcurrency: 1),
no race can occur between the two points.

* fix: consistent hasMore pattern and remove double-counted LoadCount in scoring

- Adopt vacuum_handler's hasMore pattern: over-fetch by 1, check
  len > maxResults, and truncate — consistent truncation semantics
- Remove direct LoadCount penalty in calculateBalanceScore since
  LoadCount is already factored into effectiveVolumeCount for
  utilization scoring; bump utilization weight from 40 to 50 to
  compensate for the removed 10-point load penalty

* fix: handle zero maxResults as no-cap, emit trace after trim, seed empty servers

- When MaxResults is 0 (omitted), treat as no explicit cap instead of
  defaulting to 1; only apply the +1 over-fetch probe when caller
  supplies a positive limit
- Move decision trace emission after hasMore/trim so the trace
  accurately reflects the returned proposals
- Seed serverVolumeCounts from ActiveTopology so servers that have a
  matching disk type but zero volumes are included in the imbalance
  calculation and MinServerCount check

* fix: nil-guard clusterInfo, uncap legacy DetectionFunc, deterministic disk type order

- Add early nil guard for clusterInfo in Detection to prevent panics
  in downstream helpers (detectForDiskType, createBalanceTask)
- Change register.go DetectionFunc wrapper from maxResults=1 to 0
  (no cap) so the legacy code path returns all detected tasks
- Sort disk type keys before iteration so results are deterministic
  when maxResults spans multiple disk types (HDD/SSD)

* fix: don't over-fetch in stateful detection to avoid orphaned pending tasks

Detection registers planned moves in ActiveTopology via AddPendingTask,
so requesting maxResults+1 would create an extra pending task that gets
discarded during trim. Use len(results) >= maxResults as the hasMore
signal instead, which is correct since Detection already caps internally.

* fix: return explicit truncated flag from Detection instead of approximating

Detection now returns (results, truncated, error) where truncated is true
only when the loop stopped because it hit maxResults, not when it ran out
of work naturally. This eliminates false hasMore signals when detection
happens to produce exactly maxResults results by resolving the imbalance.

* cleanup: simplify detection logic and remove redundancies

- Remove redundant clusterInfo nil check in detectForDiskType since
  Detection already guards against nil clusterInfo
- Remove adjustments loop for destination servers not in
  serverVolumeCounts — topology seeding ensures all servers with
  matching disk type are already present
- Merge two-loop min/max calculation into a single loop: min across
  all servers, max only among non-exhausted servers
- Replace magic number 100 with len(metrics) for minC initialization
  in convergence test

* fix: accurate truncation flag, deterministic server order, indexed volume lookup

- Track balanced flag to distinguish "hit maxResults cap" from "cluster
  balanced at exactly maxResults" — truncated is only true when there's
  genuinely more work to do
- Sort servers for deterministic iteration and tie-breaking when
  multiple servers have equal volume counts
- Pre-index volumes by server with per-server cursors to avoid
  O(maxResults * volumes) rescanning on each iteration
- Add truncation flag assertions to RespectsMaxResults test: true when
  capped, false when detection finishes naturally

* fix: seed trace server counts from ActiveTopology to match detection logic

The decision trace was building serverVolumeCounts only from metrics,
missing zero-volume servers seeded from ActiveTopology by Detection.
This could cause the trace to report wrong server counts, incorrect
imbalance ratios, or spurious "too few servers" messages. Pass
activeTopology into the trace function and seed server counts the
same way Detection does.

* fix: don't exhaust server on per-volume planning failure, sort volumes by ID

- When createBalanceTask returns nil, continue to the next volume on
  the same server instead of marking the entire server as exhausted.
  The failure may be volume-specific (not found in topology, pending
  task registration failed) and other volumes on the server may still
  be viable candidates.
- Sort each server's volume slice by VolumeID after pre-indexing so
  volume selection is fully deterministic regardless of input order.

* fix: use require instead of assert to prevent nil dereference panic in CORS test

The test used assert.NoError (non-fatal) for GetBucketCors, then
immediately accessed getResp.CORSRules. When the API returns an error,
getResp is nil causing a panic. Switch to require.NoError/NotNil/Len
so the test stops before dereferencing a nil response.

* fix: deterministic disk tie-breaking and stronger pre-existing task test

- Sort available disks by NodeID then DiskID before scoring so
  destination selection is deterministic when two disks score equally
- Add task count bounds assertion to SkipsPreExistingPendingTasks test:
  with 15 of 20 volumes already having pending tasks, at most 5 new
  tasks should be created and at least 1 (imbalance still exists)

* fix: seed adjustments from existing pending/assigned tasks to prevent over-scheduling

Detection now calls ActiveTopology.GetTaskServerAdjustments() to
initialize the adjustments map with source/destination deltas from
existing pending and assigned balance tasks. This ensures
effectiveCounts reflects in-flight moves, preventing the algorithm
from planning additional moves in the same direction when prior
moves already address the imbalance.

Added GetTaskServerAdjustments(taskType) to ActiveTopology which
iterates pending and assigned tasks, decrementing source servers
and incrementing destination servers for the given task type.
2026-03-08 21:34:03 -07:00
Chris Lu
2ec0a67ee3 master: return 503/Unavailable during topology warmup after leader change (#8529)
* master: return 503/Unavailable during topology warmup after leader change

After a master restart or leader change, the topology is empty until
volume servers reconnect and send heartbeats. During this warmup window
(3 heartbeat intervals = 15 seconds), volume lookups that fail now
return 503 Service Unavailable (HTTP) or gRPC Unavailable instead of
404 Not Found, signaling clients to retry with other masters.

* master: skip warmup 503 on fresh start and single-master setups

- Check MaxVolumeId > 0 to distinguish restart from fresh start
  (MaxVolumeId is Raft-persisted, so 0 means no prior data)
- Check peer count > 1 so single-master deployments aren't affected
  (no point suggesting "retry with other masters" if there are none)

* master: address review feedback and block assigns during warmup

- Protect LastLeaderChangeTime with dedicated mutex (fix data race)
- Extract warmup multiplier as WarmupPulseMultiplier constant
- Derive Retry-After header from pulse config instead of hardcoding
- Only trigger warmup 503 for "not found" errors, not parse errors
- Return nil response (not partial) on gRPC Unavailable
- Add doc comments to IsWarmingUp, getter/setter, WarmupDuration
- Block volume assign requests (HTTP and gRPC) during warmup,
  since the topology is incomplete and assignments would be unreliable
- Skip warmup behavior for single-master setups (no peers to retry)

* master: apply warmup to all setups, skip only on fresh start

Single-master restarts still have an empty topology until heartbeats
arrive, so warmup protection should apply there too. The only case
to skip is a fresh cluster start (MaxVolumeId == 0), which already
has no volumes to look up.

- Remove GetMasterCount() > 1 guard from all warmup checks
- Remove now-unused GetMasterCount helper
- Update error messages to "topology is still loading" (not
  "retry with other masters" which doesn't apply to single-master)

* master: add client-side retry on Unavailable for lookup and assign

The server-side 503/Unavailable during warmup needs client cooperation.
Previously, LookupVolumeIds and Assign would immediately propagate the
error without retry.

Now both paths retry with exponential backoff (1s -> 1.5s -> ... up to
6s) when receiving Unavailable, respecting context cancellation. This
covers the warmup window where the master's topology is still loading
after a restart or leader change.

* master: seed warmup timestamp in legacy raft path at setup

The legacy raft path only set lastLeaderChangeTime inside the event
listener callback, which could fire after IsLeader() was already
observed as true in SetRaftServer. Seed the timestamp at setup time
(matching the hashicorp path) so IsWarmingUp() is active immediately.

* master: fix assign retry loop to cover full warmup window

The retry loop used waitTime <= maxWaitTime as a stop condition,
causing it to give up after ~13s while warmup lasts 15s. Now cap
each individual sleep at maxWaitTime but keep retrying until the
context is cancelled.

* master: preserve gRPC status in lookup retry and fix retry window

Return the raw gRPC error instead of wrapping with fmt.Errorf so
status.FromError() can extract the status code. Use proper gRPC
status check (codes.Unavailable) instead of string matching. Also
cap individual sleep at maxWaitTime while retrying until ctx is done.

* master: use gRPC status code instead of string matching in assign retry

Use status.FromError/codes.Unavailable instead of brittle
strings.Contains for detecting retriable gRPC errors in the
assign retry loop.

* master: use remaining warmup duration for Retry-After header

Set Retry-After to the remaining warmup time instead of the full
warmup duration, so clients don't wait longer than necessary.

* master: reset ret.Replicas before populating from assign response

Clear Replicas slice before appending to prevent duplicate entries
when the assign response is retried or when alternative requests
are attempted.

* master: add unit tests for warmup retry behavior

Test that Assign() and LookupVolumeIds() retry on codes.Unavailable
and stop promptly when the context is cancelled.

* master: record leader change time before initialization work

Move SetLastLeaderChangeTime() to fire immediately when the leader
change event is received, before DoBarrier(), EnsureTopologyId(),
and updatePeers(), so the warmup clock starts at the true moment
of leadership transition.

* master: use topology warmup duration in volume growth wait loop

Replace hardcoded constants.VolumePulsePeriod * 2 with
topo.IsWarmingUp() and topo.WarmupDuration() so the growth wait
stays in sync with the configured warmup window. Remove unused
constants import.

* master: resolve master before creating RPC timeout context

Move GetMaster() call before context.WithTimeout() so master
resolution blocking doesn't consume the gRPC call timeout.

* master: use NotFound flag instead of string matching for volume lookup

Add a NotFound field to LookupResult and set it in findVolumeLocation
when a volume is genuinely missing. Update HTTP and gRPC warmup
checks to use this flag instead of strings.Contains on the error
message.

* master: bound assign retry loop to 30s for deadline-free contexts

Without a context deadline, the Unavailable retry loop could spin
forever. Add a maxRetryDuration of 30s so the loop gives up even
when no context deadline is set.

* master: strengthen assign retry cancellation test

Verify the retry loop actually retried (callCount > 1) and that
the returned error is context.DeadlineExceeded, not just any error.

* master: extract shared retry-with-backoff utility

Add util.RetryWithBackoff for context-aware, bounded retry with
exponential backoff. Refactor both Assign() and LookupVolumeIds()
to use it instead of duplicating the retry/sleep/backoff logic.

* master: cap waitTime in RetryWithBackoff to prevent unbounded growth

Cap the backoff waitTime at maxWaitTime so it doesn't grow
indefinitely in long-running retry scenarios.

* master: only return Unavailable during warmup when all lookups failed

For batched LookupVolume requests, return partial results when some
volumes are found. Only return codes.Unavailable when no volumes
were successfully resolved, so clients benefit from partial results
instead of retrying unnecessarily.

* master: set retriable error message in 503 response body

When returning 503 during warmup, replace the "not found" error
in the JSON body with "service warming up, please retry" so
clients don't treat it as a permanent error.

* master: guard empty master address in LookupVolumeIds

If GetMaster() returns empty (no master found or ctx cancelled),
return an appropriate error instead of dialing an empty address.
Returns ctx.Err() if context is done, otherwise codes.Unavailable
to trigger retry.

* master: add comprehensive tests for RetryWithBackoff

Test success after retries, non-retryable error handling, context
cancellation, and maxDuration cap with context.Background().

* master: enforce hard maxDuration bound in RetryWithBackoff

Use a deadline instead of elapsed-time check so the last sleep is
capped to remaining time. This prevents the total retry duration
from overshooting maxDuration by up to one full backoff interval.

* master: respect fresh-start bypass in RemainingWarmupDuration

Check IsWarmingUp() first (which returns false when MaxVolumeId==0)
so RemainingWarmupDuration returns 0 on fresh clusters.

* master: round up Retry-After seconds to avoid underestimating

Use math.Ceil so fractional remaining seconds (e.g. 1.9s) round
up to the next integer (2) instead of flooring down (1).

* master: tighten batch lookup warmup to all-NotFound only

Only return codes.Unavailable when every requested volume ID was
a transient not-found. Mixed cases with non-NotFound errors now
return the response with per-volume error details preserved.

* master: reduce retry log noise and fix timer leak

Lower per-attempt retry log from V(0) to V(1) to reduce noise
during warmup. Replace time.After with time.NewTimer to avoid
lingering timers when context is cancelled.

* master: add per-attempt timeout for assign RPC

Use a 10s per-attempt timeout so a single slow RPC can't consume
the entire 30s retry budget when ctx has no deadline.

* master: share single 30s retry deadline across assign request entries

The Assign() function iterates over primary and fallback requests,
previously giving each its own 30s RetryWithBackoff budget. With a
primary + fallback, the total could reach 60s. Compute one deadline
up front and pass the remaining budget to each RetryWithBackoff call
so the entire Assign() call stays within a single 30s cap.

* master: strengthen context-cancel test with DeadlineExceeded and retry assertions

Assert errors.Is(err, context.DeadlineExceeded) to verify the error
is specifically from the context deadline, and check callCount > 1
to prove retries actually occurred before cancellation. Mirrors the
pattern used in TestAssignStopsOnContextCancel.

* master: bound GetMaster with per-attempt timeout in LookupVolumeIds

GetMaster() calls WaitUntilConnected() which can block indefinitely
if no master is available. Previously it used the outer ctx, so a
slow master resolution could consume the entire RetryWithBackoff
budget in a single attempt. Move the per-attempt timeoutCtx creation
before the GetMaster call so both master resolution and the gRPC
LookupVolume RPC share one grpcTimeout-bounded attempt.

* master: use deadline-aware context for assign retry budget

The shared 30s deadline only limited RetryWithBackoff's internal
wall-clock tracking, but per-attempt contexts were still derived
from the original ctx and could run for up to 10s even when the
budget was nearly exhausted. Create a deadlineCtx from the computed
deadline and derive both RetryWithBackoff and per-attempt timeouts
from it so all operations honor the shared 30s cap.

* master: skip warmup gate for empty lookup requests

When VolumeOrFileIds is empty, notFoundCount == len(req.VolumeOrFileIds)
is 0 == 0 which is true, causing empty lookup batches during warmup to
return codes.Unavailable and be retried endlessly. Add a
len(req.VolumeOrFileIds) > 0 guard so empty requests pass through.

* master: validate request fields before warmup gate in Assign

Move Replication and Ttl parsing before the IsWarmingUp() check so
invalid inputs get a proper validation error instead of being masked
by codes.Unavailable during warmup. Pure syntactic validation does
not depend on topology state and should run first.

* master: check deadline and context before starting retry attempt

RetryWithBackoff only checked the deadline and context after an
attempt completed or during the sleep select. If the deadline
expired or context was canceled during sleep, the next iteration
would still call operation() before detecting it. Add pre-operation
checks so no new attempt starts after the budget is exhausted.

* master: always return ctx.Err() on context cancellation in RetryWithBackoff

When ctx.Err() is non-nil, the pre-operation check was returning
lastErr instead of ctx.Err(). This broke callers checking
errors.Is(err, context.DeadlineExceeded) and contradicted the
documented contract. Always return ctx.Err() so the cancellation
reason is properly surfaced.

* master: handle warmup errors in StreamAssign without killing the stream

StreamAssign was returning codes.Unavailable errors from Assign
directly, which terminates the gRPC stream and breaks pooled
connections. Instead, return transient errors as in-band error
responses so the stream survives warmup periods.

Also reset assignClient in doAssign on Send/Recv failures so a
broken stream doesn't leave the proxy permanently dead.

* master: wait for warmup before slot search in findAndGrow

findEmptySlotsForOneVolume was called before the warmup wait loop,
selecting slots from an incomplete topology. Move the warmup wait
before slot search so volume placement uses the fully warmed-up
topology with all servers registered.

* master: add Retry-After header to /dir/assign warmup response

The /dir/lookup handler already sets Retry-After during warmup but
/dir/assign did not, leaving HTTP clients without guidance on when
to retry. Add the same header using RemainingWarmupDuration().

* master: only seed warmup timestamp on leader at startup

SetLastLeaderChangeTime was called unconditionally for both leader
and follower nodes. Followers don't need warmup state, and the
leader change event listener handles real elections. Move the seed
into the IsLeader() block so only the startup leader gets warmup
initialized.

* master: preserve codes.Unavailable for StreamAssign warmup errors in doAssign

StreamAssign returns transient warmup errors as in-band
AssignResponse.Error messages. doAssign was converting these to plain
fmt.Errorf, losing the codes.Unavailable classification needed for
the caller's retry logic. Detect warmup error messages and wrap them
as status.Error(codes.Unavailable) so RetryWithBackoff can retry.
2026-03-08 16:05:45 -07:00
Chris Lu
0647f66bb5 filer.sync: add exponential backoff on unexpected EOF during replication (#8557)
* filer.sync: add exponential backoff on unexpected EOF during replication

When the source volume server drops connections under high traffic,
filer.sync retries aggressively (every 1-6s), hammering the already
overloaded source. This adds a longer exponential backoff (10s to 2min)
specifically for "unexpected EOF" errors, reducing pressure on the
source while still retrying indefinitely until success.

Also adds more logging throughout the replication path:
- Log source URL and error at V(0) when ReadPart or io.ReadAll fails
- Log content-length and byte counts at V(4) on success
- Log backoff duration in retry messages

Fixes #8542

* filer.sync: extract backoff helper and fix 2-minute cap

- Extract nextEofBackoff() and isEofError() helpers to deduplicate
  the backoff logic between fetchAndWrite and uploadManifestChunk
- Fix the cap: previously 80s would double to 160s and pass the
  < 2min check uncapped. Now doubles first, then clamps to 2min.

* filer.sync: log source URL instead of empty upload URL on read errors

UploadUrl is not populated until after the reader is consumed, so the
V(0) and V(4) logs were printing an empty string. Add SourceUrl field
to UploadOption and populate it from the HTTP response in fetchAndWrite.

* filer.sync: guard isEofError against nil error

* filer.sync: use errors.Is for EOF detection, fix log wording

- Replace broad substring matching ("read input", "unexpected EOF")
  with errors.Is(err, io.ErrUnexpectedEOF) and errors.Is(err, io.EOF)
  so only actual EOF errors trigger the longer backoff
- Fix awkward log phrasing: "interrupted replicate" → "interrupted
  while replicating"

* filer.sync: remove EOF backoff from uploadManifestChunk

uploadManifestChunk reads from an in-memory bytes.Reader, so any EOF
errors there are from the destination side, not a broken source stream.
The long source-oriented backoff is inappropriate; let RetryUntil
handle destination retries at its normal cadence.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-08 14:33:37 -07:00
Chris Lu
ba66411337 Update plugin_templ.go 2026-03-08 14:29:06 -07:00
Chris Lu
7808b301ef admin: remove Scheduler Settings cards from plugin UI (#8558)
* admin: remove Scheduler Settings cards, make Next Run full-width

Remove the two "Scheduler Settings" placeholder cards from the plugin
UI (overview page and scheduler tab). They only contained a text note
saying detection intervals are configured per job type, which is
self-evident from the per-job-type settings form.

Make the "Next Run" card full-width on the overview page since it no
longer shares a row with the removed card.

* plugin UI: promote Next Run to top summary card row

Move "Next Run" from a standalone card into the top row alongside
Workers, Active Jobs, and Activities as a compact stat card.
2026-03-08 14:27:57 -07:00
Chris Lu
fa7da0f57e template 2026-03-08 14:05:42 -07:00
Chris Lu
961c270aba admin: expose per-job-type detection interval in plugin UI (#8552)
* admin: expose per-job-type detection interval in plugin UI

The detection_interval_seconds field was not editable in the admin UI.
collectAdminSettings() silently preserved the existing value, making it
impossible for users to change how often a job type checks for new work.
Users would change the global "Sleep Between Iterations" setting expecting
it to control job scheduling frequency, but that only controls the
scheduler loop's idle polling rate.

Add a "Detection Interval (s)" input to the per-job-type admin settings
form so users can actually configure it.

Fixes #8549

* admin: remove global Sleep Between Iterations setting

Now that per-job-type detection intervals are exposed in the UI, the
global IdleSleepSeconds setting is redundant and confusing. It only
controlled the scheduler loop's idle polling rate, which is always
overridden by earliestNextDetectionAt() when job types exist.

Replace the three usages with simpler alternatives:
- Scheduler loop sleep: use defaultSchedulerIdleSleep constant
- Initial delay for new job types: use policy.DetectionInterval/2
  (more logical since it's already per-job-type)
- Status fallback: use the constant

The API endpoints are kept for backward compatibility but the UI
no longer exposes or calls them.

* admin: restore configurable idle sleep in scheduler loop

The EC integration test sets idle_sleep_seconds=1 via the scheduler
config API so the scheduler wakes quickly after workers connect. The
previous commit replaced this with a hardcoded 613s constant, causing
the scheduler to sleep through the entire test window.

Restore GetSchedulerConfig().IdleSleepDuration() in the scheduler loop
and status reporting. The UI removal of the setting is still correct —
the API endpoint remains for programmatic use (e.g., tests).

* admin: cap first-run initial delay to 5s instead of DetectionInterval/2

The initial delay for first-run job types was set to
policy.DetectionInterval/2, which creates unbounded first-run latency
(e.g., 1 hour for vacuum with a 2-hour detection interval). A small
fixed 5-second delay provides sufficient stagger without penalizing
startup time.
2026-03-08 14:03:51 -07:00
Chris Lu
e25558e4d8 admin: fix mobile sidebar menu inaccessible in portrait mode (#8556)
* admin: fix mobile sidebar menu inaccessible in portrait mode

The hamburger button only toggled the user dropdown, leaving the
sidebar navigation inaccessible on mobile devices in portrait mode.

Add a dedicated sidebar toggle button (visible only on mobile), give
the sidebar an id so Bootstrap collapse can target it, add a backdrop
overlay for the open state, and auto-close the sidebar when a nav
link is clicked.

Fixes #8550

* admin: address review feedback on mobile sidebar

- Remove redundant JS show/hide.bs.collapse listeners; CSS sibling
  selector already handles backdrop visibility
- Use const instead of var for non-reassigned variables
- Move inline style on user icon to CSS class

* admin: add aria attributes to user-menu toggler, use CSS variable for navbar height

- Add aria-controls, aria-expanded, and aria-label to the user-menu
  toggle button for assistive technology
- Extract hard-coded 56px navbar height into --navbar-height CSS
  custom property used by sidebar and backdrop positioning

* admin: extract hideSidebar helper, use toggler visibility for breakpoint check

- Extract duplicated collapse-hide logic into a hideSidebar helper
- Replace hardcoded window.innerWidth < 768 with a check on the
  sidebar toggler's computed display, decoupling JS from CSS breakpoints
- Add aria-expanded="false" to sidebar toggle button

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-08 12:32:14 -07:00
Chris Lu
587c24ec89 plugin worker: support job type categories (all, default, heavy) (#8547)
* plugin worker: add handler registry with job categories

Introduce a self-registration pattern for plugin worker job handlers.
Each handler can register itself via init() with a HandlerFactory that
declares its job type, category (default/heavy), CLI aliases, and a
builder function.

ResolveHandlerFactories accepts a mix of category names ("all",
"default", "heavy") and explicit job type names/aliases, returning the
matching factories. This enables workers to be configured by resource
profile rather than requiring explicit job type enumeration.

* plugin worker: register all handlers via init()

Each job handler now self-registers into the global handler registry
with its canonical job type, category, CLI aliases, and build function:

  - vacuum:              category=default
  - volume_balance:      category=default
  - admin_script:        category=default
  - erasure_coding:      category=heavy
  - iceberg_maintenance: category=heavy

Adding a new job type now only requires adding the init() call in the
handler file itself — no other files need to be touched.

* plugin worker: replace hardcoded job type switch with registry

Remove buildPluginWorkerHandler, parsePluginWorkerJobTypes, and
canonicalPluginWorkerJobType from worker_runtime.go. The simplified
buildPluginWorkerHandlers now delegates to
pluginworker.ResolveHandlerFactories, which resolves category names
("all", "default", "heavy") and explicit job type names/aliases.

The default job type is changed from an explicit list to "all", so new
handlers registered via init() are automatically picked up.

Update all tests to use the new API.

* plugin worker: update CLI help text for job categories

Update the -jobType flag description and command examples to document
category support (all, default, heavy) alongside explicit job type names.

* plugin worker: address review feedback

- Add CategoryAll constant; use typed constants in tokenAsCategory
- Pre-allocate result slice in ResolveHandlerFactories
- Add vacuum aliases (vol.vacuum, volume.vacuum)
- List alias examples (ec, balance, iceberg) in -jobType flag help
- Create handlers aggregator package for subpackage blank imports so
  new handler subpackages only need to be added in one place
- Make category tests relationship-based (subset/union checks) instead
  of asserting exact handler counts
- Add clarifying comments to worker_test.go and mini_plugin_test.go
  listing expected handler names next to count assertions

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-07 18:30:58 -08:00
Chris Lu
72c2c7ef8b Add iceberg_maintenance plugin worker handler (Phase 1) (#8501)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MaxSnapshotsToKeep

Use int64(wouldKeep) instead of int(config.MaxSnapshotsToKeep) to
avoid potential truncation on 32-bit platforms (CodeQL high severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MinInputFiles

Use int64(len(manifests)) instead of int(config.MinInputFiles) to
avoid potential truncation on 32-bit platforms (CodeQL high severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MaxCommitRetries

Clamp MaxCommitRetries to [1,20] range and keep as int64 throughout
the retry loop to avoid truncation on 32-bit platforms (CodeQL high
severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Sort snapshots explicitly by timestamp in expireSnapshots

The previous logic relied on implicit ordering of the snapshot list.
Now explicitly sorts snapshots by timestamp descending (most recent
first) and uses a simpler keep-count loop: keep the first
MaxSnapshotsToKeep newest snapshots plus the current snapshot
unconditionally, then expire the rest that exceed the retention window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle errors properly in listFilerEntries

Previously all errors from ListEntries and Recv were silently swallowed.
Now: treat "not found" errors as empty directory, propagate other
ListEntries errors, and check for io.EOF explicitly on Recv instead of
breaking on any error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix overly broad HasSuffix check in orphan detection

The bare strings.HasSuffix(ref, entry.Name) could match files with
similar suffixes (e.g. "123.avro" matching "snap-123.avro"). Replaced
with exact relPath match and a "/"-prefixed suffix check to avoid
false positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Replace fmt.Sscanf with strconv.Atoi in extractMetadataVersion

strconv.Atoi is more explicit and less fragile than fmt.Sscanf for
parsing a simple integer from a trimmed string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Recursively traverse directories for orphan file detection

The orphan cleanup only listed a single directory level under data/
and metadata/, skipping IsDirectory entries. Partitioned Iceberg
tables store data files in nested partition directories (e.g.
data/region=us-east/file.parquet) which were never evaluated.

Add walkFilerEntries helper that recursively descends into
subdirectories, and use it in removeOrphans so all nested files
are considered for orphan checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix manifest path drift from double time.Now() calls

rewriteManifests called time.Now().UnixMilli() twice: once for the
path embedded in WriteManifest and once for the filename passed to
saveFilerFile. These timestamps would differ, causing the manifest's
internal path reference to not match the actual saved filename.

Compute the filename once and reuse it for both WriteManifest and
saveFilerFile so they always reference the same path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add TestManifestRewritePathConsistency test

Verifies that WriteManifest returns a ManifestFile whose FilePath()
matches the path passed in, and that path.Base() of that path matches
the filename used for saveFilerFile. This validates the single-
timestamp pattern used in rewriteManifests produces consistent paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make parseOperations return error on unknown operations

Previously parseOperations silently dropped unknown operation names
and could return an empty list. Now validates inputs against the
canonical set and returns a clear error if any unknown operation is
specified. Updated Execute to surface the error instead of proceeding
with an empty operation list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use gRPC status codes instead of string matching in listFilerEntries

Replace brittle strings.Contains(err.Error(), "not found") check with
status.Code(err) == codes.NotFound for proper gRPC error handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add stale-plan guard in commit closures for expireSnapshots and rewriteManifests

Both operations plan outside the commit mutation using a snapshot ID
captured from the initial metadata read. If the table head advances
concurrently, the mutation would create a snapshot parented to the
wrong head or remove snapshots based on a stale view.

Add a guard inside each mutation closure that verifies
currentMeta.CurrentSnapshot().SnapshotID still matches the planned
snapshot ID. If it differs, return errStalePlan which propagates
immediately (not retried, since the plan itself is invalid).

Also fix rewriteManifests to derive SequenceNumber from the fresh
metadata (cs.SequenceNumber) instead of the captured currentSnap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add compare-and-swap to updateTableMetadataXattr

updateTableMetadataXattr previously re-read the entry but did not
verify the metadataVersion matched what commitWithRetry had loaded.
A concurrent update could be silently clobbered.

Now accepts expectedVersion parameter and compares it against the
stored metadataVersion before writing. Returns errMetadataVersionConflict
on mismatch, which commitWithRetry treats as retryable (deletes the
staged metadata file and retries with fresh state).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Export shared plugin worker helpers for use by sub-packages

Export ShouldSkipDetectionByInterval, BuildExecutorActivity, and
BuildDetectorActivity so the iceberg sub-package can reuse them
without duplicating logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Refactor iceberg maintenance handler into weed/plugin/worker/iceberg package

Split the 1432-line iceberg_maintenance_handler.go into focused files
in a new iceberg sub-package: handler.go, config.go, detection.go,
operations.go, filer_io.go, and compact.go (Phase 2 data compaction).

Key changes:
- Rename types to drop stutter (IcebergMaintenanceHandler → Handler, etc.)
- Fix loadFileByIcebergPath to preserve nested directory paths via
  normalizeIcebergPath instead of path.Base which dropped subdirectories
- Check SendProgress errors instead of discarding them
- Add stale-plan guard to compactDataFiles commitWithRetry closure
- Add "compact" operation to parseOperations canonical order
- Duplicate readStringConfig/readInt64Config helpers (~20 lines)
- Update worker_runtime.go to import new iceberg sub-package

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove iceberg_maintenance from default plugin worker job types

Iceberg maintenance is not yet ready to be enabled by default.
Workers can still opt in by explicitly listing iceberg_maintenance
in their job types configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Clamp config values to safe minimums in ParseConfig

Prevents misconfiguration by enforcing minimum values using the
default constants for all config fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Harden filer I/O: path helpers, strict CAS guard, path traversal prevention

- Use path.Dir/path.Base instead of strings.SplitN in loadCurrentMetadata
- Make CAS guard error on missing or unparseable metadataVersion
- Add path.Clean and traversal validation in loadFileByIcebergPath

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix compact: single snapshot ID, oversized bin splitting, ensureFilerDir

- Use single newSnapID for all manifest entries in a compaction run
- Add splitOversizedBin to break bins exceeding targetSize
- Make ensureFilerDir only create on NotFound, propagate other errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add wildcard filters, scan limit, and context cancellation to table scanning

- Use wildcard matchers (*, ?) for bucket/namespace/table filters
- Add limit parameter to scanTablesForMaintenance for early termination
- Add ctx.Done() checks in bucket and namespace scan loops
- Update filter UI descriptions and placeholders for wildcard support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove dead detection interval check and validate namespace parameter

- Remove ineffective ShouldSkipDetectionByInterval call with hardcoded 0
- Add namespace to required parameter validation in Execute

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Improve operations: exponential backoff, orphan matching, full file cleanup

- Use exponential backoff (50ms, 100ms, 200ms, ...) in commitWithRetry
- Use normalizeIcebergPath for orphan matching instead of fragile suffix check
- Add collectSnapshotFiles to traverse manifest lists → manifests → data files
- Delete all unreferenced files after expiring snapshots, not just manifest lists
- Refactor removeOrphans to reuse collectSnapshotFiles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* iceberg: fix ensureFilerDir to handle filer_pb.ErrNotFound sentinel

filer_pb.LookupEntry converts gRPC NotFound errors to filer_pb.ErrNotFound
(a plain sentinel), so status.Code() never returns codes.NotFound for that
error. This caused ensureFilerDir to return an error instead of creating
the directory when it didn't exist.

* iceberg: clean up orphaned artifacts when compaction commit fails

Track all files written during compaction (merged data files, manifest,
manifest list) and delete them if the commit or any subsequent write step
fails, preventing orphaned files from accumulating in the filer.

* iceberg: derive tablePath from namespace/tableName when empty

An empty table_path parameter would be passed to maintenance operations
unchecked. Default it to path.Join(namespace, tableName) when not provided.

* iceberg: make collectSnapshotFiles return error on read/parse failure

Previously, errors reading manifests were logged and skipped, returning a
partial reference set. This could cause incorrect delete decisions during
snapshot expiration or orphan cleanup. Now the function returns an error
and all callers abort when reference data is incomplete.

* iceberg: include active metadata file in removeOrphans referenced set

The metadataFileName returned by loadCurrentMetadata was discarded, so
the active metadata file could be incorrectly treated as an orphan and
deleted. Capture it and add it to the referencedFiles map.

* iceberg: only retry commitWithRetry on metadata version conflicts

Previously all errors from updateTableMetadataXattr triggered retries.
Now only errMetadataVersionConflict causes retry; other errors (permissions,
transport, malformed xattr) fail immediately.

* iceberg: respect req.Limit in fakeFilerServer.ListEntries mock

The mock ListEntries ignored the Limit field, so tests couldn't exercise
pagination. Now it stops streaming once Limit entries have been sent.

* iceberg: validate parquet schema compatibility before merging files

mergeParquetFiles now compares each source file's schema against the
first file's schema and aborts with a clear error if they differ, instead
of blindly writing rows that could panic or produce corrupt output.

* iceberg: normalize empty JobType to canonical jobType in Execute events

When request.Job.JobType is empty, status events and completion messages
were emitted with a blank job type. Derive a canonical value early and
use it consistently in all outbound events.

* iceberg: log warning on unexpected config value types in read helpers

readStringConfig and readInt64Config now log a V(1) warning when they
encounter an unhandled ConfigValue kind, aiding debugging of unexpected
config types that silently fall back to defaults.

* worker: add iceberg_maintenance to default plugin worker job types

Workers using the default job types list didn't advertise the
iceberg_maintenance handler despite the handler and canonical name
being registered. Add it so workers pick up the handler by default.

* iceberg: use defer and detached context for compaction artifact cleanup

The cleanup closure used the job context which could already be canceled,
and was not called on ctx.Done() early exits. Switch to a deferred
cleanup with a detached context (30s timeout) so artifact deletion
completes on all exit paths including context cancellation.

* iceberg: use proportional jitter in commitWithRetry backoff

Fixed 25ms max jitter becomes insignificant at higher retry attempts.
Use 0-20% of the current backoff value instead so jitter scales with
the exponential delay.

* iceberg: add malformed filename cases to extractMetadataVersion test

Cover edge cases like "invalid.metadata.json", "metadata.json", "",
and "v.metadata.json" to ensure the function returns 0 for unparseable
inputs.

* iceberg: fail compaction on manifest read errors and skip delete manifests

Previously, unreadable manifests were silently skipped during compaction,
which could drop live files from the entry set. Now manifest read/parse
errors are returned as fatal errors.

Also abort compaction when delete manifests exist since the compactor
does not apply deletes — carrying them through unchanged could produce
incorrect results.

* iceberg: use table-relative path for active metadata file in orphan scan

metadataFileName was stored as a basename (e.g. "v1.metadata.json") but
the orphan scanner matches against table-relative paths like
"metadata/v1.metadata.json". Prefix with "metadata/" so the active
metadata file is correctly recognized as referenced.

* iceberg: fix MetadataBuilderFromBase location to use metadata file path

The second argument to MetadataBuilderFromBase records the previous
metadata file in the metadata log. Using meta.Location() (the table
root) was incorrect — it must be the actual metadata file path so
old metadata files can be tracked and eventually cleaned up.

* iceberg: update metadataLocation and versionToken in xattr on commit

updateTableMetadataXattr was only updating metadataVersion,
modifiedAt, and fullMetadata but not metadataLocation or
versionToken. This left catalog state inconsistent after
maintenance commits — the metadataLocation still pointed to the
old metadata file and the versionToken was stale.

Add a newMetadataLocation parameter and regenerate the
versionToken on every commit, matching the S3 Tables handler
behavior.

* iceberg: group manifest entries by partition spec in rewriteManifests

rewriteManifests was writing all entries into a single manifest
using the table's current partition spec. For spec-evolved tables
where manifests reference different partition specs, this produces
an invalid manifest.

Group entries by the source manifest's PartitionSpecID and write
one merged manifest per spec, looking up each spec from the
table's PartitionSpecs list.

* iceberg: remove dead code loop for non-data manifests in compaction

The early abort guard at the top of compactDataFiles already ensures
no delete manifests are present. The loop that copied non-data
manifests into allManifests was unreachable dead code.

* iceberg: use JSON encoding in partitionKey for unambiguous grouping

partitionKey used fmt.Sprintf("%d=%v") joined by commas, which
produces ambiguous keys when partition values contain commas or '='.
Use json.Marshal for values and NUL byte as separator to eliminate
collisions.

* iceberg: precompute normalized reference set in removeOrphans

The orphan check was O(files × refs) because it normalized each
reference path inside the per-file loop. Precompute the normalized
set once for O(1) lookups per candidate file.

* iceberg: add artifact cleanup to rewriteManifests on commit failure

rewriteManifests writes merged manifests and a manifest list to
the filer before committing but did not clean them up on failure.
Add the same deferred cleanup pattern used by compactDataFiles:
track written artifacts and delete them if the commit does not
succeed.

* iceberg: pass isDeleteData=true in deleteFilerFile

deleteFilerFile called DoRemove with isDeleteData=false, which only
removed filer metadata and left chunk data behind on volume servers.
All other data-file deletion callers in the codebase pass true.

* iceberg: clean up test: remove unused snapID, simplify TestDetectWithFakeFiler

Remove unused snapID variable and eliminate the unnecessary second
fake filer + entry copy in TestDetectWithFakeFiler by capturing
the client from the first startFakeFiler call.

* fix: update TestWorkerDefaultJobTypes to expect 5 job types

The test expected 4 default job types but iceberg_maintenance was
added as a 5th default in a previous commit.

* iceberg: document client-side CAS TOCTOU limitation in updateTableMetadataXattr

Add a note explaining the race window where two workers can both
pass the version check and race at UpdateEntry. The proper fix
requires server-side precondition support on UpdateEntryRequest.

* iceberg: remove unused sender variable in TestFullExecuteFlow

* iceberg: abort compaction when multiple partition specs are present

The compactor writes all entries into a single manifest using the
current partition spec, which is invalid for spec-evolved tables.
Detect multiple PartitionSpecIDs and skip compaction until
per-spec compaction is implemented.

* iceberg: validate tablePath to prevent directory traversal

Sanitize the table_path parameter with path.Clean and verify it
matches the expected namespace/tableName prefix to prevent path
traversal attacks via crafted job parameters.

* iceberg: cap retry backoff at 5s and make it context-aware

The exponential backoff could grow unbounded and blocked on
time.Sleep ignoring context cancellation. Cap at 5s and use
a timer with select on ctx.Done so retries respect cancellation.

* iceberg: write manifest list with new snapshot identity in rewriteManifests

The manifest list was written with the old snapshot's ID and sequence
number, but the new snapshot created afterwards used a different
identity. Compute newSnapshotID and newSeqNum before writing
manifests and the manifest list so all artifacts are consistent.

* ec: also remove .vif file in removeEcVolumeFiles

removeEcVolumeFiles cleaned up .ecx, .ecj, and shard files but
not the .vif volume info file, leaving it orphaned. The .vif file
lives in the data directory alongside shard files.

The directory handling for index vs data files was already correct:
.ecx/.ecj are removed from IdxDirectory and shard files from
Directory, matching how NewEcVolume loads them.

Revert "ec: also remove .vif file in removeEcVolumeFiles"

This reverts commit acc82449e12a00115268a5652aef0d6c46d9f2dd.

* iceberg: skip orphan entries with nil Attributes instead of defaulting to epoch

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
making unknown-age entries appear ancient and eligible for deletion.
Skip these entries instead to avoid deleting files whose age cannot
be determined.

* iceberg: use unique metadata filenames to prevent concurrent write clobbering

Add timestamp nonce to metadata filenames (e.g. v3-1709766000.metadata.json)
so concurrent writers stage to distinct files. Update extractMetadataVersion
to strip the nonce suffix, and loadCurrentMetadata to read the actual filename
from the metadataLocation xattr field.

* iceberg: defer artifact tracking until data file builder succeeds

Move the writtenArtifacts append to after NewDataFileBuilder succeeds,
so a failed builder doesn't leave a stale entry for an already-deleted
file in the cleanup list.

* iceberg: use detached context for metadata file cleanup

Use context.WithTimeout(context.Background(), 10s) when deleting staged
metadata files after CAS failure, so cleanup runs even if the original
request context is canceled.

* test: update default job types count to include iceberg_maintenance

* iceberg: use parquet.EqualNodes for structural schema comparison

Replace String()-based schema comparison with parquet.EqualNodes which
correctly compares types, repetition levels, and logical types.

* iceberg: add nonce-suffixed filename cases to TestExtractMetadataVersion

* test: assert iceberg_maintenance is present in default job types

* iceberg: validate operations config early in Detect

Call parseOperations in Detect so typos in the operations config fail
fast before emitting proposals, matching the validation already done
in Execute.

* iceberg: detect chunked files in loadFileByIcebergPath

Return an explicit error when a file has chunks but no inline content,
rather than silently returning empty data. Data files uploaded via S3
are stored as chunks, so compaction would otherwise produce corrupt
merged files.

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 12:45:33 -08:00
Chris Lu
d89eb8267f s3: use url.PathUnescape for X-Amz-Copy-Source header (#8545)
* s3: use url.PathUnescape for X-Amz-Copy-Source header (#8544)

The X-Amz-Copy-Source header is a URL-encoded path, not a query string.
Using url.QueryUnescape incorrectly converts literal '+' characters to
spaces, which can cause object key mismatches during copy operations.

Switch to url.PathUnescape in CopyObjectHandler, CopyObjectPartHandler,
and pathToBucketObjectAndVersion to correctly handle special characters
like '!', '+', and other RFC 3986 sub-delimiters that S3 clients may
percent-encode (e.g. '!' as %21).

* s3: add path validation to CopyObjectPartHandler

CopyObjectPartHandler was missing the validateTableBucketObjectPath
checks that CopyObjectHandler has, allowing potential path traversal
in the source bucket/object of copy part requests.

* s3: fix case-sensitive HeadersRegexp for copy source routing

The HeadersRegexp for X-Amz-Copy-Source used `%2F` which only matched
uppercase hex encoding. RFC 3986 allows both `%2F` and `%2f`, so
clients sending lowercase percent-encoding would bypass the copy
handler and hit PutObjectHandler instead. Add (?i) flag for
case-insensitive matching.

Also add test coverage for the versionId branch in
pathToBucketObjectAndVersion and for lowercase %2f routing.
2026-03-07 11:10:02 -08:00
Chris Lu
3f946fc0c0 mount: make metadata cache rebuilds snapshot-consistent (#8531)
* filer: expose metadata events and list snapshots

* mount: invalidate hot directory caches

* mount: read hot directories directly from filer

* mount: add sequenced metadata cache applier

* mount: apply metadata responses through cache applier

* mount: replay snapshot-consistent directory builds

* mount: dedupe self metadata events

* mount: factor directory build cleanup

* mount: replace proto marshal dedup with composite key and ring buffer

The dedup logic was doing a full deterministic proto.Marshal on every
metadata event just to produce a dedup key. Replace with a cheap
composite string key (TsNs|Directory|OldName|NewName).

Also replace the sliding-window slice (which leaked the backing array
unboundedly) with a fixed-size ring buffer that reuses the same array.

* filer: remove mutex and proto.Clone from request-scoped MetadataEventSink

MetadataEventSink is created per-request and only accessed by the
goroutine handling the gRPC call. The mutex and double proto.Clone
(once in Record, once in Last) were unnecessary overhead on every
filer write operation. Store the pointer directly instead.

* mount: skip proto.Clone for caller-owned metadata events

Add ApplyMetadataResponseOwned that takes ownership of the response
without cloning. Local metadata events (mkdir, create, flush, etc.)
are freshly constructed and never shared, so the clone is unnecessary.

* filer: only populate MetadataEvent on successful DeleteEntry

Avoid calling eventSink.Last() on error paths where the sink may
contain a partial event from an intermediate child deletion during
recursive deletes.

* mount: avoid map allocation in collectDirectoryNotifications

Replace the map with a fixed-size array and linear dedup. There are
at most 3 directories to notify (old parent, new parent, new child
if directory), so a 3-element array avoids the heap allocation on
every metadata event.

* mount: fix potential deadlock in enqueueApplyRequest

Release applyStateMu before the blocking channel send. Previously,
if the channel was full (cap 128), the send would block while holding
the mutex, preventing Shutdown from acquiring it to set applyClosed.

* mount: restore signature-based self-event filtering as fast path

Re-add the signature check that was removed when content-based dedup
was introduced. Checking signatures is O(1) on a small slice and
avoids enqueuing and processing events that originated from this
mount instance. The content-based dedup remains as a fallback.

* filer: send snapshotTsNs only in first ListEntries response

The snapshot timestamp is identical for every entry in a single
ListEntries stream. Sending it in every response message wastes
wire bandwidth for large directories. The client already reads
it only from the first response.

* mount: exit read-through mode after successful full directory listing

MarkDirectoryRefreshed was defined but never called, so directories
that entered read-through mode (hot invalidation threshold) stayed
there permanently, hitting the filer on every readdir even when cold.
Call it after a complete read-through listing finishes.

* mount: include event shape and full paths in dedup key

The previous dedup key only used Names, which could collapse distinct
rename targets. Include the event shape (C/D/U/R), source directory,
new parent path, and both entry names so structurally different events
are never treated as duplicates.

* mount: drain pending requests on shutdown in runApplyLoop

After receiving the shutdown sentinel, drain any remaining requests
from applyCh non-blockingly and signal each with errMetaCacheClosed
so callers waiting on req.done are released.

* mount: include IsDirectory in synthetic delete events

metadataDeleteEvent now accepts an isDirectory parameter so the
applier can distinguish directory deletes from file deletes. Rmdir
passes true, Unlink passes false.

* mount: fall back to synthetic event when MetadataEvent is nil

In mknod and mkdir, if the filer response omits MetadataEvent (e.g.
older filer without the field), synthesize an equivalent local
metadata event so the cache is always updated.

* mount: make Flush metadata apply best-effort after successful commit

After filer_pb.CreateEntryWithResponse succeeds, the entry is
persisted. Don't fail the Flush syscall if the local metadata cache
apply fails — log and invalidate the directory cache instead.
Also fall back to a synthetic event when MetadataEvent is nil.

* mount: make Rename metadata apply best-effort

The rename has already succeeded on the filer by the time we apply
the local metadata event. Log failures instead of returning errors
that would be dropped by the caller anyway.

* mount: make saveEntry metadata apply best-effort with fallback

After UpdateEntryWithResponse succeeds, treat local metadata apply
as non-fatal. Log and invalidate the directory cache on failure.
Also fall back to a synthetic event when MetadataEvent is nil.

* filer_pb: preserve snapshotTsNs on error in ReadDirAllEntriesWithSnapshot

Return the snapshot timestamp even when the first page fails, so
callers receive the snapshot boundary when partial data was received.

* filer: send snapshot token for empty directory listings

When no entries are streamed, send a final ListEntriesResponse with
only SnapshotTsNs so clients always receive the snapshot boundary.

* mount: distinguish not-found vs transient errors in lookupEntry

Return fuse.EIO for non-not-found filer errors instead of
unconditionally returning ENOENT, so transient failures don't
masquerade as missing entries.

* mount: make CacheRemoteObject metadata apply best-effort

The file content has already been cached successfully. Don't fail
the read if the local metadata cache update fails.

* mount: use consistent snapshot for readdir in direct mode

Capture the SnapshotTsNs from the first loadDirectoryEntriesDirect
call and store it on the DirectoryHandle. Subsequent batch loads
pass this stored timestamp so all batches use the same snapshot.

Also export DoSeaweedListWithSnapshot so mount can use it directly
with snapshot passthrough.

* filer_pb: fix test fake to send SnapshotTsNs only on first response

Match the server behavior: only the first ListEntriesResponse in a
page carries the snapshot timestamp, subsequent entries leave it zero.

* Fix nil pointer dereference in ListEntries stream consumers

Remove the empty-directory snapshot-only response from ListEntries
that sent a ListEntriesResponse with Entry==nil, which crashed every
raw stream consumer that assumed resp.Entry is always non-nil.

Also add defensive nil checks for resp.Entry in all raw ListEntries
stream consumers across: S3 listing, broker topic lookup, broker
topic config, admin dashboard, topic retention, hybrid message
scanner, Kafka integration, and consumer offset storage.

* Add nil guards for resp.Entry in remaining ListEntries stream consumers

Covers: S3 object lock check, MQ management dashboard (version/
partition/offset loops), and topic retention version loop.

* Make applyLocalMetadataEvent best-effort in Link and Symlink

The filer operations already succeeded; failing the syscall because
the local cache apply failed is wrong. Log a warning and invalidate
the parent directory cache instead.

* Make applyLocalMetadataEvent best-effort in Mkdir/Rmdir/Mknod/Unlink

The filer RPC already committed; don't fail the syscall when the
local metadata cache apply fails. Log a warning and invalidate the
parent directory cache to force a re-fetch on next access.

* flushFileMetadata: add nil-fallback for metadata event and best-effort apply

Synthesize a metadata event when resp.GetMetadataEvent() is nil
(matching doFlush), and make the apply best-effort with cache
invalidation on failure.

* Prevent double-invocation of cleanupBuild in doEnsureVisited

Add a cleanupDone guard so the deferred cleanup and inline error-path
cleanup don't both call DeleteFolderChildren/AbortDirectoryBuild.

* Fix comment: signature check is O(n) not O(1)

* Prevent deferred cleanup after successful CompleteDirectoryBuild

Set cleanupDone before returning from the success path so the
deferred context-cancellation check cannot undo a published build.

* Invalidate parent directory caches on rename metadata apply failure

When applyLocalMetadataEvent fails during rename, invalidate the
source and destination parent directory caches so subsequent accesses
trigger a re-fetch from the filer.

* Add event nil-fallback and cache invalidation to Link and Symlink

Synthesize metadata events when the server doesn't return one, and
invalidate parent directory caches on apply failure.

* Match requested partition when scanning partition directories

Parse the partition range format (NNNN-NNNN) and match against the
requested partition parameter instead of using the first directory.

* Preserve snapshot timestamp across empty directory listings

Initialize actualSnapshotTsNs from the caller-requested value so it
isn't lost when the server returns no entries. Re-add the server-side
snapshot-only response for empty directories (all raw stream consumers
now have nil guards for Entry).

* Fix CreateEntry error wrapping to support errors.Is/errors.As

Use errors.New + %w instead of %v for resp.Error so callers can
unwrap the underlying error.

* Fix object lock pagination: only advance on non-nil entries

Move entriesReceived inside the nil check so nil entries don't
cause repeated ListEntries calls with the same lastFileName.

* Guard Attributes nil check before accessing Mtime in MQ management

* Do not send nil-Entry response for empty directory listings

The snapshot-only ListEntriesResponse (with Entry == nil) for empty
directories breaks consumers that treat any received response as an
entry (Java FilerClient, S3 listing). The Go client-side
DoSeaweedListWithSnapshot already preserves the caller-requested
snapshot via actualSnapshotTsNs initialization, so the server-side
send is unnecessary.

* Fix review findings: subscriber dedup, invalidation normalization, nil guards, shutdown race

- Remove self-signature early-return in processEventFn so all events
  flow through the applier (directory-build buffering sees self-originated
  events that arrive after a snapshot)
- Normalize NewParentPath in collectEntryInvalidations to avoid duplicate
  invalidations when NewParentPath is empty (same-directory update)
- Guard resp.Entry.Attributes for nil in admin_server.go and
  topic_retention.go to prevent panics on entries without attributes
- Fix enqueueApplyRequest race with shutdown by using select on both
  applyCh and applyDone, preventing sends after the apply loop exits
- Add cleanupDone check to deferred cleanup in meta_cache_init.go for
  clarity alongside the existing guard in cleanupBuild
- Add empty directory test case for snapshot consistency

* Propagate authoritative metadata event from CacheRemoteObjectToLocalCluster and generate client-side snapshot for empty directories

- Add metadata_event field to CacheRemoteObjectToLocalClusterResponse
  proto so the filer-emitted event is available to callers
- Use WithMetadataEventSink in the server handler to capture the event
  from NotifyUpdateEvent and return it on the response
- Update filehandle_read.go to prefer the RPC's metadata event over
  a locally fabricated one, falling back to metadataUpdateEvent when
  the server doesn't provide one (e.g., older filers)
- Generate a client-side snapshot cutoff in DoSeaweedListWithSnapshot
  when the server sends no snapshot (empty directory), so callers like
  CompleteDirectoryBuild get a meaningful boundary for filtering
  buffered events

* Skip directory notifications for dirs being built to prevent mid-build cache wipe

When a metadata event is buffered during a directory build,
applyMetadataSideEffects was still firing noteDirectoryUpdate for the
building directory. If the directory accumulated enough updates to
become "hot", markDirectoryReadThrough would call DeleteFolderChildren,
wiping entries that EnsureVisited had already inserted. The build would
then complete and mark the directory cached with incomplete data.

Fix by using applyMetadataSideEffectsSkippingBuildingDirs for buffered
events, which suppresses directory notifications for dirs currently in
buildingDirs while still applying entry invalidations.

* Add test for directory notification suppression during active build

TestDirectoryNotificationsSuppressedDuringBuild verifies that metadata
events targeting a directory under active EnsureVisited build do NOT
fire onDirectoryUpdate for that directory. In production, this prevents
markDirectoryReadThrough from calling DeleteFolderChildren mid-build,
which would wipe entries already inserted by the listing.

The test inserts an entry during a build, sends multiple metadata events
for the building directory, asserts no notifications fired for it,
verifies the entry survives, and confirms buffered events are replayed
after CompleteDirectoryBuild.

* Fix create invalidations, build guard, event shape, context, and snapshot error path

- collectEntryInvalidations: invalidate FUSE kernel cache on pure
  create events (OldEntry==nil && NewEntry!=nil), not just updates
  and deletes
- completeDirectoryBuildNow: only call markCachedFn when an active
  build existed (state != nil), preventing an unpopulated directory
  from being marked as cached
- Add metadataCreateEvent helper that produces a create-shaped event
  (NewEntry only, no OldEntry) and use it in mkdir, mknod, symlink,
  and hardlink create fallback paths instead of metadataUpdateEvent
  which incorrectly set both OldEntry and NewEntry
- applyMetadataResponseEnqueue: use context.Background() for the
  queued mutation so a cancelled caller context cannot abort the
  apply loop mid-write
- DoSeaweedListWithSnapshot: move snapshot initialization before
  ListEntries call so the error path returns the preserved snapshot
  instead of 0

* Fix review findings: test loop, cache race, context safety, snapshot consistency

- Fix build test loop starting at i=1 instead of i=0, missing new-0.txt verification
- Re-check IsDirectoryCached after cache miss to avoid ENOENT race with markDirectoryReadThrough
- Use context.Background() in enqueueAndWait so caller cancellation can't abort build/complete mid-way
- Pass dh.snapshotTsNs in skip-batch loadDirectoryEntriesDirect for snapshot consistency
- Prefer resp.MetadataEvent over fallback in Unlink event derivation
- Add comment on MetadataEventSink.Record single-event assumption

* Fix empty-directory snapshot clock skew and build cancellation race

Empty-directory snapshot: Remove client-side time.Now() synthesis when
the server returns no entries. Instead return snapshotTsNs=0, and in
completeDirectoryBuildNow replay ALL buffered events when snapshot is 0.
This eliminates the clock-skew bug where a client ahead of the filer
would filter out legitimate post-list events.

Build cancellation: Use context.Background() for BeginDirectoryBuild
and CompleteDirectoryBuild calls in doEnsureVisited, so errgroup
cancellation doesn't cause enqueueAndWait to return early and trigger
cleanupBuild while the operation is still queued.

* Add tests for empty-directory build replay and cancellation resilience

TestEmptyDirectoryBuildReplaysAllBufferedEvents: verifies that when
CompleteDirectoryBuild receives snapshotTsNs=0 (empty directory, no
server snapshot), ALL buffered events are replayed regardless of their
TsNs values — no clock-skew-sensitive filtering occurs.

TestBuildCompletionSurvivesCallerCancellation: verifies that once
CompleteDirectoryBuild is enqueued, a cancelled caller context does not
prevent the build from completing. The apply loop runs with
context.Background(), so the directory becomes cached and buffered
events are replayed even when the caller gives up waiting.

* Fix directory subtree cleanup, Link rollback, test robustness

- applyMetadataResponseLocked: when a directory entry is deleted or
  moved, call DeleteFolderChildren on the old path so cached descendants
  don't leak as stale entries.

- Link: save original HardLinkId/Counter before mutation. If
  CreateEntryWithResponse fails after the source was already updated,
  rollback the source entry to its original state via UpdateEntry.

- TestBuildCompletionSurvivesCallerCancellation: replace fixed
  time.Sleep(50ms) with a deadline-based poll that checks
  IsDirectoryCached in a loop, failing only after 2s timeout.

- TestReadDirAllEntriesWithSnapshotEmptyDirectory: assert that
  ListEntries was actually invoked on the mock client so the test
  exercises the RPC path.

- newMetadataEvent: add early return when both oldEntry and newEntry are
  nil to avoid emitting events with empty Directory.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-07 09:19:40 -08:00