Commit Graph

1035 Commits

Author SHA1 Message Date
Chris Lu
d1823d3784 fix(s3): include static identities in listing operations (#8903)
* fix(s3): include static identities in listing operations

Static identities loaded from -s3.config file were only stored in the
S3 API server's in-memory state. Listing operations (s3.configure shell
command, aws iam list-users) queried the credential manager which only
returned dynamic identities from the backend store.

Register static identities with the credential manager after loading
so they are included in LoadConfiguration and ListUsers results, and
filtered out before SaveConfiguration to avoid persisting them to the
dynamic store.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8896

* fix: avoid mutating caller's config and defensive copies

- SaveConfiguration: use shallow struct copy instead of mutating the
  caller's config.Identities field
- SetStaticIdentities: skip nil entries to avoid panics
- GetStaticIdentities: defensively copy PolicyNames slice to avoid
  aliasing the original

* fix: filter nil static identities and sync on config reload

- SetStaticIdentities: filter nil entries from the stored slice (not
  just from staticNames) to prevent panics in LoadConfiguration/ListUsers
- Extract updateCredentialManagerStaticIdentities helper and call it
  from both startup and the grace.OnReload handler so the credential
  manager's static snapshot stays current after config file reloads

* fix: add mutex for static identity fields and fix ListUsers for store callers

- Add sync.RWMutex to protect staticIdentities/staticNames against
  concurrent reads during config reload
- Revert CredentialManager.ListUsers to return only store users, since
  internal callers (e.g. DeletePolicy) look up each user in the store
  and fail on non-existent static entries
- Merge static usernames in the filer gRPC ListUsers handler instead,
  via the new GetStaticUsernames method
- Fix CI: TestIAMPolicyManagement/managed_policy_crud_lifecycle was
  failing because DeletePolicy iterated static users that don't exist
  in the store

* fix: show static identities in admin UI and weed shell

The admin UI and weed shell s3.configure command query the filer's
credential manager via gRPC, which is a separate instance from the S3
server's credential manager. Static identities were only registered
on the S3 server's credential manager, so they never appeared in the
filer's responses.

- Add CredentialManager.LoadS3ConfigFile to parse a static S3 config
  file and register its identities
- Add FilerOptions.s3ConfigFile so the filer can load the same static
  config that the S3 server uses
- Wire s3ConfigFile through in weed mini and weed server modes
- Merge static usernames in filer gRPC ListUsers handler
- Add CredentialManager.GetStaticUsernames helper
- Add sync.RWMutex to protect concurrent access to static identity
  fields
- Avoid importing weed/filer from weed/credential (which pulled in
  filer store init() registrations and broke test isolation)
- Add docker/compose/s3_static_users_example.json

* fix(admin): make static users read-only in admin UI

Static users loaded from the -s3.config file should not be editable
or deletable through the admin UI since they are managed via the
config file.

- Add IsStatic field to ObjectStoreUser, set from credential manager
- Hide edit, delete, and access key buttons for static users in the
  users table template
- Show a "static" badge next to static user names
- Return 403 Forbidden from UpdateUser and DeleteUser API handlers
  when the target user is a static identity

* fix(admin): show details for static users

GetObjectStoreUserDetails called credentialManager.GetUser which only
queries the dynamic store. For static users this returned
ErrUserNotFound. Fall back to GetStaticIdentity when the store lookup
fails.

* fix(admin): load static S3 identities in admin server

The admin server has its own credential manager (gRPC store) which is
a separate instance from the S3 server's and filer's. It had no static
identity data, so IsStaticIdentity returned false (edit/delete buttons
shown) and GetStaticIdentity returned nil (details page failed).

Pass the -s3.config file path through to the admin server and call
LoadS3ConfigFile on its credential manager, matching the approach
used for the filer.

* fix: use protobuf is_static field instead of passing config file path

The previous approach passed -s3.config file path to every component
(filer, admin). This is wrong because the admin server should not need
to know about S3 config files.

Instead, add an is_static field to the Identity protobuf message.
The field is set when static identities are serialized (in
GetStaticIdentities and LoadS3ConfigFile). Any gRPC client that loads
configuration via GetConfiguration automatically sees which identities
are static, without needing the config file.

- Add is_static field (tag 8) to iam_pb.Identity proto message
- Set IsStatic=true in GetStaticIdentities and LoadS3ConfigFile
- Admin GetObjectStoreUsers reads identity.IsStatic from proto
- Admin IsStaticUser helper loads config via gRPC to check the flag
- Filer GetUser gRPC handler falls back to GetStaticIdentity
- Remove s3ConfigFile from AdminOptions and NewAdminServer signature
2026-04-03 20:01:28 -07:00
Chris Lu
0798b274dd feat(s3): add concurrent chunk prefetch for large file downloads (#8917)
* feat(s3): add concurrent chunk prefetch for large file downloads

Add a pipe-based prefetch pipeline that overlaps chunk fetching with
response writing during S3 GetObject, SSE downloads, and filer proxy.

While chunk N streams to the HTTP response, fetch goroutines for the
next K chunks establish HTTP connections to volume servers ahead of
time, eliminating the RTT gap between sequential chunk fetches.

Uses io.Pipe for minimal memory overhead (~1MB per download regardless
of chunk size, vs buffering entire chunks). Also increases the
streaming read buffer from 64KB to 256KB to reduce syscall overhead.

Benchmark results (64KB chunks, prefetch=4):
- 0ms latency:  1058 → 2362 MB/s (2.2× faster)
- 5ms latency:  11.0 → 41.7 MB/s (3.8× faster)
- 10ms latency: 5.9  → 23.3 MB/s (4.0× faster)
- 20ms latency: 3.1  → 12.1 MB/s (3.9× faster)

* fix: address review feedback for prefetch pipeline

- Fix data race: use *chunkPipeResult (pointer) on channel to avoid
  copying struct while fetch goroutines write to it. Confirmed clean
  with -race detector.
- Remove concurrent map write: retryWithCacheInvalidation no longer
  updates fileId2Url map. Producer only reads it; consumer never writes.
- Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure.
- Add local cancellable context so consumer errors (client disconnect)
  immediately stop the producer and all in-flight fetch goroutines.

* fix(test): remove dead code and add Range header support in test server

- Remove unused allData variable in makeChunksAndServer
- Add Range header handling to createTestServer for partial chunk
  read coverage (206 Partial Content, 416 Range Not Satisfiable)

* fix: correct retry condition and goroutine leak in prefetch pipeline

- Fix retry condition: use result.fetchErr/result.written instead of
  copied to decide cache-invalidation retry. The old condition wrongly
  triggered retry when the fetch succeeded but the response writer
  failed on the first write (copied==0 despite fetcher having data).
  Now matches the sequential path (stream.go:197) which checks whether
  the fetcher itself wrote zero bytes.

- Fix goroutine leak: when the producer's send to the results channel
  is interrupted by context cancellation, the fetch goroutine was
  already launched but the result was never sent to the channel. The
  drain loop couldn't handle it. Now waits on result.done before
  returning so every fetch goroutine is properly awaited.
2026-04-03 19:57:30 -07:00
Chris Lu
3efe88c718 feat(s3): store and return checksum headers for additional checksum algorithms (#8914)
* feat(s3): store and return checksum headers for additional checksum algorithms

When clients upload with --checksum-algorithm (SHA256, CRC32, etc.),
SeaweedFS validated the checksum but discarded it. The checksum was
never stored in metadata or returned in PUT/HEAD/GET responses.

Now the checksum is computed alongside MD5 during upload, stored in
entry extended attributes, and returned as the appropriate
x-amz-checksum-* header in all responses.

Fixes #8911

* fix(s3): address review feedback and CI failures for checksum support

- Gate GET/HEAD checksum response headers on x-amz-checksum-mode: ENABLED
  per AWS S3 spec, fixing FlexibleChecksumError on ranged GETs and
  multipart copies
- Verify computed checksum against client-provided header value for
  non-chunked uploads, returning BadDigest on mismatch
- Add nil check for getCheckSumWriter to prevent panic
- Handle comma-separated values in X-Amz-Trailer header
- Use ordered slice instead of map for deterministic checksum header
  selection; extract shared mappings into package-level vars

* fix(s3): skip checksum header for ranged GET responses

The stored checksum covers the full object. Returning it for ranged
(partial) responses causes SDK checksum validation failures because the
SDK validates the header value against the partial content received.

Skip emitting x-amz-checksum-* headers when a Range request header is
present, fixing PyArrow large file read failures.

* fix(s3): reject unsupported checksum algorithm with 400

detectRequestedChecksumAlgorithm now returns an error code when
x-amz-sdk-checksum-algorithm or x-amz-checksum-algorithm contains
an unsupported value, instead of silently ignoring it.

* feat(s3): compute composite checksum for multipart uploads

Store the checksum algorithm during CreateMultipartUpload, then during
CompleteMultipartUpload compute a composite checksum from per-part
checksums following the AWS S3 spec: concatenate raw per-part checksums,
hash with the same algorithm, format as "base64-N" where N is part count.

The composite checksum is persisted on the final object entry and
returned in HEAD/GET responses (gated on x-amz-checksum-mode: ENABLED).

Reuses existing per-part checksum storage from putToFiler and the
getCheckSumWriter/checksumHeaders infrastructure.

* fix(s3): validate checksum algorithm in CreateMultipartUpload, error on missing part checksums

- Move detectRequestedChecksumAlgorithm call before mkdir callback so
  an unsupported algorithm returns 400 before the upload is created
- Change computeCompositeChecksum to return an error when a part is
  missing its checksum (the upload was initiated with a checksum
  algorithm, so all parts must have checksums)
- Propagate the error as ErrInvalidPart in CompleteMultipartUpload

* fix(s3): return checksum header in CompleteMultipartUpload response, validate per-part algorithm

- Add ChecksumHeaderName/ChecksumValue fields to CompleteMultipartUploadResult
  and set the x-amz-checksum-* HTTP response header in the handler, matching
  the AWS S3 CompleteMultipartUpload response spec
- Validate that each part's stored checksum algorithm matches the upload's
  expected algorithm before assembling the composite checksum; return an
  error if a part was uploaded with a different algorithm
2026-04-03 18:37:54 -07:00
Chris Lu
995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
2026-04-03 16:04:27 -07:00
Chris Lu
8fad85aed7 feat(s3): support WEED_S3_SSE_KEY env var for SSE-S3 KEK (#8904)
* feat(s3): support WEED_S3_SSE_KEY env var for SSE-S3 KEK

Add support for providing the SSE-S3 Key Encryption Key (KEK) via the
WEED_S3_SSE_KEY environment variable (hex-encoded 256-bit key). This
avoids storing the master key in plaintext on the filer at /etc/s3/sse_kek.

Key source priority:
1. WEED_S3_SSE_KEY environment variable (recommended)
2. Existing filer KEK at /etc/s3/sse_kek (backward compatible)
3. Auto-generate and save to filer (deprecated for new deployments)

Existing deployments with a filer-stored KEK continue to work unchanged.
A deprecation warning is logged when auto-generating a new filer KEK.

* refactor(s3): derive KEK from any string via HKDF instead of requiring hex

Accept any secret string in WEED_S3_SSE_KEY and derive a 256-bit key
using HKDF-SHA256 instead of requiring a hex-encoded key. This is
simpler for users — no need to generate hex, just set a passphrase.

* feat(s3): add WEED_S3_SSE_KEK and WEED_S3_SSE_KEY env vars for KEK

Two env vars for providing the SSE-S3 Key Encryption Key:

- WEED_S3_SSE_KEK: hex-encoded, same format as /etc/s3/sse_kek.
  If the filer file also exists, they must match.
- WEED_S3_SSE_KEY: any string, 256-bit key derived via HKDF-SHA256.
  Refuses to start if /etc/s3/sse_kek exists (must delete first).

Only one may be set. Existing filer-stored KEKs continue to work.
Auto-generating and storing new KEKs on filer is deprecated.

* fix(s3): stop auto-generating KEK, fail only when SSE-S3 is used

Instead of auto-generating a KEK and storing it on the filer when no
key source is configured, simply leave SSE-S3 disabled. Encrypt and
decrypt operations return a clear error directing the user to set
WEED_S3_SSE_KEK or WEED_S3_SSE_KEY.

* refactor(s3): move SSE-S3 KEK config to security.toml

Move KEK configuration from standalone env vars to security.toml's new
[sse_s3] section, following the same pattern as JWT keys and TLS certs.

  [sse_s3]
  kek = ""   # hex-encoded 256-bit key (same format as /etc/s3/sse_kek)
  key = ""   # any string, HKDF-derived

Viper's WEED_ prefix auto-mapping provides env var support:
WEED_SSE_S3_KEK and WEED_SSE_S3_KEY.

All existing behavior is preserved: filer KEK fallback, mismatch
detection, and HKDF derivation.

* refactor(s3): rename SSE-S3 config keys to s3.sse.kek / s3.sse.key

Use [s3.sse] section in security.toml, matching the existing naming
convention (e.g. [s3.*]). Env vars: WEED_S3_SSE_KEK, WEED_S3_SSE_KEY.

* fix(s3): address code review findings for SSE-S3 KEK

- Don't hold mutex during filer retry loop (up to 20s of sleep).
  Lock only to write filerClient and superKey.
- Remove dead generateAndSaveSuperKeyToFiler and unused constants.
- Return error from deriveKeyFromSecret instead of ignoring it.
- Fix outdated doc comment on InitializeWithFiler.
- Use t.Setenv in tests instead of manual os.Setenv/Unsetenv.

* fix(s3): don't block startup on filer errors when KEK is configured

- When s3.sse.kek is set, a temporarily unreachable filer no longer
  prevents startup. The filer consistency check becomes best-effort
  with a warning.
- Same treatment for s3.sse.key: filer unreachable logs a warning
  instead of failing.
- Rewrite error messages to suggest migration instead of file deletion,
  avoiding the risk of orphaning encrypted data.

Finding 3 (restore auto-generation) intentionally skipped — auto-gen
was removed by design to avoid storing plaintext KEK on filer.

* fix(test): set WEED_S3_SSE_KEY in SSE integration test server startup

SSE-S3 no longer auto-generates a KEK, so integration tests must
provide one. Set WEED_S3_SSE_KEY=test-sse-s3-key in all weed mini
invocations in the test Makefile.
2026-04-03 13:01:21 -07:00
Chris Lu
059bee683f feat(s3): add STS GetFederationToken support (#8891)
* feat(s3): add STS GetFederationToken support

Implement the AWS STS GetFederationToken API, which allows long-term IAM
users to obtain temporary credentials scoped down by an optional inline
session policy. This is useful for server-side applications that mint
per-user temporary credentials.

Key behaviors:
- Requires SigV4 authentication from a long-term IAM user
- Rejects calls from temporary credentials (session tokens)
- Name parameter (2-64 chars) identifies the federated user
- DurationSeconds supports 900-129600 (15 min to 36 hours, default 12h)
- Optional inline session policy for permission scoping
- Caller's attached policies are embedded in the JWT token
- Returns federated user ARN: arn:aws:sts::<account>:federated-user/<Name>

No performance impact on the S3 hot path — credential vending is a
separate control-plane operation, and all policy data is embedded in
the stateless JWT token.

* fix(s3): address GetFederationToken PR review feedback

- Fix Name validation: max 32 chars (not 64) per AWS spec, add regex
  validation for [\w+=,.@-]+ character whitelist
- Refactor parseDurationSeconds into parseDurationSecondsWithBounds to
  eliminate duplicated duration parsing logic
- Add sts:GetFederationToken permission check via VerifyActionPermission
  mirroring the AssumeRole authorization pattern
- Change GetPoliciesForUser to return ([]string, error) so callers fail
  closed on policy-resolution failures instead of silently returning nil
- Move temporary-credentials rejection before SigV4 verification for
  early rejection and proper test coverage
- Update tests: verify specific error message for temp cred rejection,
  add regex validation test cases (spaces, slashes rejected)

* refactor(s3): use sts.Action* constants instead of hard-coded strings

Replace hard-coded "sts:AssumeRole" and "sts:GetFederationToken" strings
in VerifyActionPermission calls with sts.ActionAssumeRole and
sts.ActionGetFederationToken package constants.

* fix(s3): pass through sts: prefix in action resolver and merge policies

Two fixes:

1. mapBaseActionToS3Format now passes through "sts:" prefix alongside
   "s3:" and "iam:", preventing sts:GetFederationToken from being
   rewritten to s3:sts:GetFederationToken in VerifyActionPermission.
   This also fixes the existing sts:AssumeRole permission checks.

2. GetFederationToken policy embedding now merges identity.PolicyNames
   (from SigV4 identity) with policies from the IAM manager (which may
   include group-attached policies), deduplicated via a map. Previously
   the IAM manager lookup was skipped when identity.PolicyNames was
   non-empty, causing group policies to be omitted from the token.

* test(s3): add integration tests for sts: action passthrough and policy merge

Action resolver tests:
- TestMapBaseActionToS3Format_ServicePrefixPassthrough: verifies s3:, iam:,
  and sts: prefixed actions pass through unchanged while coarse actions
  (Read, Write) are mapped to S3 format
- TestResolveS3Action_STSActionsPassthrough: verifies sts:AssumeRole,
  sts:GetFederationToken, sts:GetCallerIdentity pass through ResolveS3Action
  unchanged with both nil and real HTTP requests

Policy merge tests:
- TestGetFederationToken_GetPoliciesForUser: tests IAMManager.GetPoliciesForUser
  with no user store (error), missing user, user with policies, user without
- TestGetFederationToken_PolicyMergeAndDedup: tests that identity.PolicyNames
  and IAM-manager-resolved policies are merged and deduplicated (SharedPolicy
  appears in both sources, result has 3 unique policies)
- TestGetFederationToken_PolicyMergeNoManager: tests that when IAM manager is
  unavailable, identity.PolicyNames alone are embedded

* test(s3): add end-to-end integration tests for GetFederationToken

Add integration tests that call GetFederationToken using real AWS SigV4
signed HTTP requests against a running SeaweedFS instance, following the
existing pattern in test/s3/iam/s3_sts_assume_role_test.go.

Tests:
- TestSTSGetFederationTokenValidation: missing name, name too short/long,
  invalid characters, duration too short/long, malformed policy, anonymous
  rejection (7 subtests)
- TestSTSGetFederationTokenRejectTemporaryCredentials: obtains temp creds
  via AssumeRole then verifies GetFederationToken rejects them
- TestSTSGetFederationTokenSuccess: basic success, custom 1h duration,
  36h max duration with expiration time verification
- TestSTSGetFederationTokenWithSessionPolicy: creates a bucket, obtains
  federated creds with GetObject-only session policy, verifies GetObject
  succeeds and PutObject is denied using the AWS SDK S3 client
2026-04-02 17:37:05 -07:00
Chris Lu
a4b896a224 fix(s3): skip directories before marker in ListObjectVersions pagination (#8890)
* fix(s3): skip directories before marker in ListObjectVersions pagination

ListObjectVersions was re-traversing the entire directory tree from the
beginning on every paginated request, only skipping entries at the leaf
level. For buckets with millions of objects in deep hierarchies, this
caused exponentially slower responses as pagination progressed.

Two optimizations:
1. Use keyMarker to compute a startFrom position at each directory level,
   skipping directly to the relevant entry instead of scanning from the
   beginning (mirroring how ListObjects uses marker descent).
2. Skip recursing into subdirectories whose keys are entirely before the
   keyMarker.

Changes per-page cost from O(entries_before_marker) to O(tree_depth).

* test(s3): add integration test for deep-hierarchy version listing pagination

Adds TestVersioningPaginationDeepDirectoryHierarchy which creates objects
across 20 subdirectories at depth 6 (mimicking Veeam 365 backup layout)
and paginates through them with small maxKeys. Verifies correctness
(no duplicates, sorted order, all objects found) and checks that later
pages don't take dramatically longer than earlier ones — the symptom
of the pre-fix re-traversal bug. Also tests delimiter+pagination
interaction across subdirectories.

* test(s3): strengthen deep-hierarchy pagination assertions

- Replace timing warning (t.Logf) with a failing assertion (t.Errorf)
  so pagination regressions actually fail the test.
- Replace generic count/uniqueness/sort checks on CommonPrefixes with
  exact equality against the expected prefix slice, catching wrong-but-
  sorted results.

* test(s3): use allKeys for exact assertion in deep-hierarchy pagination test

Wire the allKeys slice (previously unused dead code) into the version
listing assertion, replacing generic count/uniqueness/sort checks with
an exact equality comparison against the keys that were created.
2026-04-02 15:59:52 -07:00
Chris Lu
7c59b639c9 STS: add GetCallerIdentity support (#8893)
* STS: add GetCallerIdentity support

Implement the AWS STS GetCallerIdentity action, which returns the
ARN, account ID, and user ID of the caller based on SigV4 authentication.
This is commonly used by AWS SDKs and CLI tools (e.g. `aws sts get-caller-identity`)
to verify credentials and determine the authenticated identity.

* test: remove trivial GetCallerIdentity tests

Remove the XML unmarshal test (we don't consume this response as input)
and the routing constant test (just asserts a literal equals itself).

* fix: route GetCallerIdentity through STS in UnifiedPostHandler and use stable UserId

- UnifiedPostHandler only dispatched actions starting with "AssumeRole" to STS,
  so GetCallerIdentity in a POST body would fall through to the IAM path and
  get AccessDenied for non-admin users. Add explicit check for GetCallerIdentity.
- Use identity.Name as UserId instead of credential.AccessKey, which is a
  transient value and incorrect for STS assumed-role callers.
2026-04-02 15:59:09 -07:00
Chris Lu
ab4e52ae2f fix(s3): use recursive delete for .versions directory cleanup (#8887)
* fix(s3): use recursive delete for .versions directory cleanup

When only delete markers remain in a .versions directory,
updateLatestVersionAfterDeletion tried to delete it non-recursively,
which failed with "fail to delete non-empty folder" because the delete
marker entries were still present. Use recursive deletion so the
directory and its remaining delete marker entries are cleaned up together.

* fix(s3): guard .versions directory deletion against truncated listings

When the version listing is truncated (>1000 entries), content versions
may exist beyond the first page. Skip the recursive directory deletion
in this case to prevent data loss.

* fix(s3): preserve delete markers in .versions directory

Delete markers must be preserved per S3 semantics — they are only
removed by an explicit DELETE with versionId. The previous fix would
recursively delete the entire .versions directory (including delete
markers) when no content versions were found.

Now the logic distinguishes three cases:
1. Content versions exist → update latest version metadata
2. Only delete markers remain (or listing truncated) → keep directory
3. Truly empty → safe to delete directory (non-recursive)
2026-04-02 11:55:13 -07:00
Chris Lu
efbed39e25 S3: map canned ACL to file permissions and add configurable default file mode (#8886)
* S3: map canned ACL to file permissions and add configurable default file mode

S3 uploads were hardcoded to 0660 regardless of ACL headers. Now the
X-Amz-Acl header maps to Unix file permissions per-object:
- public-read, authenticated-read, bucket-owner-read → 0644
- public-read-write → 0666
- private, bucket-owner-full-control → 0660

Also adds -defaultFileMode / -s3.defaultFileMode flag to set a
server-wide default when no ACL header is present.

Closes #8874

* Address review feedback for S3 file mode feature

- Extract hardcoded 0660 to defaultFileMode constant
- Change parseDefaultFileMode to return error instead of calling Fatalf
- Add -s3.defaultFileMode flag to filer.go and mini.go (was missing)
- Add doc comment to S3Options about updating all four flag sites
- Add TestResolveFileMode with 10 test cases covering ACL mapping,
  server default, and priority ordering
2026-04-02 11:51:54 -07:00
Chris Lu
b3e50bb12f fix(s3): remove customer encryption key from SSE-C debug log (#8875)
* fix(s3): remove customer encryption key from SSE-C debug log

The debug log in validateAndParseSSECHeaders was logging the raw
customer-provided encryption key bytes in hex format (keyBytes=%x),
leaking sensitive key material to log output. Remove the key bytes
from the log statement while keeping the MD5 hash comparison info.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-01 23:23:56 -07:00
msementsov
4c13a9ce65 Client disconnects create context cancelled errors, 500x errors and Filer lookup failures (#8845)
* Update stream.go

Client disconnects create context cancelled errors and Filer lookup failures

* s3api: handle canceled stream requests cleanly

* s3api: address canceled streaming review feedback

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-30 12:11:30 -07:00
Lars Lehtonen
5c5d377277 weed/s3api: prune test-only functions (#8840)
weed/s3api: prune functions that are referenced only from tests and the tests that exercise them.
2026-03-30 09:43:33 -07:00
Chris Lu
e8a6fcaafb s3api: skip TTL fast-path for versioned buckets (#8823)
* s3api: skip TTL fast-path for versioned buckets (#8757)

PutBucketLifecycleConfiguration was translating Expiration.Days into
filer.conf TTL entries for all buckets. For versioned buckets this is
wrong:

 1. TTL volumes expire as a unit, destroying all data — including
    noncurrent versions that should be preserved.
 2. Filer-backend TTL (RocksDB compaction filter, Redis key expiry)
    removes entries without triggering chunk deletion, leaving orphaned
    volume data with 0 deleted bytes.
 3. On AWS S3, Expiration.Days on a versioned bucket creates a delete
    marker — it does not hard-delete data. TTL has no such nuance.

Fix: skip the TTL fast-path when the bucket has versioning enabled or
suspended. All lifecycle rules are evaluated at scan time by the
lifecycle worker instead.

Also fix the lifecycle worker to evaluate Expiration rules against the
latest version in .versions/ directories, which was previously skipped
entirely — only NoncurrentVersionExpiration was handled.

* lifecycle worker: handle SeaweedList error in versions dir cleanup

Do not assume the directory is empty when the list call fails — log
the error and skip the directory to avoid incorrect deletion.

* address review feedback

- Fetch version file for tag-based rules instead of reading tags from
  the .versions directory entry where they are not cached.
- Handle getBucketVersioningStatus error by failing closed (treat as
  versioned) to avoid creating TTL entries on transient failures.
- Capture and assert deleteExpiredObjects return values in test.
- Improve test documentation.
2026-03-29 00:05:53 -07:00
Chris Lu
297cdef1a3 s3api: accept all supported lifecycle rule types (#8813)
* s3api: accept NoncurrentVersionExpiration, AbortIncompleteMultipartUpload, Expiration.Date

Update PutBucketLifecycleConfigurationHandler to accept all newly-supported
lifecycle rule types. Only Transition and NoncurrentVersionTransition are
still rejected (require storage class tier infrastructure).

Changes:
- Remove ErrNotImplemented for Expiration.Date (handled by worker at scan time)
- Only reject rules with Transition.set or NoncurrentVersionTransition.set
- Extract prefix from Filter.And when present
- Add comment explaining that non-Expiration.Days rules are evaluated by
  the lifecycle worker from stored lifecycle XML, not via filer.conf TTL

The lifecycle XML is already stored verbatim in bucket metadata, so new
rule types are preserved on Get even without explicit handler support.
Filer.conf TTL entries are only created for Expiration.Days (fast path).

* s3api: skip TTL fast path for rules with tag or size filters

Rules with tag or size constraints (Filter.Tag, Filter.And with tags
or size bounds, Filter.ObjectSizeGreaterThan/LessThan) must not be
lowered to filer.conf TTL entries, because TTL applies unconditionally
to all objects under the prefix. These rules are evaluated at scan
time by the lifecycle worker which checks each object's tags and size.

Only simple Expiration.Days rules with prefix-only filters use the
TTL fast path (RocksDB compaction filter).

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 19:39:21 -07:00
Chris Lu
782ab84f95 lifecycle worker: drive MPU abort from lifecycle rules (#8812)
* lifecycle worker: drive MPU abort from lifecycle rules

Update the multipart upload abort phase to read
AbortIncompleteMultipartUpload.DaysAfterInitiation from the parsed
lifecycle rules. Falls back to the worker config abort_mpu_days when
no lifecycle XML rule specifies the value.

This means per-bucket MPU abort thresholds are now respected when
set via PutBucketLifecycleConfiguration, instead of using a single
global worker config value for all buckets.

* lifecycle worker: only use config AbortMPUDays when no lifecycle XML exists

When a bucket has lifecycle XML (useRuleEval=true) but no
AbortIncompleteMultipartUpload rule, mpuAbortDays should be 0
(no abort), not the worker config default. The config fallback
should only apply to buckets without lifecycle XML.

* lifecycle worker: only skip .uploads at bucket root

* lifecycle worker: use per-upload rule evaluation for MPU abort

Replace the single bucket-wide mpuAbortDays with per-upload evaluation
using s3lifecycle.EvaluateMPUAbort, which respects each rule's prefix
filter and DaysAfterInitiation threshold.

Previously the code took the first enabled abort rule's days value
and applied it to all uploads, ignoring prefix scoping and multiple
rules with different thresholds.

Config fallback (abort_mpu_days) now only applies when lifecycle XML
is truly absent (xmlPresent=false), not when XML exists but has no
abort rules.

Also fix EvaluateMPUAbort to use expectedExpiryTime for midnight-UTC
semantics matching other lifecycle cutoffs.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 13:50:33 -07:00
Chris Lu
f52a3c87ce lifecycle worker: fix ExpiredObjectDeleteMarker to match AWS semantics (#8811)
* lifecycle worker: add NoncurrentVersionExpiration support

Add version-aware scanning to the rule-based execution path. When the
walker encounters a .versions directory, processVersionsDirectory():
- Lists all version entries (v_<versionId>)
- Sorts by version timestamp (newest first)
- Walks non-current versions with ShouldExpireNoncurrentVersion()
  which handles both NoncurrentDays and NewerNoncurrentVersions
- Extracts successor time from version IDs (both old/new format)
- Skips delete markers in noncurrent version counting
- Falls back to entry Mtime when version ID timestamp is unavailable

Helper functions:
- sortVersionsByTimestamp: insertion sort by version ID timestamp
- getEntryVersionTimestamp: extracts timestamp with Mtime fallback

* lifecycle worker: address review feedback for noncurrent versions

- Use sentinel errLimitReached in versions directory handler
- Set NoncurrentIndex on ObjectInfo for proper NewerNoncurrentVersions
  evaluation

* lifecycle worker: fail closed on XML parse error, guard zero Mtime

- Fail closed when lifecycle XML exists but fails to parse, instead
  of falling back to TTL which could apply broader rules
- Guard Mtime > 0 before using time.Unix(mtime, 0) to avoid mapping
  unset Mtime to 1970, which would misorder versions and cause
  premature expiration

* lifecycle worker: count delete markers toward NoncurrentIndex

Noncurrent delete markers should count toward the
NewerNoncurrentVersions retention threshold so data versions
get the correct position index. Previously, skipping delete
markers without incrementing the index could retain too many
versions after delete/recreate cycles.

* lifecycle worker: fix version ordering, error propagation, and fail-closed scope

1. Use full version ID comparison (CompareVersionIds) for sorting
   .versions entries, not just decoded timestamps. Two versions with
   the same timestamp prefix but different random suffixes were
   previously misordered, potentially treating the newest version as
   noncurrent and deleting it.

2. Propagate .versions listing failures to the caller instead of
   swallowing them with (nil, 0). Transient filer errors on a
   .versions directory now surface in the job result.

3. Narrow the fail-closed path to only malformed lifecycle XML
   (errMalformedLifecycleXML). Transient filer LookupEntry errors
   now fall back to TTL with a warning, matching the original intent
   of "fail closed on bad config, not on network blips."

* lifecycle worker: only skip .uploads at bucket root

* lifecycle worker: sort.Slice, mixed-format test, XML presence tracking

- Replace manual insertion sort with sort.Slice in sortVersionsByVersionId
- Add TestCompareVersionIdsMixedFormats covering old/new format ordering
- Distinguish "no lifecycle XML" (nil) from "XML present but no effective
  rules" (non-nil empty slice) so buckets with all-disabled rules don't
  incorrectly fall back to filer.conf TTL expiration

* lifecycle worker: guard nil Attributes, use TrimSuffix in test

- Guard entry.Attributes != nil before accessing GetFileSize() and
  Mtime in both listExpiredObjectsByRules and processVersionsDirectory
- Use strings.TrimPrefix/TrimSuffix in TestVersionsDirectoryNaming
  to match the production code pattern

* lifecycle worker: skip TTL scan when XML present, fix test assertions

- When lifecycle XML is present but has no effective rules, skip
  object scanning entirely instead of falling back to TTL path
- Test sort output against concrete expected names instead of
  re-using the same comparator as the sort itself

* lifecycle worker: fix ExpiredObjectDeleteMarker to match AWS semantics

Rewrite cleanupDeleteMarkers() to only remove delete markers that are
the sole remaining version of an object. Previously, delete markers
were removed unconditionally which could resurface older versions in
versioned buckets.

New algorithm:
1. Walk bucket tree looking for .versions directories
2. Check ExtLatestVersionIsDeleteMarker from directory metadata
3. Count versions in the .versions directory
4. Only remove if count == 1 (delete marker is sole version)
5. Require an ExpiredObjectDeleteMarker=true rule (when lifecycle
   XML rules are present)
6. Remove the empty .versions directory after cleanup

This phase runs after NoncurrentVersionExpiration so version counts
are accurate.

* lifecycle worker: respect prefix filter in ExpiredObjectDeleteMarker rules

Previously hasDeleteMarkerRule was a bucket-wide boolean that ignored
rule prefixes. A prefix-scoped rule like "logs/" would incorrectly
clean up delete markers in all paths.

Add matchesDeleteMarkerRule() that checks if a matching enabled
ExpiredObjectDeleteMarker rule exists for the specific object key,
respecting the rule's prefix filter. Falls back to legacy behavior
(allow cleanup) when no lifecycle XML rules are provided.

* lifecycle worker: only skip .uploads at bucket root

Check dir == bucketPath before skipping directories named .uploads.
Previously a user-created directory like data/.uploads/ at any depth
would be incorrectly skipped during lifecycle scanning.

* lifecycle worker: fix delete marker cleanup with XML-present empty rules

1. matchesDeleteMarkerRule now uses nil check (not len==0) for legacy
   fallback. A non-nil empty slice means lifecycle XML was present but
   had no ExpiredObjectDeleteMarker rules, so cleanup is blocked.
   Previously, an empty slice triggered the legacy true path.

2. Use per-directory removedHere flag instead of cumulative cleaned
   counter when deciding to remove .versions directories. Previously,
   after the first successful cleanup anywhere in the bucket, every
   subsequent .versions directory would be removed even if its own
   delete marker was not actually deleted.

* lifecycle worker: use full filter matching for delete marker rules

matchesDeleteMarkerRule now uses s3lifecycle.MatchesFilter (exported)
instead of prefix-only matching. This ensures tag and size filters
on ExpiredObjectDeleteMarker rules are respected, preventing broader
deletions than the configured policy intends.

Add TestMatchesDeleteMarkerRule covering: nil rules (legacy), empty
rules (XML present), prefix match/mismatch, disabled rules, rules
without the flag, and tag-filtered rules against tagless markers.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 13:26:57 -07:00
Lars Lehtonen
b01a74c6bb Prune Unused Functions from weed/s3api (#8815)
* weed/s3api: prune calculatePartOffset()

* weed/s3api: prune clearCachedListMetadata()

* weed/s3api: prune S3ApiServer.isObjectRetentionActive()

* weed/s3api: prune S3ApiServer.ensureDirectoryAllEmpty()

* weed/s3api: prune s3ApiServer.getEncryptionTypeString()

* weed/s3api: prune newStreamError()

* weed/s3api: prune S3ApiServer.rotateSSECKey()

weed/s3api: prune S3ApiServer.rotateSSEKMSKey()

weed/s3api: prune S3ApiServer.rotateSSEKMSMetadataOnly()

weed/s3api: prune S3ApiServer.rotateSSECChunks()

weed/s3api: prune S3ApiServer.rotateSSEKMSChunks()

weed/s3api: prune S3ApiServer.rotateSSECChunk()

weed/s3api: prune S3ApiServer.rotateSSEKMSChunk()

* weed/s3api: prune addCounterToIV()

* weed/s3api: prune minInt()

* weed/s3api: prune isMethodActionMismatch()

* weed/s3api: prune hasSpecificQueryParameters()

* weed/s3api: prune handlePutToFilerError()

weed/s3api: prune handlePutToFilerInternalError()

weed/s3api: prune logErrorAndReturn()

weed/s3api: prune logInternalError

weed/s3api: prune handleSSEError()

weed/s3api: prune handleSSEInternalError()

* weed/s3api: prune encryptionConfigToProto()

* weed/s3api: prune S3ApiServer.touch()
2026-03-28 13:24:11 -07:00
Chris Lu
f6ec9941cb lifecycle worker: NoncurrentVersionExpiration support (#8810)
* lifecycle worker: add NoncurrentVersionExpiration support

Add version-aware scanning to the rule-based execution path. When the
walker encounters a .versions directory, processVersionsDirectory():
- Lists all version entries (v_<versionId>)
- Sorts by version timestamp (newest first)
- Walks non-current versions with ShouldExpireNoncurrentVersion()
  which handles both NoncurrentDays and NewerNoncurrentVersions
- Extracts successor time from version IDs (both old/new format)
- Skips delete markers in noncurrent version counting
- Falls back to entry Mtime when version ID timestamp is unavailable

Helper functions:
- sortVersionsByTimestamp: insertion sort by version ID timestamp
- getEntryVersionTimestamp: extracts timestamp with Mtime fallback

* lifecycle worker: address review feedback for noncurrent versions

- Use sentinel errLimitReached in versions directory handler
- Set NoncurrentIndex on ObjectInfo for proper NewerNoncurrentVersions
  evaluation

* lifecycle worker: fail closed on XML parse error, guard zero Mtime

- Fail closed when lifecycle XML exists but fails to parse, instead
  of falling back to TTL which could apply broader rules
- Guard Mtime > 0 before using time.Unix(mtime, 0) to avoid mapping
  unset Mtime to 1970, which would misorder versions and cause
  premature expiration

* lifecycle worker: count delete markers toward NoncurrentIndex

Noncurrent delete markers should count toward the
NewerNoncurrentVersions retention threshold so data versions
get the correct position index. Previously, skipping delete
markers without incrementing the index could retain too many
versions after delete/recreate cycles.

* lifecycle worker: fix version ordering, error propagation, and fail-closed scope

1. Use full version ID comparison (CompareVersionIds) for sorting
   .versions entries, not just decoded timestamps. Two versions with
   the same timestamp prefix but different random suffixes were
   previously misordered, potentially treating the newest version as
   noncurrent and deleting it.

2. Propagate .versions listing failures to the caller instead of
   swallowing them with (nil, 0). Transient filer errors on a
   .versions directory now surface in the job result.

3. Narrow the fail-closed path to only malformed lifecycle XML
   (errMalformedLifecycleXML). Transient filer LookupEntry errors
   now fall back to TTL with a warning, matching the original intent
   of "fail closed on bad config, not on network blips."

* lifecycle worker: only skip .uploads at bucket root

* lifecycle worker: sort.Slice, mixed-format test, XML presence tracking

- Replace manual insertion sort with sort.Slice in sortVersionsByVersionId
- Add TestCompareVersionIdsMixedFormats covering old/new format ordering
- Distinguish "no lifecycle XML" (nil) from "XML present but no effective
  rules" (non-nil empty slice) so buckets with all-disabled rules don't
  incorrectly fall back to filer.conf TTL expiration

* lifecycle worker: guard nil Attributes, use TrimSuffix in test

- Guard entry.Attributes != nil before accessing GetFileSize() and
  Mtime in both listExpiredObjectsByRules and processVersionsDirectory
- Use strings.TrimPrefix/TrimSuffix in TestVersionsDirectoryNaming
  to match the production code pattern

* lifecycle worker: skip TTL scan when XML present, fix test assertions

- When lifecycle XML is present but has no effective rules, skip
  object scanning entirely instead of falling back to TTL path
- Test sort output against concrete expected names instead of
  re-using the same comparator as the sort itself

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 12:58:21 -07:00
Chris Lu
54dd4f091d s3lifecycle: add lifecycle rule evaluator package and extend XML types (#8807)
* s3api: extend lifecycle XML types with NoncurrentVersionExpiration, AbortIncompleteMultipartUpload

Add missing S3 lifecycle rule types to the XML data model:
- NoncurrentVersionExpiration with NoncurrentDays and NewerNoncurrentVersions
- NoncurrentVersionTransition with NoncurrentDays and StorageClass
- AbortIncompleteMultipartUpload with DaysAfterInitiation
- Filter.ObjectSizeGreaterThan and ObjectSizeLessThan
- And.ObjectSizeGreaterThan and ObjectSizeLessThan
- Filter.UnmarshalXML to properly parse Tag, And, and size filter elements

Each new type follows the existing set-field pattern for conditional
XML marshaling. No behavior changes - these types are not yet wired
into handlers or the lifecycle worker.

* s3lifecycle: add lifecycle rule evaluator package

New package weed/s3api/s3lifecycle/ provides a pure-function lifecycle
rule evaluation engine. The evaluator accepts flattened Rule structs and
ObjectInfo metadata, and returns the appropriate Action.

Components:
- evaluator.go: Evaluate() for per-object actions with S3 priority
  ordering (delete marker > noncurrent version > current expiration),
  ShouldExpireNoncurrentVersion() with NewerNoncurrentVersions support,
  EvaluateMPUAbort() for multipart upload rules
- filter.go: prefix, tag, and size-based filter matching
- tags.go: ExtractTags() extracts S3 tags from filer Extended metadata,
  HasTagRules() for scan-time optimization
- version_time.go: GetVersionTimestamp() extracts timestamps from
  SeaweedFS version IDs (both old and new format)

Comprehensive test coverage: 54 tests covering all action types,
filter combinations, edge cases, and version ID formats.

* s3api: add UnmarshalXML for Expiration, Transition, ExpireDeleteMarker

Add UnmarshalXML methods that set the internal 'set' flag during XML
parsing. Previously these flags were only set programmatically, causing
XML round-trip to drop elements. This ensures lifecycle configurations
stored as XML survive unmarshal/marshal cycles correctly.

Add comprehensive XML round-trip tests for all lifecycle rule types
including NoncurrentVersionExpiration, AbortIncompleteMultipartUpload,
Filter with Tag/And/size constraints, and a complete Terraform-style
lifecycle configuration.

* s3lifecycle: address review feedback

- Fix version_time.go overflow: guard timestampPart > MaxInt64 before
  the inversion subtraction to prevent uint64 wrap
- Make all expiry checks inclusive (!now.Before instead of now.After)
  so actions trigger at the exact scheduled instant
- Add NoncurrentIndex to ObjectInfo so Evaluate() can properly handle
  NewerNoncurrentVersions via ShouldExpireNoncurrentVersion()
- Add test for high-bit overflow version ID

* s3lifecycle: guard ShouldExpireNoncurrentVersion against zero SuccessorModTime

Add early return when obj.IsLatest or obj.SuccessorModTime.IsZero()
to prevent premature expiration of versions with uninitialized
successor timestamps (zero value would compute to epoch, always expired).

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 11:10:31 -07:00
Chris Lu
7d5cbfd547 s3: support s3:x-amz-server-side-encryption policy condition (#8806)
* s3: support s3:x-amz-server-side-encryption policy condition (#7680)

- Normalize x-amz-server-side-encryption header values to canonical form
  (aes256 → AES256, aws:kms mixed-case → aws:kms) so StringEquals
  conditions work regardless of client capitalisation
- Exempt UploadPart and UploadPartCopy from SSE Null conditions: these
  actions inherit SSE from the initial CreateMultipartUpload request and
  do not re-send the header, so Deny/Null("true") should not block them
- Add sse_condition_test.go covering StringEquals, Null, case-insensitive
  normalisation, and multipart continuation action exemption

* s3: address review comments on SSE condition support

- Replace "inherited" sentinel in injectSSEForMultipart with "AES256" so
  that StringEquals/Null conditions evaluate against a meaningful value;
  add TODO noting that KMS multipart uploads need the actual algorithm
  looked up from the upload state
- Rewrite TestSSECaseInsensitiveNormalization to drive normalisation
  through EvaluatePolicyForRequest with a real *http.Request so regressions
  in the production code path are caught; split into AES256 and aws:kms
  variants to cover both normalisation branches

* s3: plumb real inherited SSE from multipart upload state into policy eval

Instead of injecting a static "AES256" sentinel for UploadPart/UploadPartCopy,
look up the actual SSE algorithm from the stored CreateMultipartUpload entry
and pass it through the evaluation chain.

Changes:
- PolicyEvaluationArgs gains InheritedSSEAlgorithm string; set by the
  BucketPolicyEngine wrapper for multipart continuation actions
- injectSSEForMultipart(conditions, inheritedSSE) now accepts the real
  algorithm; empty string means no SSE → Null("true") fires correctly
- IsMultipartContinuationAction exported so the s3api wrapper can use it
- BucketPolicyEngine gets a MultipartSSELookup callback (set by S3ApiServer)
  that fetches the upload entry and reads SeaweedFSSSEKMSKeyID /
  SeaweedFSSSES3Encryption to determine the algorithm
- S3ApiServer.getMultipartSSEAlgorithm implements the lookup via getEntry
- Tests updated: three multipart cases (AES256, aws:kms, no-SSE-must-deny)
  plus UploadPartCopy coverage
2026-03-27 23:15:01 -07:00
Chris Lu
e3f052cd84 s3api: preserve lifecycle config responses for Terraform (#8805)
* s3api: preserve lifecycle configs for terraform

* s3api: bound lifecycle config request bodies

* s3api: make bucket config updates copy-on-write

* s3api: tighten string slice cloning
2026-03-27 22:50:02 -07:00
Chris Lu
0adb78bc6b s3api: make conditional mutations atomic and AWS-compatible (#8802)
* s3api: serialize conditional write finalization

* s3api: add conditional delete mutation checks

* s3api: enforce destination conditions for copy

* s3api: revalidate multipart completion under lock

* s3api: rollback failed put finalization hooks

* s3api: report delete-marker version deletions

* s3api: fix copy destination versioning edge cases

* s3api: make versioned multipart completion idempotent

* test/s3: cover conditional mutation regressions

* s3api: rollback failed copy version finalization

* s3api: resolve suspended delete conditions via latest entry

* s3api: remove copy test null-version injection

* s3api: reject out-of-order multipart completions

* s3api: preserve multipart replay version metadata

* s3api: surface copy destination existence errors

* s3api: simplify delete condition target resolution

* test/s3: make conditional delete assertions order independent

* test/s3: add distributed lock gateway integration

* s3api: fail closed multipart versioned completion

* s3api: harden copy metadata and overwrite paths

* s3api: create delete markers for suspended deletes

* s3api: allow duplicate multipart completion parts
2026-03-27 19:22:26 -07:00
Chris Lu
17028fbf59 fix: serialize SSE-KMS metadata when bucket default encryption applies KMS (#8780)
* fix: serialize SSE-KMS metadata when bucket default encryption applies KMS

When a bucket has default SSE-KMS encryption enabled and a file is uploaded
without explicit SSE headers, the encryption was applied correctly but the
SSE-KMS metadata (x-seaweedfs-sse-kms-key) was not serialized. This caused
downloads to fail with "empty SSE-KMS metadata" because the entry's Extended
map stored an empty byte slice.

The existing code already handled this for SSE-S3 bucket defaults
(SerializeSSES3Metadata) but was missing the equivalent call to
SerializeSSEKMSMetadata for the KMS path.

Fixes seaweedfs/seaweedfs#8776

* ci: add KMS integration tests to GitHub Actions

Add a kms-tests.yml workflow that runs on changes to KMS/SSE code with
two jobs:

1. KMS provider tests: starts OpenBao via Docker, runs Go integration
   tests in test/kms/ against a real KMS backend
2. S3 KMS e2e tests: starts OpenBao + weed mini built from source, runs
   test_s3_kms.sh which covers bucket-default SSE-KMS upload/download
   (the exact scenario from #8776)

Supporting changes:
- test/kms/Makefile: add CI targets (test-provider-ci, test-s3-kms-ci)
  that manage OpenBao via plain Docker and run weed from source
- test/kms/s3-config-openbao-template.json: S3 config template with
  OpenBao KMS provider for weed mini

* refactor: combine SSE-S3 and SSE-KMS metadata serialization into else-if

SSE-S3 and SSE-KMS bucket default encryption are mutually exclusive, so
use a single if/else-if block instead of two independent if blocks.

* Update .github/workflows/kms-tests.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix(ci): start weed mini from data dir to avoid Docker filer.toml

weed mini reads filer.toml from the current working directory first.
When running from test/kms/, it picked up the Docker-targeted filer.toml
which has dir="/data/filerdb" (a path that doesn't exist in CI), causing
a fatal crash at filer store initialization.

Fix by cd-ing to the data directory before starting weed mini.
Also improve log visibility on failure.

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2026-03-26 14:07:01 -07:00
Chris Lu
28fe92065a S3: reject part uploads after AbortMultipartUpload (#8768)
* S3: reject part uploads after AbortMultipartUpload

PutObjectPartHandler did not verify that the multipart upload session
still exists before accepting parts. After AbortMultipartUpload deleted
the upload directory, the ErrNotFound from getEntry was silently ignored
(treated as "may be non-SSE upload"), allowing parts to be stored as
orphaned files.

Now return ErrNoSuchUpload when the upload directory is not found,
matching AWS S3 behavior.

Fixes #8766

* S3: check upload existence unconditionally in PutObjectPartHandler

Move the getEntry call out of the SSE-type conditional so the upload
existence check runs for all part uploads, including SSE-C. Previously
the SSE-C path skipped the check entirely, allowing parts to be uploaded
after abort when SSE-C headers were present.

Also flattens the nested SSE branching by one level now that getEntry
is called once upfront.

* S3: address PR review feedback for PutObjectPartHandler

- Log at error level when getEntry fails with an unexpected error,
  since we return ErrInternalError to the client
- Distinguish base IV decode errors from length validation failures
  with separate, clearer error messages

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-24 18:11:51 -07:00
Chris Lu
0b3867dca3 filer: add structured error codes to CreateEntryResponse (#8767)
* filer: add FilerError enum and error_code field to CreateEntryResponse

Add a machine-readable error code alongside the existing string error
field. This follows the precedent set by PublishMessageResponse in the
MQ broker proto. The string field is kept for human readability and
backward compatibility.

Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE,
EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS.

* filer: add sentinel errors and error code mapping in filer_pb

Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in
the filer_pb package so both the filer and consumers can reference them
without circular imports.

Add FilerErrorToSentinel() to map proto error codes to sentinels, and
update CreateEntryWithResponse() to check error_code first, falling back
to the string-based path for backward compatibility with old servers.

* filer: return wrapped sentinel errors and set proto error codes

Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and
ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using
%w). This preserves errors.Is() traversal on the server side.

In the gRPC CreateEntry handler, map sentinel errors to the
corresponding FilerError proto codes using errors.Is(), setting both
resp.Error (string, for backward compat) and resp.ErrorCode (enum).

* S3: use errors.Is() with filer sentinels instead of string matching

Replace fragile string-based error matching in filerErrorToS3Error and
other S3 API consumers with errors.Is() checks against filer_pb sentinel
errors. This works because the updated CreateEntryWithResponse helper
reconstructs sentinel errors from the proto FilerError code.

Update iceberg stage_create and metadata_files to check resp.ErrorCode
instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is()
for the already-exists check.

String matching is retained only for non-filer errors (gRPC transport
errors, checksum validation) that don't go through CreateEntryResponse.

* filer: remove backward-compat string fallbacks for error codes

Clients and servers are always deployed together, so there is no need
for backward-compatibility fallback paths that parse resp.Error strings
when resp.ErrorCode is unset. Simplify all consumers to rely solely on
the structured error code.

* iceberg: ensure unknown non-OK error codes are not silently ignored

When FilerErrorToSentinel returns nil for an unrecognized error code,
return an error including the code and message rather than falling
through to return nil.

* filer: fix redundant error message and restore error wrapping in helper

Use request path instead of resp.Error in the sentinel error format
string to avoid duplicating the sentinel message (e.g. "entry already
exists: entry already exists"). Restore %w wrapping with errors.New()
in the fallback paths so callers can use errors.Is()/errors.As().

* filer: promote file to directory on path conflict instead of erroring

S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object)
to coexist because S3 has a flat key space. When ensureParentDirectoryEntry
finds a parent path that is a file instead of a directory, promote it to
a directory by setting ModeDir while preserving the original content and
chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry
type-change guard.

This fixes the S3 compatibility test failures where creating overlapping
keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.
2026-03-24 17:08:22 -07:00
Chris Lu
152884eff2 S3: add s3: prefix to x-amz-* condition keys for AWS compatibility (#8765)
AWS S3 policy conditions reference request headers with the s3: namespace
prefix (e.g., s3:x-amz-server-side-encryption). The extraction code was
storing these headers without the prefix, so bucket policy conditions
using the standard AWS key names would never match.
2026-03-24 14:04:42 -07:00
Chris Lu
2877febd73 S3: fix silent PutObject failure and enforce 1024-byte key limit (#8764)
* S3: add KeyTooLongError error code

Add ErrKeyTooLongError (HTTP 400, code "KeyTooLongError") to match the
standard AWS S3 error for object keys that exceed length limits.

* S3: fix silent PutObject failure when entry name exceeds max_file_name_length

putToFiler called client.CreateEntry() directly and discarded the gRPC
response. The filer embeds application errors like "entry name too long"
in resp.Error (not as gRPC transport errors), so the error was silently
swallowed and clients received HTTP 200 with an ETag for objects that
were never stored.

Switch to the filer_pb.CreateEntry() helper which properly checks
resp.Error, and map "entry name too long" to KeyTooLongError (HTTP 400).

To avoid fragile string parsing across the gRPC boundary, define shared
error message constants in weed/util/constants and use them in both the
filer (producing errors) and S3 API (matching errors). Switch
filerErrorToS3Error to use strings.Contains/HasSuffix with these
constants so matches work regardless of any wrapper prefix. Apply
filerErrorToS3Error to the mkdir path for directory markers.

Fixes #8759

* S3: enforce 1024-byte maximum object key length

AWS S3 limits object keys to 1024 bytes. Add early validation on write
paths (PutObject, CopyObject, CreateMultipartUpload) to reject keys
exceeding the limit with the standard KeyTooLongError (HTTP 400).

The key length check runs before bucket auto-creation to prevent
overlong keys from triggering unnecessary side effects.

Also use filerErrorToS3Error for CopyObject's mkFile error paths so
name-too-long errors from the filer return KeyTooLongError instead of
InternalError.

Ref #8758

* S3: add handler-level tests for key length validation and error mapping

Add tests for filerErrorToS3Error mapping "entry name too long" to
KeyTooLongError, including a regression test for the CreateEntry-prefixed
"existing ... is a directory" form. Add handler-level integration tests
that exercise PutObjectHandler, CopyObjectHandler, and
NewMultipartUploadHandler via httptest, verifying HTTP 400 and
KeyTooLongError XML response for overlong keys and acceptance of keys at
the 1024-byte limit.
2026-03-24 13:35:28 -07:00
Chris Lu
e5f72077ee fix: resolve CORS cache race condition causing stale 404 responses (#8748)
The metadata subscription handler (updateBucketConfigCacheFromEntry) was
making a separate RPC call via loadCORSFromBucketContent to load CORS
configuration. This created a race window where a slow CreateBucket
subscription event could re-cache stale data after PutBucketCors had
already cleared the cache, causing subsequent GetBucketCors to return
404 NoSuchCORSConfiguration.

Parse CORS directly from the subscription entry's Content field instead
of making a separate RPC. Also fix getBucketConfig to parse CORS from
the already-fetched entry, eliminating a redundant RPC call.

Fix TestCORSCaching to use require.NoError to prevent nil pointer
dereference panics when GetBucketCors fails.
2026-03-23 19:33:20 -07:00
Chris Lu
d5ee35c8df Fix S3 delete for non-empty directory markers (#8740)
* Fix S3 delete for non-empty directory markers

* Address review feedback on directory marker deletes

* Stabilize FUSE concurrent directory operations
2026-03-23 13:35:16 -07:00
Chris Lu
6ccda3e809 fix(s3): allow deleting the anonymous user from admin webui (#8706)
Remove the block that prevented deleting the "anonymous" identity
and stop auto-creating it when absent.  If no anonymous identity
exists (or it is disabled), LookupAnonymous returns not-found and
both auth paths return ErrAccessDenied for anonymous requests.

To enable anonymous access, explicitly create the "anonymous" user.
To revoke it, delete the user like any other identity.

Closes #8694
2026-03-19 18:10:20 -07:00
Chris Lu
80f3079d2a fix(s3): include directory markers in ListObjects without delimiter (#8704)
* fix(s3): include directory markers in ListObjects without delimiter (#8698)

Directory key objects (zero-byte objects with keys ending in "/") created
via PutObject were omitted from ListObjects/ListObjectsV2 results when no
delimiter was specified. AWS S3 includes these as regular keys in Contents.

The issue was in doListFilerEntries: when recursing into directories in
non-delimiter mode, directory key objects were only emitted when
prefixEndsOnDelimiter was true. Added an else branch to emit them in the
general recursive case as well.

* remove issue reference from inline comment

* test: add child-under-marker and paginated listing coverage

Extend test 6 to place a child object under the directory marker
and paginate with MaxKeys=1 so the emit-then-recurse truncation
path is exercised.

* fix(test): skip directory markers in Spark temporary artifacts check

The listing check now correctly shows directory markers (keys ending
in "/") after the ListObjects fix. These 0-byte metadata objects are
not data artifacts — filter them from the listing check since the
HeadObject-based check already verifies their cleanup with a timeout.
2026-03-19 15:36:11 -07:00
Chris Lu
c197206897 fix(s3): return ETag header for directory marker PutObject requests (#8688)
* fix(s3): return ETag header for directory marker PutObject requests

The PutObject handler has a special path for keys ending with "/" (directory
markers) that creates the entry via mkdir. This path never computed or set
the ETag response header, unlike the regular PutObject path. AWS S3 always
returns an ETag header, even for empty-body puts.

Compute the MD5 of the content (empty or otherwise), store it in the entry
attributes and extended attributes, and set the ETag response header.

Fixes #8682

* fix: handle io.ReadAll error and chunked encoding for directory markers

Address review feedback:
- Handle error from io.ReadAll instead of silently discarding it
- Change condition from ContentLength > 0 to ContentLength != 0 to
  correctly handle chunked transfer encoding (ContentLength == -1)

* fix hanging tests
2026-03-18 17:26:33 -07:00
Chris Lu
9984ce7dcb fix(s3): omit NotResource:null from bucket policy JSON response (#8658)
* fix(s3): omit NotResource:null from bucket policy JSON response (#8657)

Change NotResource from value type to pointer (*StringOrStringSlice) so
that omitempty properly omits it when unset, matching the existing
Principal field pattern. This prevents IaC tools (Terraform, Ansible)
from detecting false configuration drift.

Add bucket policy round-trip idempotency integration tests.

* simplify JSON comparison in bucket policy idempotency test

Use require.JSONEq directly on the raw JSON strings instead of
round-tripping through unmarshal/marshal, since JSONEq already
handles normalization internally.

* fix bucket policy test cases that locked out the admin user

The Deny+NotResource test cases used Action:"s3:*" which denied the
admin's own GetBucketPolicy call. Scope deny to s3:GetObject only,
and add an Allow+NotResource variant instead.

* fix(s3): also make Resource a pointer to fix empty string in JSON

Apply the same omitempty pointer fix to the Resource field, which was
emitting "Resource":"" when only NotResource was set. Add
NewStringOrStringSlicePtr helper, make Strings() nil-safe, and handle
*StringOrStringSlice in normalizeToStringSliceWithError.

* improve bucket policy integration tests per review feedback

- Replace time.Sleep with waitForClusterReady using ListBuckets
- Use structural hasKey check instead of brittle substring NotContains
- Assert specific NoSuchBucketPolicy error code after delete
- Handle single-statement policies in hasKey helper
2026-03-16 12:58:26 -07:00
Chris Lu
acea36a181 filer: add conditional update preconditions (#8647)
* filer: add conditional update preconditions

* iceberg: tighten metadata CAS preconditions
2026-03-16 12:33:32 -07:00
Chris Lu
8cde3d4486 Add data file compaction to iceberg maintenance (Phase 2) (#8503)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add data file compaction to iceberg maintenance handler (Phase 2)

Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata

Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix memory exhaustion in mergeParquetFiles by processing files sequentially

Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.

* Validate bucket/namespace/table names against path traversal

Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.

* Add filer address failover in iceberg maintenance handler

Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.

* Add separate MinManifestsToRewrite config for manifest rewrite threshold

The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.

* Fix risky mtime fallback in orphan removal that could delete new files

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.

* Fix undefined function references in iceberg_maintenance_handler.go

Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.

* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage

The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.

Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.

* Fix MinManifestsToRewrite clamping to match UI minimum of 2

The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.

* Sort entries by size descending in splitOversizedBin for better packing

Entries were processed in insertion order which is non-deterministic
from map iteration. Sorting largest-first before the splitting loop
improves bin packing efficiency by filling bins more evenly.

* Add context cancellation check to drainReader loop

The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.

* Fix splitOversizedBin to always respect targetSize limit

The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.

* Add integration tests for iceberg table maintenance plugin worker

Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.

Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.

* Fix splitOversizedBin dropping files and add source path to drainReader errors

The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.

Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.

* Harden integration test error handling

- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
  continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
  errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
  CreateEntryResponse.Error in addition to the RPC error

* go fmt

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 11:27:42 -07:00
Chris Lu
8ac4caf930 fix(s3api): return no-encryption instead of error when bucket metadata is missing
When getEncryptionConfiguration encounters a not-found error (e.g.,
during bucket recreation after a partial delete), return
ErrNoSuchBucketEncryptionConfiguration instead of ErrInternalError.
This prevents uploads from failing with 500 errors during recovery.
2026-03-11 13:49:21 -07:00
Chris Lu
ab85f46529 fix(s3api): clear negative cache in autoCreateBucket when bucket exists
When autoCreateBucket finds the bucket already exists, remove it from
the negative cache so subsequent requests don't unnecessarily trigger
another auto-create attempt.
2026-03-11 13:49:15 -07:00
Chris Lu
5208c7c727 fix(s3api): improve PutBucketHandler comment for orphaned collection recovery
Clarify the comment and log message for the case where a collection
exists but the bucket directory is missing, explaining the root cause
(partial deletion) more precisely.
2026-03-11 13:49:11 -07:00
Chris Lu
12b360f499 fix(s3api): delete bucket directory before collection to prevent inconsistent state
Reorder DeleteBucketHandler to remove the bucket directory first, then
delete the collection. If collection deletion fails, the bucket is still
effectively deleted and can be recreated. Previously, if directory
deletion succeeded but collection deletion failed, the bucket was left
in an unrecoverable state.
2026-03-11 13:49:03 -07:00
Chris Lu
d1a631123f fix(s3api): allow bucket recreation when orphaned collection exists (#8605)
* fix(s3api): allow bucket recreation when orphaned collection exists (#8601)

When a bucket is deleted, its filer directory is removed but the
underlying collection/volumes may not be fully cleaned up yet. If the
bucket is immediately recreated, PutBucketHandler was returning
ErrBucketAlreadyExists due to the orphaned collection, blocking bucket
recreation and causing subsequent uploads to fail with InternalError.

Allow bucket creation to proceed when a collection exists without a
corresponding bucket directory, since this is a transient orphaned state
from a previous deletion.

* fix(s3api): handle concurrent bucket creation race in mkdir

On mkdir failure, re-check whether the bucket directory now exists
and return BucketAlreadyExists instead of InternalError when another
request created the bucket concurrently.
2026-03-11 13:42:43 -07:00
Chris Lu
e1e4c9437a fix(s3api): ListObjects with trailing-slash prefix matches sibling directories (#8599)
fix(s3api): ListObjects with trailing-slash prefix returns wrong results

When ListObjectsV2 is called with a prefix ending in "/" (e.g., "foo/"),
normalizePrefixMarker strips the trailing slash and splits into
dir="parent" and prefix="foo". The filer then lists entries matching
prefix "foo", which returns both directory "foo" and "foo1000".

The prefixEndsOnDelimiter guard correctly identifies directory "foo" as
the target and recurses into it, but then resets the guard to false.
The loop continues and incorrectly recurses into "foo1000" as well,
causing the listing to return objects from unrelated directories.

Fix: after recursing into the exact directory targeted by the
trailing-slash prefix, return immediately from the listing loop.
There is no reason to process sibling entries since the original
prefix specifically targeted one directory.
2026-03-11 02:28:34 -07:00
Chris Lu
0a5c5ed4ce Persist S3 bucket counter metrics across idle periods (#8595)
* Stop deleting counter metrics during bucket TTL cleanup

Counter metrics (traffic bytes, request counts, object counts) are
monotonically increasing by design. Deleting them after 10 minutes of
bucket inactivity causes them to vanish from /metrics output and reset
to zero when traffic resumes, breaking Prometheus rate()/increase()
queries and making historical traffic reporting impossible.

Only delete gauges and histograms in the TTL cleanup loop, as these
represent current state and are safely re-populated on next activity.

Fixes https://github.com/seaweedfs/seaweedfs/issues/8521

* Clean up all bucket metrics on bucket deletion

Add DeleteBucketMetrics() to delete all metrics (including counters)
for a bucket when it is explicitly deleted. This prevents unbounded
label cardinality from accumulating for buckets that no longer exist.

Called from DeleteBucketHandler after successful bucket deletion.

* Reduce mutex scope in bucket metrics TTL sweep

Collect expired bucket names under the lock, then release before
calling DeletePartialMatch on Prometheus metrics. This prevents
RecordBucketActiveTime from blocking during the expensive cleanup.
2026-03-10 19:00:40 -07:00
Chris Lu
f8d783f80e fix: ListObjectVersions interleave Version and DeleteMarker in sort order (#8567)
* fix: ListObjectVersions interleave Version and DeleteMarker in sort order

Go's default xml.Marshal serializes struct fields in definition order,
causing all <Version> elements to appear before all <DeleteMarker>
elements. The S3 API contract requires these elements to be interleaved
in the correct global sort order (by key ascending, then newest version
first within each key).

This broke clients that validate version list ordering within a single
key — an older Version would appear before a newer DeleteMarker for the
same object.

Fix: Replace the separate Versions/DeleteMarkers/CommonPrefixes arrays
with a single Entries []VersionListEntry slice. Each VersionListEntry
uses a per-element MarshalXML that outputs the correct XML tag name
(<Version>, <DeleteMarker>, or <CommonPrefixes>) based on which field
is populated. Since the entries are already in their correct sorted
order from buildSortedCombinedList, the XML output is automatically
interleaved correctly.

Also removes the unused ListObjectVersionsResult struct.

Note: The reporter also mentioned a cross-key timestamp ordering issue
when paginating with max-keys=1, but that is correct S3 behavior —
ListObjectVersions sorts by key name (ascending), not by timestamp.
Different keys having non-monotonic timestamps is expected.

* test: add CommonPrefixes XML marshaling coverage for ListObjectVersions

* fix: validate VersionListEntry has exactly one field set in MarshalXML

Return an error instead of silently emitting an empty <Version> element
when no field (or multiple fields) are populated. Also clean up the
misleading xml:"Version" struct tag on the Entries field.
2026-03-09 12:37:59 -07:00
Chris Lu
992db11d2b iam: add IAM group management (#8560)
* iam: add Group message to protobuf schema

Add Group message (name, members, policy_names, disabled) and
add groups field to S3ApiConfiguration for IAM group management
support (issue #7742).

* iam: add group CRUD to CredentialStore interface and all backends

Add group management methods (CreateGroup, GetGroup, DeleteGroup,
ListGroups, UpdateGroup) to the CredentialStore interface with
implementations for memory, filer_etc, postgres, and grpc stores.
Wire group loading/saving into filer_etc LoadConfiguration and
SaveConfiguration.

* iam: add group IAM response types

Add XML response types for group management IAM actions:
CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser.

* iam: add group management handlers to embedded IAM API

Add CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, and ListGroupsForUser handlers with
dispatch in ExecuteAction.

* iam: add group management handlers to standalone IAM API

Add group handlers (CreateGroup, DeleteGroup, GetGroup, ListGroups,
AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser) and wire into DoActions
dispatch. Also add helper functions for user/policy side effects.

* iam: integrate group policies into authorization

Add groups and userGroups reverse index to IdentityAccessManagement.
Populate both maps during ReplaceS3ApiConfiguration and
MergeS3ApiConfiguration. Modify evaluateIAMPolicies to evaluate
policies from user's enabled groups in addition to user policies.
Update VerifyActionPermission to consider group policies when
checking hasAttachedPolicies.

* iam: add group side effects on user deletion and rename

When a user is deleted, remove them from all groups they belong to.
When a user is renamed, update group membership references. Applied
to both embedded and standalone IAM handlers.

* iam: watch /etc/iam/groups directory for config changes

Add groups directory to the filer subscription watcher so group
file changes trigger IAM configuration reloads.

* admin: add group management page to admin UI

Add groups page with CRUD operations, member management, policy
attachment, and enable/disable toggle. Register routes in admin
handlers and add Groups entry to sidebar navigation.

* test: add IAM group management integration tests

Add comprehensive integration tests for group CRUD, membership,
policy attachment, policy enforcement, disabled group behavior,
user deletion side effects, and multi-group membership. Add
"group" test type to CI matrix in s3-iam-tests workflow.

* iam: address PR review comments for group management

- Fix XSS vulnerability in groups.templ: replace innerHTML string
  concatenation with DOM APIs (createElement/textContent) for rendering
  member and policy lists
- Use userGroups reverse index in embedded IAM ListGroupsForUser for
  O(1) lookup instead of iterating all groups
- Add buildUserGroupsIndex helper in standalone IAM handlers; use it
  in ListGroupsForUser and removeUserFromAllGroups for efficient lookup
- Add note about gRPC store load-modify-save race condition limitation

* iam: add defensive copies, validation, and XSS fixes for group management

- Memory store: clone groups on store/retrieve to prevent mutation
- Admin dash: deep copy groups before mutation, validate user/policy exists
- HTTP handlers: translate credential errors to proper HTTP status codes,
  use *bool for Enabled field to distinguish missing vs false
- Groups templ: use data attributes + event delegation instead of inline
  onclick for XSS safety, prevent stale async responses

* iam: add explicit group methods to PropagatingCredentialStore

Add CreateGroup, GetGroup, DeleteGroup, ListGroups, and UpdateGroup
methods instead of relying on embedded interface fallthrough. Group
changes propagate via filer subscription so no RPC propagation needed.

* iam: detect postgres unique constraint violation and add groups index

Return ErrGroupAlreadyExists when INSERT hits SQLState 23505 instead of
a generic error. Add index on groups(disabled) for filtered queries.

* iam: add Marker field to group list response types

Add Marker string field to GetGroupResult, ListGroupsResult,
ListAttachedGroupPoliciesResult, and ListGroupsForUserResult to
match AWS IAM pagination response format.

* iam: check group attachment before policy deletion

Reject DeletePolicy if the policy is attached to any group, matching
AWS IAM behavior. Add PolicyArn to ListAttachedGroupPolicies response.

* iam: include group policies in IAM authorization

Merge policy names from user's enabled groups into the IAMIdentity
used for authorization, so group-attached policies are evaluated
alongside user-attached policies.

* iam: check for name collision before renaming user in UpdateUser

Scan identities and inline policies for newUserName before mutating,
returning EntityAlreadyExists if a collision is found. Reuse the
already-loaded policies instead of loading them again inside the loop.

* test: use t.Cleanup for bucket cleanup in group policy test

* iam: wrap ErrUserNotInGroup sentinel in RemoveGroupMember error

Wrap credential.ErrUserNotInGroup so errors.Is works in
groupErrorToHTTPStatus, returning proper 400 instead of 500.

* admin: regenerate groups_templ.go with XSS-safe data attributes

Regenerated from groups.templ which uses data-group-name attributes
instead of inline onclick with string interpolation.

* iam: add input validation and persist groups during migration

- Validate nil/empty group name in CreateGroup and UpdateGroup
- Save groups in migrateToMultiFile so they survive legacy migration

* admin: use groupErrorToHTTPStatus in GetGroupMembers and GetGroupPolicies

* iam: short-circuit UpdateUser when newUserName equals current name

* iam: require empty PolicyNames before group deletion

Reject DeleteGroup when group has attached policies, matching the
existing members check. Also fix GetGroup error handling in
DeletePolicy to only skip ErrGroupNotFound, not all errors.

* ci: add weed/pb/** to S3 IAM test trigger paths

* test: replace time.Sleep with require.Eventually for propagation waits

Use polling with timeout instead of fixed sleeps to reduce flakiness
in integration tests waiting for IAM policy propagation.

* fix: use credentialManager.GetPolicy for AttachGroupPolicy validation

Policies created via CreatePolicy through credentialManager are stored
in the credential store, not in s3cfg.Policies (which only has static
config policies). Change AttachGroupPolicy to use credentialManager.GetPolicy()
for policy existence validation.

* feat: add UpdateGroup handler to embedded IAM API

Add UpdateGroup action to enable/disable groups and rename groups
via the IAM API. This is a SeaweedFS extension (not in AWS SDK) used
by tests to toggle group disabled status.

* fix: authenticate raw IAM API calls in group tests

The embedded IAM endpoint rejects anonymous requests. Replace
callIAMAPI with callIAMAPIAuthenticated that uses JWT bearer token
authentication via the test framework.

* feat: add UpdateGroup handler to standalone IAM API

Mirror the embedded IAM UpdateGroup handler in the standalone IAM API
for parity.

* fix: add omitempty to Marker XML tags in group responses

Non-truncated responses should not emit an empty <Marker/> element.

* fix: distinguish backend errors from missing policies in AttachGroupPolicy

Return ServiceFailure for credential manager errors instead of masking
them as NoSuchEntity. Also switch ListGroupsForUser to use s3cfg.Groups
instead of in-memory reverse index to avoid stale data. Add duplicate
name check to UpdateGroup rename.

* fix: standalone IAM AttachGroupPolicy uses persisted policy store

Check managed policies from GetPolicies() instead of s3cfg.Policies
so dynamically created policies are found. Also add duplicate name
check to UpdateGroup rename.

* fix: rollback inline policies on UpdateUser PutPolicies failure

If PutPolicies fails after moving inline policies to the new username,
restore both the identity name and the inline policies map to their
original state to avoid a partial-write window.

* fix: correct test cleanup ordering for group tests

Replace scattered defers with single ordered t.Cleanup in each test
to ensure resources are torn down in reverse-creation order:
remove membership, detach policies, delete access keys, delete users,
delete groups, delete policies. Move bucket cleanup to parent test
scope and delete objects before bucket.

* fix: move identity nil check before map lookup and refine hasAttachedPolicies

Move the nil check on identity before accessing identity.Name to
prevent panic. Also refine hasAttachedPolicies to only consider groups
that are enabled and have actual policies attached, so membership in
a no-policy group doesn't incorrectly trigger IAM authorization.

* fix: fail group reload on unreadable or corrupt group files

Return errors instead of logging and continuing when group files
cannot be read or unmarshaled. This prevents silently applying a
partial IAM config with missing group memberships or policies.

* fix: use errors.Is for sql.ErrNoRows comparison in postgres group store

* docs: explain why group methods skip propagateChange

Group changes propagate to S3 servers via filer subscription
(watching /etc/iam/groups/) rather than gRPC RPCs, since there
are no group-specific RPCs in the S3 cache protocol.

* fix: remove unused policyNameFromArn and strings import

* fix: update service account ParentUser on user rename

When renaming a user via UpdateUser, also update ParentUser references
in service accounts to prevent them from becoming orphaned after the
next configuration reload.

* fix: wrap DetachGroupPolicy error with ErrPolicyNotAttached sentinel

Use credential.ErrPolicyNotAttached so groupErrorToHTTPStatus maps
it to 400 instead of falling back to 500.

* fix: use admin S3 client for bucket cleanup in enforcement test

The user S3 client may lack permissions by cleanup time since the
user is removed from the group in an earlier subtest. Use the admin
S3 client to ensure bucket and object cleanup always succeeds.

* fix: add nil guard for group param in propagating store log calls

Prevent potential nil dereference when logging group.Name in
CreateGroup and UpdateGroup of PropagatingCredentialStore.

* fix: validate Disabled field in UpdateGroup handlers

Reject values other than "true" or "false" with InvalidInputException
instead of silently treating them as false.

* fix: seed mergedGroups from existing groups in MergeS3ApiConfiguration

Previously the merge started with empty group maps, dropping any
static-file groups. Now seeds from existing iam.groups before
overlaying dynamic config, and builds the reverse index after
merging to avoid stale entries from overridden groups.

* fix: use errors.Is for filer_pb.ErrNotFound comparison in group loading

Replace direct equality (==) with errors.Is() to correctly match
wrapped errors, consistent with the rest of the codebase.

* fix: add ErrUserNotFound and ErrPolicyNotFound to groupErrorToHTTPStatus

Map these sentinel errors to 404 so AddGroupMember and
AttachGroupPolicy return proper HTTP status codes.

* fix: log cleanup errors in group integration tests

Replace fire-and-forget cleanup calls with error-checked versions
that log failures via t.Logf for debugging visibility.

* fix: prevent duplicate group test runs in CI matrix

The basic lane's -run "TestIAM" regex also matched TestIAMGroup*
tests, causing them to run in both the basic and group lanes.
Replace with explicit test function names.

* fix: add GIN index on groups.members JSONB for membership lookups

Without this index, ListGroupsForUser and membership queries
require full table scans on the groups table.

* fix: handle cross-directory moves in IAM config subscription

When a file is moved out of an IAM directory (e.g., /etc/iam/groups),
the dir variable was overwritten with NewParentPath, causing the
source directory change to be missed. Now also notifies handlers
about the source directory for cross-directory moves.

* fix: validate members/policies before deleting group in admin handler

AdminServer.DeleteGroup now checks for attached members and policies
before delegating to credentialManager, matching the IAM handler guards.

* fix: merge groups by name instead of blind append during filer load

Match the identity loader's merge behavior: find existing group
by name and replace, only append when no match exists. Prevents
duplicates when legacy and multi-file configs overlap.

* fix: check DeleteEntry response error when cleaning obsolete group files

Capture and log resp.Error from filer DeleteEntry calls during
group file cleanup, matching the pattern used in deleteGroupFile.

* fix: verify source user exists before no-op check in UpdateUser

Reorder UpdateUser to find the source identity first and return
NoSuchEntityException if not found, before checking if the rename
is a no-op. Previously a non-existent user renamed to itself
would incorrectly return success.

* fix: update service account parent refs on user rename in embedded IAM

The embedded IAM UpdateUser handler updated group membership but
not service account ParentUser fields, unlike the standalone handler.

* fix: replay source-side events for all handlers on cross-dir moves

Pass nil newEntry to bucket, IAM, and circuit-breaker handlers for
the source directory during cross-directory moves, so all watchers
can clear caches for the moved-away resource.

* fix: don't seed mergedGroups from existing iam.groups in merge

Groups are always dynamic (from filer), never static (from s3.config).
Seeding from iam.groups caused stale deleted groups to persist.
Now only uses config.Groups from the dynamic filer config.

* fix: add deferred user cleanup in TestIAMGroupUserDeletionSideEffect

Register t.Cleanup for the created user so it gets cleaned up
even if the test fails before the inline DeleteUser call.

* fix: assert UpdateGroup HTTP status in disabled group tests

Add require.Equal checks for 200 status after UpdateGroup calls
so the test fails immediately on API errors rather than relying
on the subsequent Eventually timeout.

* fix: trim whitespace from group name in filer store operations

Trim leading/trailing whitespace from group.Name before validation
in CreateGroup and UpdateGroup to prevent whitespace-only filenames.
Also merge groups by name during multi-file load to prevent duplicates.

* fix: add nil/empty group validation in gRPC store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics and invalid persistence.

* fix: add nil/empty group validation in postgres store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil member access and empty-name row inserts.

* fix: add name collision check in embedded IAM UpdateUser

The embedded IAM handler renamed users without checking if the
target name already existed, unlike the standalone handler.

* fix: add ErrGroupNotEmpty sentinel and map to HTTP 409

AdminServer.DeleteGroup now wraps conflict errors with
ErrGroupNotEmpty, and groupErrorToHTTPStatus maps it to
409 Conflict instead of 500.

* fix: use appropriate error message in GetGroupDetails based on status

Return "Group not found" only for 404, use "Failed to retrieve group"
for other error statuses instead of always saying "Group not found".

* fix: use backend-normalized group.Name in CreateGroup response

After credentialManager.CreateGroup may normalize the name (e.g.,
trim whitespace), use group.Name instead of the raw input for
the returned GroupData to ensure consistency.

* fix: add nil/empty group validation in memory store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil pointer dereference on map access.

* fix: reorder embedded IAM UpdateUser to verify source first

Find the source identity before checking for collisions, matching
the standalone handler's logic. Previously a non-existent user
renamed to an existing name would get EntityAlreadyExists instead
of NoSuchEntity.

* fix: handle same-directory renames in metadata subscription

Replay a delete event for the old entry name during same-directory
renames so handlers like onBucketMetadataChange can clean up stale
state for the old name.

* fix: abort GetGroups on non-ErrGroupNotFound errors

Only skip groups that return ErrGroupNotFound. Other errors (e.g.,
transient backend failures) now abort the handler and return the
error to the caller instead of silently producing partial results.

* fix: add aria-label and title to icon-only group action buttons

Add accessible labels to View and Delete buttons so screen readers
and tooltips provide meaningful context.

* fix: validate group name in saveGroup to prevent invalid filenames

Trim whitespace and reject empty names before writing group JSON
files, preventing creation of files like ".json".

* fix: add /etc/iam/groups to filer subscription watched directories

The groups directory was missing from the watched directories list,
so S3 servers in a cluster would not detect group changes made by
other servers via filer. The onIamConfigChange handler already had
code to handle group directory changes but it was never triggered.

* add direct gRPC propagation for group changes to S3 servers

Groups now have the same dual propagation as identities and policies:
direct gRPC push via propagateChange + async filer subscription.

- Add PutGroup/RemoveGroup proto messages and RPCs
- Add PutGroup/RemoveGroup in-memory cache methods on IAM
- Add PutGroup/RemoveGroup gRPC server handlers
- Update PropagatingCredentialStore to call propagateChange on group mutations

* reduce log verbosity for config load summary

Change ReplaceS3ApiConfiguration log from Infof to V(1).Infof
to avoid noisy output on every config reload.

* admin: show user groups in view and edit user modals

- Add Groups field to UserDetails and populate from credential manager
- Show groups as badges in user details view modal
- Add group management to edit user modal: display current groups,
  add to group via dropdown, remove from group via badge x button

* fix: remove duplicate showAlert that broke modal-alerts.js

admin.js defined showAlert(type, message) which overwrote the
modal-alerts.js version showAlert(message, type), causing broken
unstyled alert boxes. Remove the duplicate and swap all callers
in admin.js to use the correct (message, type) argument order.

* fix: unwrap groups API response in edit user modal

The /api/groups endpoint returns {"groups": [...]}, not a bare array.

* Update object_store_users_templ.go

* test: assert AccessDenied error code in group denial tests

Replace plain assert.Error checks with awserr.Error type assertion
and AccessDenied code verification, matching the pattern used in
other IAM integration tests.

* fix: propagate GetGroups errors in ShowGroups handler

getGroupsPageData was swallowing errors and returning an empty page
with 200 status. Now returns the error so ShowGroups can respond
with a proper error status.

* fix: reject AttachGroupPolicy when credential manager is nil

Previously skipped policy existence validation when credentialManager
was nil, allowing attachment of nonexistent policies. Now returns
a ServiceFailureException error.

* fix: preserve groups during partial MergeS3ApiConfiguration updates

UpsertIdentity calls MergeS3ApiConfiguration with a partial config
containing only the updated identity (nil Groups). This was wiping
all in-memory group state. Now only replaces groups when
config.Groups is non-nil (full config reload).

* fix: propagate errors from group lookup in GetObjectStoreUserDetails

ListGroups and GetGroup errors were silently ignored, potentially
showing incomplete group data in the UI.

* fix: use DOM APIs for group badge remove button to prevent XSS

Replace innerHTML with onclick string interpolation with DOM
createElement + addEventListener pattern. Also add aria-label
and title to the add-to-group button.

* fix: snapshot group policies under RLock to prevent concurrent map access

evaluateIAMPolicies was copying the map reference via groupMap :=
iam.groups under RLock then iterating after RUnlock, while PutGroup
mutates the map in-place. Now copies the needed policy names into
a slice while holding the lock.

* fix: add nil IAM check to PutGroup and RemoveGroup gRPC handlers

Match the nil guard pattern used by PutPolicy/DeletePolicy to
prevent nil pointer dereference when IAM is not initialized.
2026-03-09 11:54:32 -07:00
Chris Lu
d89eb8267f s3: use url.PathUnescape for X-Amz-Copy-Source header (#8545)
* s3: use url.PathUnescape for X-Amz-Copy-Source header (#8544)

The X-Amz-Copy-Source header is a URL-encoded path, not a query string.
Using url.QueryUnescape incorrectly converts literal '+' characters to
spaces, which can cause object key mismatches during copy operations.

Switch to url.PathUnescape in CopyObjectHandler, CopyObjectPartHandler,
and pathToBucketObjectAndVersion to correctly handle special characters
like '!', '+', and other RFC 3986 sub-delimiters that S3 clients may
percent-encode (e.g. '!' as %21).

* s3: add path validation to CopyObjectPartHandler

CopyObjectPartHandler was missing the validateTableBucketObjectPath
checks that CopyObjectHandler has, allowing potential path traversal
in the source bucket/object of copy part requests.

* s3: fix case-sensitive HeadersRegexp for copy source routing

The HeadersRegexp for X-Amz-Copy-Source used `%2F` which only matched
uppercase hex encoding. RFC 3986 allows both `%2F` and `%2f`, so
clients sending lowercase percent-encoding would bypass the copy
handler and hit PutObjectHandler instead. Add (?i) flag for
case-insensitive matching.

Also add test coverage for the versionId branch in
pathToBucketObjectAndVersion and for lowercase %2f routing.
2026-03-07 11:10:02 -08:00
Chris Lu
3f946fc0c0 mount: make metadata cache rebuilds snapshot-consistent (#8531)
* filer: expose metadata events and list snapshots

* mount: invalidate hot directory caches

* mount: read hot directories directly from filer

* mount: add sequenced metadata cache applier

* mount: apply metadata responses through cache applier

* mount: replay snapshot-consistent directory builds

* mount: dedupe self metadata events

* mount: factor directory build cleanup

* mount: replace proto marshal dedup with composite key and ring buffer

The dedup logic was doing a full deterministic proto.Marshal on every
metadata event just to produce a dedup key. Replace with a cheap
composite string key (TsNs|Directory|OldName|NewName).

Also replace the sliding-window slice (which leaked the backing array
unboundedly) with a fixed-size ring buffer that reuses the same array.

* filer: remove mutex and proto.Clone from request-scoped MetadataEventSink

MetadataEventSink is created per-request and only accessed by the
goroutine handling the gRPC call. The mutex and double proto.Clone
(once in Record, once in Last) were unnecessary overhead on every
filer write operation. Store the pointer directly instead.

* mount: skip proto.Clone for caller-owned metadata events

Add ApplyMetadataResponseOwned that takes ownership of the response
without cloning. Local metadata events (mkdir, create, flush, etc.)
are freshly constructed and never shared, so the clone is unnecessary.

* filer: only populate MetadataEvent on successful DeleteEntry

Avoid calling eventSink.Last() on error paths where the sink may
contain a partial event from an intermediate child deletion during
recursive deletes.

* mount: avoid map allocation in collectDirectoryNotifications

Replace the map with a fixed-size array and linear dedup. There are
at most 3 directories to notify (old parent, new parent, new child
if directory), so a 3-element array avoids the heap allocation on
every metadata event.

* mount: fix potential deadlock in enqueueApplyRequest

Release applyStateMu before the blocking channel send. Previously,
if the channel was full (cap 128), the send would block while holding
the mutex, preventing Shutdown from acquiring it to set applyClosed.

* mount: restore signature-based self-event filtering as fast path

Re-add the signature check that was removed when content-based dedup
was introduced. Checking signatures is O(1) on a small slice and
avoids enqueuing and processing events that originated from this
mount instance. The content-based dedup remains as a fallback.

* filer: send snapshotTsNs only in first ListEntries response

The snapshot timestamp is identical for every entry in a single
ListEntries stream. Sending it in every response message wastes
wire bandwidth for large directories. The client already reads
it only from the first response.

* mount: exit read-through mode after successful full directory listing

MarkDirectoryRefreshed was defined but never called, so directories
that entered read-through mode (hot invalidation threshold) stayed
there permanently, hitting the filer on every readdir even when cold.
Call it after a complete read-through listing finishes.

* mount: include event shape and full paths in dedup key

The previous dedup key only used Names, which could collapse distinct
rename targets. Include the event shape (C/D/U/R), source directory,
new parent path, and both entry names so structurally different events
are never treated as duplicates.

* mount: drain pending requests on shutdown in runApplyLoop

After receiving the shutdown sentinel, drain any remaining requests
from applyCh non-blockingly and signal each with errMetaCacheClosed
so callers waiting on req.done are released.

* mount: include IsDirectory in synthetic delete events

metadataDeleteEvent now accepts an isDirectory parameter so the
applier can distinguish directory deletes from file deletes. Rmdir
passes true, Unlink passes false.

* mount: fall back to synthetic event when MetadataEvent is nil

In mknod and mkdir, if the filer response omits MetadataEvent (e.g.
older filer without the field), synthesize an equivalent local
metadata event so the cache is always updated.

* mount: make Flush metadata apply best-effort after successful commit

After filer_pb.CreateEntryWithResponse succeeds, the entry is
persisted. Don't fail the Flush syscall if the local metadata cache
apply fails — log and invalidate the directory cache instead.
Also fall back to a synthetic event when MetadataEvent is nil.

* mount: make Rename metadata apply best-effort

The rename has already succeeded on the filer by the time we apply
the local metadata event. Log failures instead of returning errors
that would be dropped by the caller anyway.

* mount: make saveEntry metadata apply best-effort with fallback

After UpdateEntryWithResponse succeeds, treat local metadata apply
as non-fatal. Log and invalidate the directory cache on failure.
Also fall back to a synthetic event when MetadataEvent is nil.

* filer_pb: preserve snapshotTsNs on error in ReadDirAllEntriesWithSnapshot

Return the snapshot timestamp even when the first page fails, so
callers receive the snapshot boundary when partial data was received.

* filer: send snapshot token for empty directory listings

When no entries are streamed, send a final ListEntriesResponse with
only SnapshotTsNs so clients always receive the snapshot boundary.

* mount: distinguish not-found vs transient errors in lookupEntry

Return fuse.EIO for non-not-found filer errors instead of
unconditionally returning ENOENT, so transient failures don't
masquerade as missing entries.

* mount: make CacheRemoteObject metadata apply best-effort

The file content has already been cached successfully. Don't fail
the read if the local metadata cache update fails.

* mount: use consistent snapshot for readdir in direct mode

Capture the SnapshotTsNs from the first loadDirectoryEntriesDirect
call and store it on the DirectoryHandle. Subsequent batch loads
pass this stored timestamp so all batches use the same snapshot.

Also export DoSeaweedListWithSnapshot so mount can use it directly
with snapshot passthrough.

* filer_pb: fix test fake to send SnapshotTsNs only on first response

Match the server behavior: only the first ListEntriesResponse in a
page carries the snapshot timestamp, subsequent entries leave it zero.

* Fix nil pointer dereference in ListEntries stream consumers

Remove the empty-directory snapshot-only response from ListEntries
that sent a ListEntriesResponse with Entry==nil, which crashed every
raw stream consumer that assumed resp.Entry is always non-nil.

Also add defensive nil checks for resp.Entry in all raw ListEntries
stream consumers across: S3 listing, broker topic lookup, broker
topic config, admin dashboard, topic retention, hybrid message
scanner, Kafka integration, and consumer offset storage.

* Add nil guards for resp.Entry in remaining ListEntries stream consumers

Covers: S3 object lock check, MQ management dashboard (version/
partition/offset loops), and topic retention version loop.

* Make applyLocalMetadataEvent best-effort in Link and Symlink

The filer operations already succeeded; failing the syscall because
the local cache apply failed is wrong. Log a warning and invalidate
the parent directory cache instead.

* Make applyLocalMetadataEvent best-effort in Mkdir/Rmdir/Mknod/Unlink

The filer RPC already committed; don't fail the syscall when the
local metadata cache apply fails. Log a warning and invalidate the
parent directory cache to force a re-fetch on next access.

* flushFileMetadata: add nil-fallback for metadata event and best-effort apply

Synthesize a metadata event when resp.GetMetadataEvent() is nil
(matching doFlush), and make the apply best-effort with cache
invalidation on failure.

* Prevent double-invocation of cleanupBuild in doEnsureVisited

Add a cleanupDone guard so the deferred cleanup and inline error-path
cleanup don't both call DeleteFolderChildren/AbortDirectoryBuild.

* Fix comment: signature check is O(n) not O(1)

* Prevent deferred cleanup after successful CompleteDirectoryBuild

Set cleanupDone before returning from the success path so the
deferred context-cancellation check cannot undo a published build.

* Invalidate parent directory caches on rename metadata apply failure

When applyLocalMetadataEvent fails during rename, invalidate the
source and destination parent directory caches so subsequent accesses
trigger a re-fetch from the filer.

* Add event nil-fallback and cache invalidation to Link and Symlink

Synthesize metadata events when the server doesn't return one, and
invalidate parent directory caches on apply failure.

* Match requested partition when scanning partition directories

Parse the partition range format (NNNN-NNNN) and match against the
requested partition parameter instead of using the first directory.

* Preserve snapshot timestamp across empty directory listings

Initialize actualSnapshotTsNs from the caller-requested value so it
isn't lost when the server returns no entries. Re-add the server-side
snapshot-only response for empty directories (all raw stream consumers
now have nil guards for Entry).

* Fix CreateEntry error wrapping to support errors.Is/errors.As

Use errors.New + %w instead of %v for resp.Error so callers can
unwrap the underlying error.

* Fix object lock pagination: only advance on non-nil entries

Move entriesReceived inside the nil check so nil entries don't
cause repeated ListEntries calls with the same lastFileName.

* Guard Attributes nil check before accessing Mtime in MQ management

* Do not send nil-Entry response for empty directory listings

The snapshot-only ListEntriesResponse (with Entry == nil) for empty
directories breaks consumers that treat any received response as an
entry (Java FilerClient, S3 listing). The Go client-side
DoSeaweedListWithSnapshot already preserves the caller-requested
snapshot via actualSnapshotTsNs initialization, so the server-side
send is unnecessary.

* Fix review findings: subscriber dedup, invalidation normalization, nil guards, shutdown race

- Remove self-signature early-return in processEventFn so all events
  flow through the applier (directory-build buffering sees self-originated
  events that arrive after a snapshot)
- Normalize NewParentPath in collectEntryInvalidations to avoid duplicate
  invalidations when NewParentPath is empty (same-directory update)
- Guard resp.Entry.Attributes for nil in admin_server.go and
  topic_retention.go to prevent panics on entries without attributes
- Fix enqueueApplyRequest race with shutdown by using select on both
  applyCh and applyDone, preventing sends after the apply loop exits
- Add cleanupDone check to deferred cleanup in meta_cache_init.go for
  clarity alongside the existing guard in cleanupBuild
- Add empty directory test case for snapshot consistency

* Propagate authoritative metadata event from CacheRemoteObjectToLocalCluster and generate client-side snapshot for empty directories

- Add metadata_event field to CacheRemoteObjectToLocalClusterResponse
  proto so the filer-emitted event is available to callers
- Use WithMetadataEventSink in the server handler to capture the event
  from NotifyUpdateEvent and return it on the response
- Update filehandle_read.go to prefer the RPC's metadata event over
  a locally fabricated one, falling back to metadataUpdateEvent when
  the server doesn't provide one (e.g., older filers)
- Generate a client-side snapshot cutoff in DoSeaweedListWithSnapshot
  when the server sends no snapshot (empty directory), so callers like
  CompleteDirectoryBuild get a meaningful boundary for filtering
  buffered events

* Skip directory notifications for dirs being built to prevent mid-build cache wipe

When a metadata event is buffered during a directory build,
applyMetadataSideEffects was still firing noteDirectoryUpdate for the
building directory. If the directory accumulated enough updates to
become "hot", markDirectoryReadThrough would call DeleteFolderChildren,
wiping entries that EnsureVisited had already inserted. The build would
then complete and mark the directory cached with incomplete data.

Fix by using applyMetadataSideEffectsSkippingBuildingDirs for buffered
events, which suppresses directory notifications for dirs currently in
buildingDirs while still applying entry invalidations.

* Add test for directory notification suppression during active build

TestDirectoryNotificationsSuppressedDuringBuild verifies that metadata
events targeting a directory under active EnsureVisited build do NOT
fire onDirectoryUpdate for that directory. In production, this prevents
markDirectoryReadThrough from calling DeleteFolderChildren mid-build,
which would wipe entries already inserted by the listing.

The test inserts an entry during a build, sends multiple metadata events
for the building directory, asserts no notifications fired for it,
verifies the entry survives, and confirms buffered events are replayed
after CompleteDirectoryBuild.

* Fix create invalidations, build guard, event shape, context, and snapshot error path

- collectEntryInvalidations: invalidate FUSE kernel cache on pure
  create events (OldEntry==nil && NewEntry!=nil), not just updates
  and deletes
- completeDirectoryBuildNow: only call markCachedFn when an active
  build existed (state != nil), preventing an unpopulated directory
  from being marked as cached
- Add metadataCreateEvent helper that produces a create-shaped event
  (NewEntry only, no OldEntry) and use it in mkdir, mknod, symlink,
  and hardlink create fallback paths instead of metadataUpdateEvent
  which incorrectly set both OldEntry and NewEntry
- applyMetadataResponseEnqueue: use context.Background() for the
  queued mutation so a cancelled caller context cannot abort the
  apply loop mid-write
- DoSeaweedListWithSnapshot: move snapshot initialization before
  ListEntries call so the error path returns the preserved snapshot
  instead of 0

* Fix review findings: test loop, cache race, context safety, snapshot consistency

- Fix build test loop starting at i=1 instead of i=0, missing new-0.txt verification
- Re-check IsDirectoryCached after cache miss to avoid ENOENT race with markDirectoryReadThrough
- Use context.Background() in enqueueAndWait so caller cancellation can't abort build/complete mid-way
- Pass dh.snapshotTsNs in skip-batch loadDirectoryEntriesDirect for snapshot consistency
- Prefer resp.MetadataEvent over fallback in Unlink event derivation
- Add comment on MetadataEventSink.Record single-event assumption

* Fix empty-directory snapshot clock skew and build cancellation race

Empty-directory snapshot: Remove client-side time.Now() synthesis when
the server returns no entries. Instead return snapshotTsNs=0, and in
completeDirectoryBuildNow replay ALL buffered events when snapshot is 0.
This eliminates the clock-skew bug where a client ahead of the filer
would filter out legitimate post-list events.

Build cancellation: Use context.Background() for BeginDirectoryBuild
and CompleteDirectoryBuild calls in doEnsureVisited, so errgroup
cancellation doesn't cause enqueueAndWait to return early and trigger
cleanupBuild while the operation is still queued.

* Add tests for empty-directory build replay and cancellation resilience

TestEmptyDirectoryBuildReplaysAllBufferedEvents: verifies that when
CompleteDirectoryBuild receives snapshotTsNs=0 (empty directory, no
server snapshot), ALL buffered events are replayed regardless of their
TsNs values — no clock-skew-sensitive filtering occurs.

TestBuildCompletionSurvivesCallerCancellation: verifies that once
CompleteDirectoryBuild is enqueued, a cancelled caller context does not
prevent the build from completing. The apply loop runs with
context.Background(), so the directory becomes cached and buffered
events are replayed even when the caller gives up waiting.

* Fix directory subtree cleanup, Link rollback, test robustness

- applyMetadataResponseLocked: when a directory entry is deleted or
  moved, call DeleteFolderChildren on the old path so cached descendants
  don't leak as stale entries.

- Link: save original HardLinkId/Counter before mutation. If
  CreateEntryWithResponse fails after the source was already updated,
  rollback the source entry to its original state via UpdateEntry.

- TestBuildCompletionSurvivesCallerCancellation: replace fixed
  time.Sleep(50ms) with a deadline-based poll that checks
  IsDirectoryCached in a loop, failing only after 2s timeout.

- TestReadDirAllEntriesWithSnapshotEmptyDirectory: assert that
  ListEntries was actually invoked on the mock client so the test
  exercises the RPC path.

- newMetadataEvent: add early return when both oldEntry and newEntry are
  nil to avoid emitting events with empty Directory.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-07 09:19:40 -08:00
Chris Lu
540fc97e00 s3/iam: reuse one request id per request (#8538)
* request_id: add shared request middleware

* s3err: preserve request ids in responses and logs

* iam: reuse request ids in XML responses

* sts: reuse request ids in XML responses

* request_id: drop legacy header fallback

* request_id: use AWS-style request id format

* iam: fix AWS-compatible XML format for ErrorResponse and field ordering

- ErrorResponse uses bare <RequestId> at root level instead of
  <ResponseMetadata> wrapper, matching the AWS IAM error response spec
- Move CommonResponse to last field in success response structs so
  <ResponseMetadata> serializes after result elements
- Add randomness to request ID generation to avoid collisions
- Add tests for XML ordering and ErrorResponse format

* iam: remove duplicate error_response_test.go

Test is already covered by responses_test.go.

* address PR review comments

- Guard against typed nil pointers in SetResponseRequestID before
  interface assertion (CodeRabbit)
- Use regexp instead of strings.Index in test helpers for extracting
  request IDs (Gemini)

* request_id: prevent spoofing, fix nil-error branch, thread reqID to error writers

- Ensure() now always generates a server-side ID, ignoring client-sent
  x-amz-request-id headers to prevent request ID spoofing. Uses a
  private context key (contextKey{}) instead of the header string.
- writeIamErrorResponse in both iamapi and embedded IAM now accepts
  reqID as a parameter instead of calling Ensure() internally, ensuring
  a single request ID per request lifecycle.
- The nil-iamError branch in writeIamErrorResponse now writes a 500
  Internal Server Error response instead of returning silently.
- Updated tests to set request IDs via context (not headers) and added
  tests for spoofing prevention and context reuse.

* sts: add request-id consistency assertions to ActionInBody tests

* test: update admin test to expect server-generated request IDs

The test previously sent a client x-amz-request-id header and expected
it echoed back. Since Ensure() now ignores client headers to prevent
spoofing, update the test to verify the server returns a non-empty
server-generated request ID instead.

* iam: add generic WithRequestID helper alongside reflection-based fallback

Add WithRequestID[T] that uses generics to take the address of a value
type, satisfying the pointer receiver on SetRequestId without reflection.

The existing SetResponseRequestID is kept for the two call sites that
operate on interface{} (from large action switches where the concrete
type varies at runtime). Generics cannot replace reflection there since
Go cannot infer type parameters from interface{}.

* Remove reflection and generics from request ID setting

Call SetRequestId directly on concrete response types in each switch
branch before boxing into interface{}, eliminating the need for
WithRequestID (generics) and SetResponseRequestID (reflection).

* iam: return pointer responses in action dispatch

* Fix IAM error handling consistency and ensure request IDs on all responses

- UpdateUser/CreatePolicy error branches: use writeIamErrorResponse instead
  of s3err.WriteErrorResponse to preserve IAM formatting and request ID
- ExecuteAction: accept reqID parameter and generate one if empty, ensuring
  every response carries a RequestId regardless of caller

* Clean up inline policies on DeleteUser and UpdateUser rename

DeleteUser: remove InlinePolicies[userName] from policy storage before
removing the identity, so policies are not orphaned.

UpdateUser: move InlinePolicies[userName] to InlinePolicies[newUserName]
when renaming, so GetUserPolicy/DeleteUserPolicy work under the new name.

Both operations persist the updated policies and return an error if
the storage write fails, preventing partial state.
2026-03-06 15:22:39 -08:00
Aaron
14cd0f53ba Places the CommonResponse struct at the *end* of all IAM responses. (#8537)
* Places the CommonResponse struct at the end of all IAM responses, rather than the start.

* iam: fix error response request id layout

* iam: add XML ordering regression test

* iam: share request id generation

---------

Co-authored-by: Aaron Segal <aaron.segal@rpsolutions.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-06 12:53:23 -08:00
Chris Lu
f9311a3422 s3api: fix static IAM policy enforcement after reload (#8532)
* s3api: honor attached IAM policies over legacy actions

* s3api: hydrate IAM policy docs during config reload

* s3api: use policy-aware auth when listing buckets

* credential: propagate context through filer_etc policy reads

* credential: make legacy policy deletes durable

* s3api: exercise managed policy runtime loader

* s3api: allow static IAM users without session tokens

* iam: deny unmatched attached policies under default allow

* iam: load embedded policy files from filer store

* s3api: require session tokens for IAM presigning

* s3api: sync runtime policies into zero-config IAM

* credential: respect context in policy file loads

* credential: serialize legacy policy deletes

* iam: align filer policy store naming

* s3api: use authenticated principals for presigning

* iam: deep copy policy conditions

* s3api: require request creation in policy tests

* filer: keep ReadInsideFiler as the context-aware API

* iam: harden filer policy store writes

* credential: strengthen legacy policy serialization test

* credential: forward runtime policy loaders through wrapper

* s3api: harden runtime policy merging

* iam: require typed already-exists errors
2026-03-06 12:35:08 -08:00