Commit Graph

264 Commits

Author SHA1 Message Date
Chris Lu
e7fc243ee1 Fix flaky s3tables tests: allocate all ports atomically
The test port allocation had a TOCTOU race where GetFreePort() would
open a listener, grab the port number, then immediately close it.
When called repeatedly, the OS could recycle a just-released port,
causing two services (e.g. Filer and S3) to be assigned the same port.

Replace per-call GetFreePort() with batch AllocatePorts() that holds
all listeners open until every port is obtained, matching the pattern
already used in test/volume_server/framework/cluster.go.
2026-04-02 11:58:03 -07:00
Chris Lu
647b46bd8a Fix flaky FUSE integration tests
- concurrent_operations_test: Add retry loop for transient I/O errors
  on file close during ConcurrentDirectoryOperations
- git_operations_test: Wait for pushed objects to become visible through
  FUSE mount before cloning in Phase 3
2026-04-02 11:49:13 -07:00
Chris Lu
af68449a26 Process .ecj deletions during EC decode and vacuum decoded volume (#8863)
* Process .ecj deletions during EC decode and vacuum decoded volume (#8798)

When decoding EC volumes back to normal volumes, deletions recorded in
the .ecj journal were not being applied before computing the dat file
size or checking for live needles. This caused the decoded volume to
include data for deleted files and could produce false positives in the
all-deleted check.

- Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in
  VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first
- Vacuum the decoded volume after mounting in ec.decode to compact out
  deleted needle data from the .dat file
- Add integration tests for decoding with non-empty .ecj files

* storage: add offline volume compaction helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: compact decoded volumes before deleting shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: address PR review comments

- Fall back to data directory for .ecx when idx directory lacks it
- Make compaction failure non-fatal during EC decode
- Remove misleading "buffer: 10%" from space check error message

* ec: collect .ecj from all shard locations during decode

Each server's .ecj only contains deletions for needles whose data
resides in shards held by that server. Previously, sources with no
new data shards to contribute were skipped entirely, losing their
.ecj deletion entries. Now .ecj is always appended from every shard
location so RebuildEcxFile sees the full set of deletions.

* ec: add integration tests for .ecj collection during decode

TestEcDecodePreservesDeletedNeedles: verifies that needles deleted
via VolumeEcBlobDelete are excluded from the decoded volume.

TestEcDecodeCollectsEcjFromPeer: regression test for the fix in
collectEcShards. Deletes a needle only on a peer server that holds
no new data shards, then verifies the deletion survives decode via
.ecj collection.

* ec: address review nits in decode and tests

- Remove double error wrapping in mountDecodedVolume
- Check VolumeUnmount error in peer ecj test
- Assert 404 specifically for deleted needles, fail on 5xx

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-01 01:15:26 -07:00
Chris Lu
0ce4a857e6 test: add dcache stabilisation check after FUSE git reset
After git reset --hard on a FUSE mount, the kernel dcache can
transiently show the directory then drop it moments later. Add a
1-second stabilisation delay and re-verification in
resetToCommitWithRecovery and tryPullFromCommit so that recovery
retries if the entry vanishes in that window.
2026-03-30 17:26:43 -07:00
Chris Lu
d5068b3ee6 test: harden weed mini readiness checks 2026-03-30 16:21:38 -07:00
dependabot[bot]
39ab2ef6ad build(deps): bump golang.org/x/image from 0.36.0 to 0.38.0 in /test/kafka (#8843)
build(deps): bump golang.org/x/image in /test/kafka

Bumps [golang.org/x/image](https://github.com/golang/image) from 0.36.0 to 0.38.0.
- [Commits](https://github.com/golang/image/compare/v0.36.0...v0.38.0)

---
updated-dependencies:
- dependency-name: golang.org/x/image
  dependency-version: 0.38.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-30 09:50:06 -07:00
Chris Lu
d074830016 fix(worker): pass compaction revision and file sizes in EC volume copy (#8835)
* fix(worker): pass compaction revision and file sizes in EC volume copy

The worker EC task was sending CopyFile requests without the current
compaction revision (defaulting to 0) and with StopOffset set to
math.MaxInt64.  After a vacuum compaction this caused the volume server
to reject the copy or return stale data.

Read the volume file status first and forward the compaction revision
and actual file sizes so the copy is consistent with the compacted
volume.

* propagate erasure coding task context

* fix(worker): validate volume file status and detect short copies

Reject zero dat file size from ReadVolumeFileStatus — a zero-sized
snapshot would produce 0-byte copies and broken EC shards.

After streaming, verify totalBytes matches the expected stopOffset
and return an error on short copies instead of logging success.

* fix(worker): reject zero idx file size in volume status validation

A non-empty dat with zero idx indicates an empty or corrupt volume.
Without this guard, copyFileFromSource gets stopOffset=0, produces a
0-byte .idx, passes the short-copy check, and generateEcShardsLocally
runs against a volume with no index.

* fix fake plugin volume file status

* fix plugin volume balance test fixtures
2026-03-29 18:47:15 -07:00
Chris Lu
ce02cf3c9d stabilize FUSE git reset recovery (#8825) 2026-03-29 00:04:32 -07:00
Chris Lu
0884acd70c ci: add S3 mutation regression coverage (#8804)
* test/s3: stabilize distributed lock regression

* ci: add S3 mutation regression workflow

* test: fix delete regression readiness probe

* test: address mutation regression review comments
2026-03-28 20:17:20 -07:00
Chris Lu
0adb78bc6b s3api: make conditional mutations atomic and AWS-compatible (#8802)
* s3api: serialize conditional write finalization

* s3api: add conditional delete mutation checks

* s3api: enforce destination conditions for copy

* s3api: revalidate multipart completion under lock

* s3api: rollback failed put finalization hooks

* s3api: report delete-marker version deletions

* s3api: fix copy destination versioning edge cases

* s3api: make versioned multipart completion idempotent

* test/s3: cover conditional mutation regressions

* s3api: rollback failed copy version finalization

* s3api: resolve suspended delete conditions via latest entry

* s3api: remove copy test null-version injection

* s3api: reject out-of-order multipart completions

* s3api: preserve multipart replay version metadata

* s3api: surface copy destination existence errors

* s3api: simplify delete condition target resolution

* test/s3: make conditional delete assertions order independent

* test/s3: add distributed lock gateway integration

* s3api: fail closed multipart versioned completion

* s3api: harden copy metadata and overwrite paths

* s3api: create delete markers for suspended deletes

* s3api: allow duplicate multipart completion parts
2026-03-27 19:22:26 -07:00
Chris Lu
bf2a2d2538 test: preserve branch when recovering bare git repo (#8803)
* test: preserve branch when recovering bare git repo

* Replaced the standalone ensureMountClone + gitRun in Phase 5 with a new resetToCommitWithRecovery function that mirrors   the existing pullFromCommitWithRecovery pattern
2026-03-27 17:13:00 -07:00
Chris Lu
db9ea7c87c fix(fuse-test): recover from FUSE directory loss in git pull test (#8789)
* fix(fuse-test): harden ensureMountClone to verify .git/HEAD

On FUSE, the kernel dcache can retain a stale directory entry after
heavy git operations. Checking only os.Stat(mountClone) may find
the top-level directory but the clone internals (.git/HEAD) are gone.

Now ensureMountClone verifies .git/HEAD exists, cleans up stale
remnants with os.RemoveAll before re-cloning, and adds a brief
settle delay for the FUSE metadata cache.

* fix(fuse-test): add pull recovery loop for Phase 6 git operations

After git reset --hard on a FUSE mount, the kernel dcache can
permanently lose the clone directory entry. The existing retry logic
polls for 60+ seconds but the directory never recovers.

Add pullFromCommitWithRecovery which wraps the Phase 6 pull in a
recovery loop: if the clone directory vanishes, it removes the stale
clone, re-creates it from the bare repo, resets to the target commit,
and retries the pull (up to 3 attempts).

Also adds tryGitCommand, a non-fatal git runner that returns
(output, error) instead of calling require.NoError, enabling the
recovery loop to handle failures gracefully without aborting the test.

Fixes flaky TestGitOperations/CloneAndPull on CI.

* fix(fuse-test): make recovery loop fully non-fatal

Address review feedback:

- Extract tryEnsureMountClone (returns error) so a transient FUSE
  failure during re-clone doesn't abort the test — the recovery loop
  can retry instead.
- Check waitForDirEventually return after git reset --hard and return
  a specific error if the directory doesn't recover.

* style(fuse-test): simplify ensureMountClone error check

* fix(fuse-test): recover bare repo when FUSE mount loses it

CI showed that not just the working clone but also the bare repo on
the FUSE mount can vanish after heavy git operations. The recovery
loop now re-creates the bare repo from the local clone (which lives
on local disk and is always available) before attempting to re-clone.

Adds tryEnsureBareRepo: checks HEAD exists in the bare repo, and if
not, re-inits and force-pushes from the local clone.
2026-03-26 19:33:56 -07:00
Chris Lu
ba624f1f34 Rust volume server implementation with CI (#8539)
* Match Go gRPC client transport defaults

* Honor Go HTTP idle timeout

* Honor maintenanceMBps during volume copy

* Honor images.fix.orientation on uploads

* Honor cpuprofile when pprof is disabled

* Match Go memory status payloads

* Propagate request IDs across gRPC calls

* Format pending Rust source updates

* Match Go stats endpoint payloads

* Serve Go volume server UI assets

* Enforce Go HTTP whitelist guards

* Align Rust metrics admin-port test with Go behavior

* Format pending Rust server updates

* Honor access.ui without per-request JWT checks

* Honor keepLocalDatFile in tier upload shortcut

* Honor Go remote volume write mode

* Load tier backends from master config

* Check master config before loading volumes

* Remove vif files on volume destroy

* Delete remote tier data on volume destroy

* Honor vif version defaults and overrides

* Reject mismatched vif bytes offsets

* Load remote-only tiered volumes

* Report Go tail offsets in sync status

* Stream remote dat in incremental copy

* Honor collection vif for EC shard config

* Persist EC expireAtSec in vif metadata

* Stream remote volume reads through HTTP

* Serve HTTP ranges from backend source

* Match Go ReadAllNeedles scan order

* Match Go CopyFile zero-stop metadata

* Delete EC volumes with collection cleanup

* Drop deleted collection metrics

* Match Go tombstone ReadNeedleMeta

* Match Go TTL parsing: all-digit default to minutes, two-pass fit algorithm

* Match Go needle ID/cookie formatting and name size computation

* Match Go image ext checks: webp resize only, no crop; empty healthz body

* Match Go Prometheus metric names and add missing handler counter constants

* Match Go ReplicaPlacement short string parsing with zero-padding

* Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS

* Add walk_ecx_stats for accurate EC volume file counts and size

* Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision

* Match Go needle map: unconditional delete counter, fix redb idx walk offset

* Add CompactMapSegment overflow panic guard matching Go

* Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback

* Match Go 304 Not Modified: return bare status with no headers

* Match Go JWT error message: use "wrong jwt" instead of detailed error

* Match Go read handler bare 400, delete error prefix, download throttle timeout

* Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix

* Match Go heartbeat: keep is_heartbeating on error, add EC shard identification

* Match Go needle ReadBytes V2: tolerate EOF on truncated body

* Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard

* Match Go DeleteCollection: propagate destroy errors

* Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code

* Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding

* Match Go Guard isWriteActive: combine whitelist and signing key check

* Match Go DeleteCollectionMetrics: use partial label matching

* Match Go heartbeat: send state-only delta on volume state changes

* Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking

* Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs

* Add read_ec_shard_needle for full needle reconstruction from local EC shards

* Make heartbeat master config helpers pub for VolumeCopy preallocation

* Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording

* Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error

* Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling

* Add rebuild_ecx_file for EC index reconstruction from data shards

* Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation

* Add EC volume read and delete support in HTTP handlers

* Add per-shard EC mount/unmount, location predicate search, idx directory for EC

* Add CheckVolumeDataIntegrity on volume load matching Go

* Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type

* Match Go heartbeat metric accounting

* Match Go duplicate UUID heartbeat retries

* Delete expired EC volumes during heartbeat

* Match Go volume heartbeat pruning

* Honor master preallocate in volume max

* Report remote storage info in heartbeats

* Emit EC heartbeat deltas on shard changes

* Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space

* Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns

* Match Go VolumeUnmount: idempotent success when volume not found

* Match Go TTL Display: return empty string when unit is Empty

Go checks `t.Unit == Empty` separately and returns "" for TTLs
with nonzero count but Empty unit. Rust only checked is_empty()
(count==0 && unit==0), so count>0 with unit=0 would format as
"5 " instead of "".

* Match Go error behavior for truncated needle data in read_body_v2

Go's readNeedleDataVersion2 returns "index out of range %d" errors
(indices 1-7) when needle body or metadata fields are truncated.
Rust was silently tolerating truncation and returning Ok. Now returns
NeedleError::IndexOutOfRange with the matching index for each field.

* Match Go download throttle: return JSON error instead of plain text

* Match Go crop params: default x1/y1 to 0 when not provided

* Match Go ScrubEcVolume: accumulate total_files from EC shards

* Match Go ScrubVolume: count total_files even on scrub error

* Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif

* Match Go VolumeTailSender: send needle_header on every chunk

* Match Go read_super_block: apply replication override from .vif

* Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption

* Match Go WriteNeedleBlob: dedup check before writing during replication

* handlers: use meta-only reads for HEAD

* handlers: align range parsing and responses with Go

* handlers: align upload parsing with Go

* deps: enable webp support

* Make 5bytes the default feature for idx entry compatibility

* Match Go TTL: preserve original unit when count fits in byte

* Fix EC locate_needle: use get_actual_size for full needle size

* Fix raw body POST: only parse multipart when Content-Type contains form-data

* Match Go ReceiveFile: return protocol errors in response body, not gRPC status

* add docs

* Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating

* Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing

* Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection

* Match Go privateStoreHandler: use proper JSON error for unsupported methods

* Match Go Destroy: add only_empty parameter to reject non-empty volume deletion

* Fix compilation: set_read_only_persist and set_writable return ()

These methods fire-and-forget save_vif internally, so gRPC callers
should not try to chain .map_err() on the unit return type.

* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif

* Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth

The gRPC VolumeDelete handler had a pre-check for only_empty but then
passed false to store.delete_volume(), bypassing the store-level check.
Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same
for defense in depth against TOCTOU races (though the store write lock
makes this unlikely).

* Match Go ProcessRangeRequest: return full content for empty/oversized ranges

Go returns nil from ProcessRangeRequest when ranges are empty or total
range size exceeds content length, causing the caller to serve the full
content as a normal 200 response. Rust was returning an empty 200 body.

* Match Go Query: quote JSON keys in output records

Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}.
Rust was producing invalid JSON with unquoted keys like {name:"Alice"}.

* Match Go VolumeCopy: reject when no suitable disk location exists

Go returns ErrVolumeNoSpaceLeft when no location matches the disk type
and has sufficient space. Rust had an unsafe fallback that silently
picked the first location regardless of type or available space.

* Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete

Go checks v.noWriteOrDelete before proceeding with needle deletion,
returning "volume is read only" if true. Rust was skipping this check.

* Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors

Two fixes: (1) Go prefers HardDriveType disk location for EC volumes,
falling back to first location. Returns "no storage location available"
when no locations exist. (2) Write failures are now response-level
errors (in response body) instead of gRPC status errors, matching Go.

* Match Go CopyFile: sync EC volume journal to disk before copying

Go calls ecVolume.Sync() before copying EC volume files to ensure the
.ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and
call it in the CopyFile EC branch.

* Match Go readSuperBlock: propagate replication parse errors

Go returns an error when parsing the replication string from the .vif
file fails. Rust was silently ignoring the parse failure and using the
super block's replication as-is.

* Match Go TTL expiry: remove append_at_ns > 0 guard

Go computes TTL expiry from AppendAtNs without guarding against zero.
When append_at_ns is 0, the expiry is epoch + TTL which is in the past,
correctly returning NotFound. Rust's extra guard skipped the check,
incorrectly returning success for such needles.

* Match Go delete_collection: skip volumes with compaction in progress

Go checks !v.isCompactionInProgress.Load() before destroying a volume
during collection deletion, skipping compacting volumes. Also changed
destroy errors to log instead of aborting the entire collection delete.

* Match Go MarkReadonly/MarkWritable: always notify master even on local error

Go always notifies the master regardless of whether the local
set_read_only_persist or set_writable step fails. The Rust code was
using `?` which short-circuited on error, skipping the final master
notification. Save the result and defer the `?` until after the
notify call.

* Match Go PostHandler: return 500 for all write errors

Go returns 500 (InternalServerError) for all write failures. Rust was
returning 404 for volume-not-found and 403 for read-only volumes.

* Match Go makeupDiff: validate .cpd compaction revision is old + 1

Go reads the new .cpd file's super block and verifies the compaction
revision is exactly old + 1. Rust only validated the old revision.

* Match Go VolumeStatus: check data backend before returning status

Go checks v.DataBackend != nil before building the status response,
returning an error if missing. Rust was silently returning size 0.

* Match Go PostHandler: always include mime field in upload response JSON

Go always serializes the mime field even when empty ("mime":""). Rust was
omitting it when empty due to Option<String> with skip_serializing_if.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go subtracts EC shard equivalents when computing available volume slots.
Rust was only comparing volume count, potentially over-counting free
slots on locations with many EC shards.

* Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods

Go records the method as INVALID in metrics for unsupported HTTP methods.
Rust was using the actual method name.

* Match Go volume: add commit_compact guard and scrub data size validation

Two fixes: (1) commit_compact now checks/sets is_compacting flag to
prevent concurrent commits, matching Go's CompareAndSwap guard.
(2) scrub now validates total needle sizes against .dat file size.

* Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check

Three fixes: (1) VolumeTailSender now propagates binary search errors
instead of silently falling back to start. (2) VolumeEcShardsInfo
returns entries for all shard slots including unmounted. (3)
VolumeEcShardRead checks .ecx index for deletions instead of .ecj.

* Match Go metrics: add BuildInfo gauge and connection tracking functions

Go exposes a BuildInfo Prometheus metric with version labels, and tracks
open connections via stats.ConnectionOpen/Close. Added both to Rust.

* Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid()

Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles
with size==0 (live but anomalous) can still be deleted. The Rust code
was using is_valid() which returns false for size==0, preventing
deletion of such needles.

* Match Go fitTtlCount: always normalize TTL to coarsest unit

Go's fitTtlCount always converts to seconds first, then finds the
coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an
early return for count<=255 that skipped normalization, producing
different binary encodings for the same duration.

* Match Go BuildInfo metric: correct name and add missing labels

Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build,
Name=info) with labels [version, commit, sizelimit, goos, goarch].
Rust had SeaweedFS_volumeServer_buildInfo with only [version].

* Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag

- UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty)
- UploadResult.contentMd5: only include when request provided Content-MD5 header
- Content-MD5 response header: only set when request provided it
- DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType)
  to match Go's protobuf JSON marshaling
- Chunk manifest: preserve needle ETag in expanded response headers

* Match Go volume: fix version(), integrity check, scrub, and commit_compact

- version(): use self.version() instead of self.super_block.version in
  read_all_needles, check_volume_data_integrity, scan_raw_needles_from
  to respect volumeInfo.version override
- check_volume_data_integrity: initialize healthy_index_size to idx_size
  (matching Go) and continue on EOF instead of returning error
- scrub(): count deleted needles in total_read since they still occupy
  space in the .dat file (matches Go's totalRead += actualSize for deleted)
- commit_compact: clean up .cpd/.cpx files on makeup_diff failure
  (matches Go's error path cleanup)

* Match Go write queue: add 4MB batch byte limit

Go's startWorker breaks the batch at either 128 requests or 4MB of
accumulated write data. Rust only had the 128-request limit, allowing
large writes to accumulate unbounded latency.

* Add TTL normalization tests for Go parity verification

Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even
when count fits in a byte, matching Go's fitTtlCount behavior.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go's free volume count subtracts both regular volumes and EC volumes
from max_volume_count. Rust was only counting regular volumes, which
could over-report available slots when EC shards are mounted.

* Match Go EC volume: mark deletions in .ecx and replay .ecj at startup

Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index
in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition
to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj
entries into .ecx on startup, then removes the .ecj file.

Rust was only appending to .ecj without marking .ecx, which meant
deleted EC needles remained readable via .ecx binary search. This
fix:
- Opens .ecx in read/write mode (was read-only)
- Adds mark_needle_deleted_in_ecx: binary search + in-place write
- Calls it from journal_delete before appending to .ecj
- Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup

* Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14

Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by
default, but custom EC configs via .vif can have more shards (up to
MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files
are checked regardless of EC configuration.

* Match Go EC locate: subtract 1 from shard size and use datFileSize override

Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to
LocateData (shards are padded, -1 avoids overcounting large block
rows). When datFileSize is known, Go uses datFileSize/DataShards
instead. Rust was passing the raw shard file size without adjustment.

* Fix TTL parsing and DiskStatus field names to match Go exactly

TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d,
not 1w) and errors on count > 255. The previous normalization change
was incorrect — Go only normalizes internally via fitTtlCount, not
during string parsing.

DiskStatus: Go uses encoding/json on protobuf structs, which reads
the json struct tags (snake_case: percent_free, percent_used,
disk_type), not the protobuf JSON names (camelCase). Revert to
snake_case to match Go's actual output.

* Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first

Match Go's volume_grpc_client_to_master.go behavior:
1. Only trigger leader redirect when the leader address differs from the
   current master (prevents unnecessary reconnect loops when master confirms
   its own address).
2. Process duplicated_uuids before leader redirect check, matching Go's
   ordering where duplicate UUID detection takes priority.

* Remove SetState version check to match Go behavior

Go's SetState unconditionally applies the state without any version
mismatch check. The Rust version had an extra optimistic concurrency
check that would reject valid requests from Go clients that don't
track versions.

* Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL

Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes
to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w,
24h->1d). The Rust version was preserving the original unit, producing
different binary encodings on disk and in heartbeat messages.

* Always return Content-MD5 header and JSON field on successful writes

Go always sets Content-MD5 in the response regardless of whether the
request included it. The Rust version was conditionally including it
only when the request provided Content-MD5.

* Include name and size in UploadResult JSON even when empty/zero

Go's encoding/json always includes empty strings and zero values in
the upload response. The Rust version was using skip_serializing_if
to omit them, causing JSON structure differences.

* Include deleted needles in scan_raw_needles_from to match Go

Go's ScanVolumeFileFrom visits ALL needles including deleted ones.
Skipping deleted entries during incremental copy would cause tombstones
to not be propagated, making deleted files reappear on the receiving side.

* Match Go NeedleMap.Delete: always write tombstone to idx file

Go's NeedleMap.Delete unconditionally writes a tombstone entry to the
idx file and updates metrics, even if the needle doesn't exist or is
already deleted. This is important for replication where every delete
operation must produce an idx write. The Rust version was skipping the
tombstone write for non-existent or already-deleted needles.

* Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest

* Title-case Seaweed-* pair keys to match Go HTTP header canonicalization

* Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType

* Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw()

* Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount

* Include disk-space-low in Volume.is_read_only() matching Go

* Log error on CIDR parse failure in whitelist matching Go's glog.Errorf

* Log cookie mismatch in gRPC Query matching Go's V(0).Infof

* Fix is_expired volume_size comparison to use < matching Go

Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than),
but Rust used `<=`. This meant Rust would fail to expire a volume that
is exactly SUPER_BLOCK_SIZE bytes.

* Apply Go's JWT expiry defaults: 10s write, 60s read

Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and
v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust
defaulted to 0 for both, which meant tokens would never expire when
security.toml has a signing key but omits expires_after_seconds.

* Stop [grpc.volume].ca from overriding [grpc].ca matching Go

Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e.
the [grpc] section. The [grpc.volume] section only provides cert and
key. Rust was also reading ca from [grpc.volume] which would silently
override the [grpc].ca value when both were present.

* Fix free_volume_count to use EC shard count matching Go

Was counting EC volumes instead of EC shards, which underestimates EC
space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1.
Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount.

* Include preallocate in compaction space check matching Go

Go uses max(preallocate, estimatedCompactSize) for the free space check.
Rust was only using the estimated volume size, which could start a
compaction that fails mid-way if preallocate exceeds the volume size.

* Check gzip magic bytes before setting Content-Encoding matching Go

Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent
(data starts with 0x1f 0x8b) before setting Content-Encoding: gzip.
Rust only checked Accept-Encoding, which could incorrectly declare
gzip encoding for non-gzip compressed data.

* Only set upload response name when needle HasName matching Go

Go checks reqNeedle.HasName() before setting ret.Name. Rust always set
the name from the filename variable, which could return the fid portion
of the path as the name for raw PUT requests without a filename.

* Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation

Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount
is 0, treating it as unlimited. Rust was computing effective_free as
<= 0 for max==0, rejecting the location. This could fail volume
creation during early startup before the first heartbeat adjusts max.

* Read lastAppendAtNs from deleted V3 entries in integrity check

Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries
(verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity).
Rust was skipping deleted entries, which could result in a stale
last_append_at_ns if the last index entry is a deletion.

* Return empty body for empty/oversized range requests matching Go

Go's ProcessRangeRequest returns nil (empty body, 200 OK) when
parsed ranges are empty or combined range size exceeds total content
size. The Rust buffered path incorrectly returned the full file data
for both cases. The streaming path already handled this correctly.

* Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL

Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex()
(ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls
vs.store.ScrubEcVolume(). Rust was ignoring the mode and always
running verify_ec_shards. Now INDEX mode checks ecx index integrity
(sorted overlap detection + file size validation) without shard I/O,
while LOCAL/FULL modes run the existing shard verification.

* Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount

Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit
that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week}
which displays as "1w". Both Go and Rust normalize identically.

* Add version mismatch check to SetState matching Go's State.Update

Go's State.Update compares the incoming version with the stored
version and returns "version mismatch" error if they differ. This
provides optimistic concurrency control. The Rust implementation
was accepting any version unconditionally.

* Use unquoted keys in Query JSON output matching Go's json.ToJson

Go's json.ToJson produces records with unquoted keys like
{score:12} not {"score":12}. This is a custom format used
internally by SeaweedFS for query results.

* Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w

Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount
which converts 7 days to 1 week.

* Include ETag header in 304 Not Modified responses matching Go behavior

Go sets ETag on the response writer (via SetEtag) before the
If-Modified-Since and If-None-Match conditional checks, so both
304 response paths include the ETag header. The Rust implementation
was only adding ETag to 200 responses.

* Remove needle-name fallback in chunk manifest filename resolution

Go's tryHandleChunkedFile only falls back from URL filename to
manifest name. Rust had an extra fallback to needle.name that
Go does not perform, which could produce different
Content-Disposition filenames for chunk manifests.

* Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5

Go's jwt.ParseWithClaims validates the nbf claim when present,
rejecting tokens whose nbf is in the future. The Rust jsonwebtoken
crate defaults validate_nbf to false, so tokens with future nbf
were incorrectly accepted.

* Set isHeartbeating to true at startup matching Go's VolumeServer init

Go unconditionally sets isHeartbeating: true in the VolumeServer
struct literal. Rust was starting with false when masters are
configured, causing /healthz to return 503 until the first
heartbeat succeeds.

* Call store.close() on shutdown matching Go's Shutdown()

Go's Shutdown() calls vs.store.Close() which closes all volumes
and flushes file handles. The Rust server was relying on process
exit for cleanup, which could leave data unflushed.

* Include server ID in maintenance mode error matching Go's format

Go returns "volume server %s is in maintenance mode" with the
store ID. Rust was returning a generic "maintenance mode" message.

* Fix DiskType test: use HardDrive variant matching Go's HddType=""

Go maps both "" and "hdd" to HardDriveType (empty string). The
Rust enum variant is HardDrive, not Hdd. The test referenced a
nonexistent Hdd variant causing compilation failure.

* Do not include ETag in 304 responses matching Go's GetOrHeadHandler

Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match
304 return paths, so Go's 304 responses do not include the ETag header.
The Rust code was incorrectly including ETag in both 304 response paths.

* Return 400 on malformed query strings in PostHandler matching Go's ParseForm

Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when
the query string is malformed. Rust was silently falling back to empty
query params via unwrap_or_default().

* Load EC volume version from .vif matching Go's NewEcVolume

Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif
file. Rust was always using Version::current() (V3), which would produce
wrong needle actual size calculations for volumes created with V1 or V2.

* Sync .ecx file before close matching Go's EcVolume.Close

Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion
marks are flushed to disk. Without this, deletion marks written via
MarkNeedleDeleted could be lost on crash.

* Validate SuperBlock extra data size matching Go's Bytes() guard

Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent
corrupt super block headers. Rust was silently truncating via u16 cast,
which would write an incorrect extra_size field.

* Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98

Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable
to unauthenticated remote DoS via panic in QUIC transport parameter
parsing.

* Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server

Go's r.FormValue() cannot read multipart text fields after
r.MultipartReader() consumes the body, so ts/ttl sent as multipart
form fields only work with the Rust volume server. Skip this test
when VOLUME_SERVER_IMPL != "rust" to fix CI failure.

* Flush .ecx in EC volume sync_to_disk matching Go's Sync()

Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index
to disk. The Rust version only flushed .ecj, leaving in-place deletion
marks in .ecx unpersisted until close(). This could cause data
inconsistency if the server crashes after marking a needle deleted in
.ecx but before close().

* Remove .vif file in EC volume destroy matching Go's Destroy()

Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust
version only removed .ecx and .ecj, leaving orphaned .vif files on
disk after EC volume destruction (e.g., after TTL expiry).

* Fix is_expired to use <= for SuperBlockSize check matching Go

Go checks contentSize <= SuperBlockSize to detect empty volumes (no
needles). Rust used < which would incorrectly allow a volume with
exactly SuperBlockSize bytes (header only, no data) to proceed to
the TTL expiry check and potentially be marked as expired.

* Fix read_append_at_ns to read timestamps from tombstone entries

Go reads the full needle body for all entries including tombstones
(deleted needles with size=0) to extract the actual AppendAtNs
timestamp. The Rust version returned 0 early for size <= 0 entries,
which would cause the binary search in incremental copy to produce
incorrect results for positions containing deleted needles.

Now uses get_actual_size to compute the on-disk size (which handles
tombstones correctly) and only returns 0 when the actual size is 0.

* Add X-Request-Id response header matching Go's requestIDMiddleware

Go sets both X-Request-Id and x-amz-request-id response headers.
The Rust server only set x-amz-request-id, missing X-Request-Id.

* Add skip_serializing_if for UploadResult name and size fields

Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty",
omitting these fields from JSON when they are zero values (empty
string / 0). The Rust struct always serialized them, producing
"name":"" and "size":0 where Go would omit them.

* Support JSONP/pretty-print for write success responses

Go's writeJsonQuiet checks for callback (JSONP) and pretty query
parameters on all JSON responses including write success. The Rust
write success path used axum::Json directly, bypassing JSONP and
pretty-print support. Now uses json_result_with_query to match Go.

* Include actual limit in file size limit error message

Go returns "file over the limited %d bytes" with the actual limit
value included. Rust returned a generic "file size limit exceeded"
without the limit value, making it harder to debug.

* Extract extension from 2-segment URL paths for image operations

Go's parseURLPath extracts the file extension from all URL formats
including 2-segment paths like /vid,fid.jpg. The Rust version only
handled 3-segment paths (/vid/fid/filename.ext), so extensions in
2-segment paths were lost. This caused image resize/crop operations
requested via query params to be silently skipped for those paths.

* Add size_hint to TrackedBody so throttled downloads get Content-Length

TrackedBody (used for download throttling) did not implement
size_hint(), causing HTTP/1.1 to fall back to chunked transfer
encoding instead of setting Content-Length. Go always sets
Content-Length explicitly for non-range responses.

* Add Last-Modified, pairs, and S3 headers to chunk manifest responses

Go sets Last-Modified, needle pairs, and S3 pass-through headers on
the response writer BEFORE calling tryHandleChunkedFile. Since the
Rust chunk manifest handler created fresh response headers and
returned early, these headers were missing from chunk manifest
responses. Now passes last_modified_str into the chunk manifest
handler and applies pairs and S3 pass-through query params
(response-cache-control, response-content-encoding, etc.) to the
chunk manifest response headers.

* Fix multipart fallback to use first part data when no filename

Go reads the first part's data unconditionally, then looks for a
part with a filename. If none found, Go uses the first part's data
(with empty filename). Rust only captured parts with filenames, so
when no part had a filename it fell back to the raw multipart body
bytes (including boundary delimiters), producing corrupt needle data.

* Set HasName and HasMime flags for empty values matching Go

Go's CreateNeedleFromRequest sets HasName and HasMime flags even when
the filename or MIME type is empty (len < 256 is true for len 0).
Rust skipped empty values, causing the on-disk needle format to
differ: Go-written needles include extra bytes for the empty name/mime
size fields, changing the serialized needle size in the idx entry.
This ensures binary format compatibility between Go and Rust servers.

* Add is_stopping guard to vacuum_volume_commit matching Go

Go's CommitCompactVolume (store_vacuum.go L53-54) checks
s.isStopping before committing compaction to prevent file
swaps during shutdown. The Rust handler was missing this
check, which could allow compaction commits while the
server is stopping.

* Remove disk_type from required status fields since Go omits it

Go's default DiskType is "" (HardDriveType), and protobuf's omitempty
tag causes empty strings to be dropped from JSON output.

* test: honor rust env in dual volume harness

* grpc: notify master after volume lifecycle changes

* http: proxy to replicas before download-limit timeout

* test: pass readMode to rust volume harnesses

* fix store free-location predicate selection

* fix volume copy disk placement and heartbeat notification

* fix chunk manifest delete replication

* fix write replication to survive client disconnects

* fix download limit proxy and wait flow

* fix crop gating for streamed reads

* fix upload limit wait counter behavior

* fix chunk manifest image transforms

* fix has_resize_ops to check width/height > 0 instead of is_some()

Go's shouldResizeImages condition is `width > 0 || height > 0`, so
`?width=0` correctly evaluates to false. Rust was using `is_some()`
which made `?width=0` evaluate to true, unnecessarily disabling
streaming reads for those requests.

* fix Content-MD5 to only compute and return when provided by client

Go only computes the MD5 of uncompressed data when a Content-MD5
header or multipart field is provided. Rust was always computing and
returning it. Also fix the mismatch error message to include size,
matching Go's format.

* fix save_vif to compute ExpireAtSec from TTL

Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds
when the volume has a TTL. The save_vif path (used by set_read_only
and set_writable) was missing this computation, causing .vif files
to be written without the correct expiration timestamp for TTL volumes.

* fix set_writable to not modify no_write_can_delete

Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists.
Rust was additionally setting no_write_can_delete=has_remote_file,
which could incorrectly change the write mode for remote-file volumes
when the master explicitly asks to make the volume writable.

* fix write_needle_blob_and_index to error on too-small V3 blob

Go returns an error when the needle blob is too small for timestamp
patching. Rust was silently skipping the patch and writing the blob
with a stale/zero timestamp, which could cause data integrity issues
during incremental replication that relies on AppendAtNs ordering.

* fix VolumeEcShardsToVolume to validate dataShards range

Go validates that dataShards is > 0 and <= MaxShardCount before
proceeding with EC-to-volume reconstruction. Without this check,
a zero or excessively large data_shards value could cause confusing
downstream failures.

* fix destroy to use VolumeError::NotEmpty instead of generic Io error

The dedicated NotEmpty variant exists in the enum but was not being
used. This makes error matching consistent with Go's ErrVolumeNotEmpty.

* fix SetState to persist state to disk with rollback on failure

Go's State.Update saves VolumeServerState to a state.pb file after
each SetState call, and rolls back the in-memory state if persistence
fails. Rust was only updating in-memory atomics, so maintenance mode
would be lost on server restart. Now saves protobuf-encoded state.pb
and loads it on startup.

* fix VolumeTierMoveDatToRemote to close local dat backend after upload

Go calls v.LoadRemoteFile() after saving volume info, which closes
the local DataBackend before transitioning to remote storage. Without
this, the volume holds a stale file handle to the deleted local .dat
file, causing reads to fail until server restart.

* fix VolumeTierMoveDatFromRemote to close remote dat backend after download

Go calls v.DataBackend.Close() and sets DataBackend=nil after removing
the remote file reference. Without this, the stale remote backend
state lingers and reads may not discover the newly downloaded local
.dat file until server restart.

* fix redirect to use internal url instead of public_url

Go's proxyReqToTargetServer builds the redirect Location header from
loc.Url (the internal URL), not publicUrl. Using public_url could
cause redirect failures when internal and external URLs differ.

* fix redirect test and add state_file_path to integration test

Update redirect unit test to expect internal url (matching the
previous fix). Add missing state_file_path field to the integration
test VolumeServerState constructor.

* fix FetchAndWriteNeedle to await all writes before checking errors

Go uses a WaitGroup to await all writes (local + replicas) before
checking errors. Rust was short-circuiting on local write failure,
which could leave replica writes in-flight without waiting for
completion.

* fix shutdown to send deregister heartbeat before pre_stop delay

Go's StopHeartbeat() closes stopChan immediately on interrupt, causing
the heartbeat goroutine to send the deregister heartbeat right away,
before the preStopSeconds delay. Rust was only setting is_stopping=true
without waking the heartbeat loop, so the deregister was delayed until
after the pre_stop sleep. Now we call volume_state_notify.notify_one()
to wake the heartbeat immediately.

* fix heartbeat response ordering to check duplicate UUIDs first

Go processes heartbeat responses in this order: DuplicatedUuids first,
then volume options (prealloc/size limit), then leader redirect. Rust
was applying volume options before checking for duplicate UUIDs, which
meant volume option changes would take effect even when the response
contained a duplicate UUID error that should cause an immediate return.

* the test thread was blocked

* fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories

Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which
fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode
bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check
logic error).

* fix: match Go Content-MD5 mismatch error message format

Go uses "Content-MD5 did not match md5 of file data expected [X]
received [Y] size Z" while Rust had a shorter format. Match the
exact Go error string so clients see identical messages.

* fix: match Go Bearer token length check (> 7, not >= 7)

Go requires len(bearer) > 7 ensuring at least one char after
"Bearer ". Rust used >= 7 which would accept an empty token.

* fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4

aws-sdk-s3's default "rustls" feature enables tls-rustls in
aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21
→ rustls-webpki 0.101.7, moderate CRL advisory). Replace with
explicit default-https-client which uses only rustls 0.23 /
rustls-webpki 0.103.9.

* fix: use uploaded filename for auto-compression extension detection

Go extracts the file extension from pu.FileName (the uploaded
filename) for auto-compression decisions. Rust was using the URL
path, which typically has no extension for SeaweedFS file IDs.

* fix: add CRC legacy Value() backward-compat check on needle read

Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) !=
crc.Value(). The Value() path is a deprecated transform for compat
with seaweed versions prior to commit 056c480eb. Rust had the
legacy_value() method but wasn't using it in validation.

* fix: remove /stats/* endpoints to match Go (commented out since L130)

Go's volume_server.go has the /stats/counter, /stats/memory, and
/stats/disk endpoints commented out (lines 130-134). Remove them
from the Rust router along with the now-unused whitelist_guard
middleware.

* fix: filter application/octet-stream MIME for chunk manifests

Go's tryHandleChunkedFile (L334) filters out application/octet-stream
from chunk manifest MIME types, falling back to extension-based
detection. Rust was returning the stored MIME as-is for manifests.

* fix: VolumeMarkWritable returns error before notifying master

Go returns early at L200 if MarkVolumeWritable fails, before
reaching the master notification at L206. Rust was notifying master
even on failure, creating inconsistent state where master thinks
the volume is writable but local marking failed.

* fix: check volume existence before maintenance in MarkReadonly/Writable

Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255)
look up the volume first, then call makeVolumeReadonly/Writable which
checks maintenance. Rust was checking maintenance first, returning
"maintenance mode" instead of "not found" for missing volumes.

* feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360)

Add the mark_broken_volumes_readonly flag from PR #8360:
- Sync proto field (tag 3) to local volume_server.proto
- After scrubbing, if flag is set, call makeVolumeReadonly on each
  broken volume (notify master, mark local readonly, notify again)
- Collect errors via joined error semantics matching Go's errors.Join
- Factor out make_volume_readonly helper reused by both
  VolumeMarkReadonly and ScrubVolume

Also refactors VolumeMarkReadonly to use the shared helper.

* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4)

CRL Distribution Point matching logic fix for moderate severity
advisory about CRLs not considered authoritative.

* test: update integration tests for removed /stats/* endpoints

Replace tests that expected /stats/* routes to return 200/401 with
tests confirming they now fall through to the store handler (400),
matching Go's commented-out stats endpoints.

* docs: fix misleading comment about default offset feature

The comment said "4-byte offsets unless explicitly built with 5-byte
support" but the default feature enables 5bytes. This is intentional
for production parity with Go -tags 5BytesOffset builds. Fix the
comment to match reality.
2026-03-26 17:24:35 -07:00
Chris Lu
17028fbf59 fix: serialize SSE-KMS metadata when bucket default encryption applies KMS (#8780)
* fix: serialize SSE-KMS metadata when bucket default encryption applies KMS

When a bucket has default SSE-KMS encryption enabled and a file is uploaded
without explicit SSE headers, the encryption was applied correctly but the
SSE-KMS metadata (x-seaweedfs-sse-kms-key) was not serialized. This caused
downloads to fail with "empty SSE-KMS metadata" because the entry's Extended
map stored an empty byte slice.

The existing code already handled this for SSE-S3 bucket defaults
(SerializeSSES3Metadata) but was missing the equivalent call to
SerializeSSEKMSMetadata for the KMS path.

Fixes seaweedfs/seaweedfs#8776

* ci: add KMS integration tests to GitHub Actions

Add a kms-tests.yml workflow that runs on changes to KMS/SSE code with
two jobs:

1. KMS provider tests: starts OpenBao via Docker, runs Go integration
   tests in test/kms/ against a real KMS backend
2. S3 KMS e2e tests: starts OpenBao + weed mini built from source, runs
   test_s3_kms.sh which covers bucket-default SSE-KMS upload/download
   (the exact scenario from #8776)

Supporting changes:
- test/kms/Makefile: add CI targets (test-provider-ci, test-s3-kms-ci)
  that manage OpenBao via plain Docker and run weed from source
- test/kms/s3-config-openbao-template.json: S3 config template with
  OpenBao KMS provider for weed mini

* refactor: combine SSE-S3 and SSE-KMS metadata serialization into else-if

SSE-S3 and SSE-KMS bucket default encryption are mutually exclusive, so
use a single if/else-if block instead of two independent if blocks.

* Update .github/workflows/kms-tests.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix(ci): start weed mini from data dir to avoid Docker filer.toml

weed mini reads filer.toml from the current working directory first.
When running from test/kms/, it picked up the Docker-targeted filer.toml
which has dir="/data/filerdb" (a path that doesn't exist in CI), causing
a fatal crash at filer store initialization.

Fix by cd-ing to the data directory before starting weed mini.
Also improve log visibility on failure.

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2026-03-26 14:07:01 -07:00
dependabot[bot]
8558c586a0 build(deps): bump io.netty:netty-codec-http from 4.1.129.Final to 4.1.132.Final in /test/java/spark (#8784)
build(deps): bump io.netty:netty-codec-http in /test/java/spark

Bumps [io.netty:netty-codec-http](https://github.com/netty/netty) from 4.1.129.Final to 4.1.132.Final.
- [Release notes](https://github.com/netty/netty/releases)
- [Commits](https://github.com/netty/netty/compare/netty-4.1.129.Final...netty-4.1.132.Final)

---
updated-dependencies:
- dependency-name: io.netty:netty-codec-http
  dependency-version: 4.1.132.Final
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 13:00:03 -07:00
Chris Lu
aa12b51cbf test: restore coverage removed in PR #8360 (#8779)
* test: restore maintenance mode coverage in TestVolumeMarkReadonlyWritableErrorPaths

PR #8360 removed the maintenance mode assertions because the refactored
check ordering (volume lookup before maintenance check) caused the
original test to hit "not found" instead of "maintenance mode" — the
test used a non-existent volume ID.

Restore coverage by allocating a real volume, then verifying:
- existing volume in maintenance mode returns "maintenance mode"
- non-existent volume in maintenance mode still returns "not found"
  (validating the new check ordering)

* test: add coverage for ScrubVolume MarkBrokenVolumesReadonly flag

PR #8360 added the mark_broken_volumes_readonly field to ScrubVolumeRequest
but no tests exercised the new logic paths. Add three integration tests:

- HealthyVolume: flag is a no-op when scrub finds no broken volumes
- CorruptVolume: corrupted .idx triggers broken detection; without the
  flag the volume stays writable, with the flag it becomes read-only
- MaintenanceMode: makeVolumeReadonly fails under maintenance and
  ScrubVolume propagates the error via errors.Join

* refactor: extract CorruptIndexFile and EnableMaintenanceMode test helpers

Move duplicated idx corruption and maintenance mode setup into
framework.CorruptIndexFile() and framework.EnableMaintenanceMode()
helpers. Use defer for file close in the corruption helper.
2026-03-26 10:52:37 -07:00
Lisandro Pin
e5cf2d2a19 Give the ScrubVolume() RPC an option to flag found broken volumes as read-only. (#8360)
* Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only.

Also exposes this option in the shell `volume.scrub` command.

* Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`.

417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`,
so calling them for invalid volume IDs will actually yield that error, instead of checking
maintnenance mode first.
2026-03-26 10:20:57 -07:00
Chris Lu
94bfa2b340 mount: stream all filer mutations over single ordered gRPC stream (#8770)
* filer: add StreamMutateEntry bidi streaming RPC

Add a bidirectional streaming RPC that carries all filer mutation
types (create, update, delete, rename) over a single ordered stream.
This eliminates per-request connection overhead for pipelined
operations and guarantees mutation ordering within a stream.

The server handler delegates each request to the existing unary
handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy
stream adapter for rename operations to reuse StreamRenameEntry logic.

The is_last field signals completion for multi-response operations
(rename sends multiple events per request; create/update/delete
always send exactly one response with is_last=true).

* mount: add streaming mutation multiplexer (streamMutateMux)

Implement a client-side multiplexer that routes all filer mutation RPCs
(create, update, delete, rename) over a single bidirectional gRPC
stream. Multiple goroutines submit requests through a send channel;
a dedicated sendLoop serializes them on the stream; a recvLoop
dispatches responses to waiting callers via per-request channels.

Key features:
- Lazy stream opening on first use
- Automatic reconnection on stream failure
- Permanent fallback to unary RPCs if filer returns Unimplemented
- Monotonic request_id for response correlation
- Multi-response support for rename operations (is_last signaling)

The mux is initialized on WFS and closed during unmount cleanup.
No call sites use it yet — wiring comes in subsequent commits.

* mount: route CreateEntry and UpdateEntry through streaming mux

Wire all CreateEntry call sites to use wfs.streamCreateEntry() which
routes through the StreamMutateEntry stream when available, falling
back to unary RPCs otherwise. Also wire Link's UpdateEntry calls
through wfs.streamUpdateEntry().

Updated call sites:
- flushMetadataToFiler (file flush after write)
- Mkdir (directory creation)
- Symlink (symbolic link creation)
- createRegularFile non-deferred path (Mknod)
- flushFileMetadata (periodic metadata flush)
- Link (hard link: update source + create link + rollback)

* mount: route UpdateEntry and DeleteEntry through streaming mux

Wire remaining mutation call sites through the streaming mux:

- saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry
- Unlink → streamDeleteEntry (replaces RemoveWithResponse)
- Rmdir → streamDeleteEntry (replaces RemoveWithResponse)

All filer mutations except Rename now go through StreamMutateEntry
when the filer supports it, with automatic unary RPC fallback.

* mount: route Rename through streaming mux

Wire Rename to use streamMutate.Rename() when available, with
fallback to the existing StreamRenameEntry unary stream.

The streaming mux sends rename as a StreamRenameEntryRequest oneof
variant. The server processes it through the existing rename logic
and sends multiple StreamRenameEntryResponse events (one per moved
entry), with is_last=true on the final response.

All filer mutations now go through a single ordered stream.

* mount: fix stream mux connection ownership

WithGrpcClient(streamingMode=true) closes the gRPC connection when
the callback returns, destroying the stream. Own the connection
directly via pb.GrpcDial so it stays alive for the stream's lifetime.
Close it explicitly in recvLoop on stream failure and in Close on
shutdown.

* mount: fix rename failure for deferred-create files

Three fixes for rename operations over the streaming mux:

1. lookupEntry: fall back to local metadata store when filer returns
   "not found" for entries in uncached directories. Files created with
   deferFilerCreate=true exist only in the local leveldb store until
   flushed; lookupEntry skipped the local store when the parent
   directory had never been readdir'd, causing rename to fail with
   ENOENT.

2. Rename: wait for pending async flushes and force synchronous flush
   of dirty metadata before sending rename to the filer. Covers the
   writebackCache case where close() defers the flush to a background
   worker that may not complete before rename fires.

3. StreamMutateEntry: propagate rename errors from server to client.
   Add error/errno fields to StreamMutateEntryResponse so the mount
   can map filer errors to correct FUSE status codes instead of
   silently returning OK. Also fix the existing Rename error handler
   which could return fuse.OK on unrecognized errors.

* mount: fix streaming mux error handling, sendLoop lifecycle, and fallback

Address PR review comments:

1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for
   create/update/delete errors, not just rename. Previously update errors
   were silently dropped and create/delete errors were only in nested
   response fields that the client didn't check.

2. Client: check nested error fields in CreateEntry (ErrorCode, Error)
   and DeleteEntry (Error) responses, matching CreateEntryWithResponse
   behavior.

3. Fix sendLoop lifecycle: give each stream generation a stopSend channel.
   recvLoop closes it on error to stop the paired sendLoop. Previously a
   reconnect left the old sendLoop draining sendCh, breaking ordering.

4. Transparent fallback: stream helpers and doRename fall back to unary
   RPCs on transport errors (ErrStreamTransport), including the first
   Unimplemented from ensureStream. Previously the first call failed
   instead of degrading.

5. Filer rotation in openStream: try all filer addresses on dial failure,
   matching WithFilerClient behavior. Stop early on Unimplemented.

6. Pass metadata-bearing context to StreamMutateEntry RPC call so
   sw-client-id header is actually sent.

7. Gate lookupEntry local-cache fallback on open dirty handle or pending
   async flush to avoid resurrecting deleted/renamed entries.

8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil).

9. Use string matching for rename error-to-errno mapping in the mount to
   stay portable across Linux/macOS (numeric errno values differ).

* mount: make failAllPending idempotent with delete-before-close

Change failAllPending to collect pending entries into a local slice
(deleting from the sync.Map first) before closing channels. This
prevents double-close panics if called concurrently. Also remove the
unused err parameter.

* mount: add stream generation tracking and teardownStream

Introduce a generation counter on streamMutateMux that increments each
time a new stream is created. Requests carry the generation they were
enqueued for so sendLoop can reject stale requests after reconnect.

Add teardownStream(gen) which is idempotent (only acts when gen matches
current generation and stream is non-nil). Both sendLoop and recvLoop
call it on error, replacing the inline cleanup in recvLoop. sendLoop now
actively triggers teardown on send errors instead of silently exiting.

ensureStream waits for the prior generation's recvDone before creating a
new stream, ensuring all old pending waiters are failed before reconnect.
recvLoop now takes the stream, generation, and recvDone channel as
parameters to avoid accessing shared fields without the lock.

* mount: harden Close to prevent races with teardownStream

Nil out stream, cancel, and grpcConn under the lock so that any
concurrent teardownStream call from recvLoop/sendLoop becomes a no-op.
Call failAllPending before closing sendCh to unblock waiters promptly.
Guard recvDone with a nil check for the case where Close is called
before any stream was ever opened.

* mount: make errCh receive ctx-aware in doUnary and Rename

Replace the blocking <-sendReq.errCh with a select that also observes
ctx.Done(). If sendLoop exits via stopSend without consuming a buffered
request, the caller now returns ctx.Err() instead of blocking forever.
The buffered errCh (capacity 1) ensures late acknowledgements from
sendLoop don't block the sender.

* mount: fix sendLoop/Close race and recvLoop/teardown pending channel race

Three related fixes:

1. Stop closing sendCh in Close(). Closing the shared producer channel
   races with callers who passed ensureStream() but haven't sent yet,
   causing send-on-closed-channel panics. sendCh is now left open;
   ensureStream checks m.closed to reject new callers.

2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh()
   on exit so buffered requests get an ErrStreamTransport on their errCh
   instead of blocking forever. Close() drains again for any stragglers
   enqueued between sendLoop's drain and the final shutdown.

3. Move failAllPending from teardownStream into recvLoop's defer.
   teardownStream (called from sendLoop on send error) was closing
   pending response channels while recvLoop could be between
   pending.Load and the channel send — a send-on-closed-channel panic.
   recvLoop is now the sole closer of pending channels, eliminating
   the race. Close() waits on recvDone (with cancel() to guarantee
   Recv unblocks) so pending cleanup always completes.

* filer/mount: add debug logging for hardlink lifecycle

Add V(0) logging at every point where a HardLinkId is created, stored,
read, or deleted to trace orphaned hardlink references. Logging covers:

- gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId
- FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId
- handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count
- setHardLink: KvPut with blob size
- maybeReadHardLink: V(1) on read attempt and successful decode
- DeleteHardLink: counter decrement/deletion events
- Mount Link(): when NewHardLinkId is generated and link is created

This helps diagnose how a git pack .rev file ended up with a
HardLinkId during a clone (no hard links should be involved).

* test: add git clone/pull integration test for FUSE mount

Shell script that exercises git operations on a SeaweedFS mount:
1. Creates a bare repo on the mount
2. Clones locally, makes 3 commits, pushes to mount
3. Clones from mount bare repo into an on-mount working dir
4. Verifies clone integrity (files, content, commit hashes)
5. Pushes 2 more commits with renames and deletes
6. Checks out an older revision on the mount clone
7. Returns to branch and pulls with real changes
8. Verifies file content, renames, deletes after pull
9. Checks git log integrity and clean status

27 assertions covering file existence, content, commit hashes,
file counts, renames, deletes, and git status. Run against any
existing mount: bash test-git-on-mount.sh /path/to/mount

* test: add git clone/pull FUSE integration test to CI suite

Add TestGitOperations to the existing fuse_integration test framework.
The test exercises git's full file operation surface on the mount:

  1. Creates a bare repo on the mount (acts as remote)
  2. Clones locally, makes 3 commits (files, bulk data, renames), pushes
  3. Clones from mount bare repo into an on-mount working dir
  4. Verifies clone integrity (content, commit hash, file count)
  5. Pushes 2 more commits with new files, renames, and deletes
  6. Checks out an older revision on the mount clone
  7. Returns to branch and pulls with real fast-forward changes
  8. Verifies post-pull state: content, renames, deletes, file counts
  9. Checks git log integrity (5 commits) and clean status

Runs automatically in the existing fuse-integration.yml CI workflow.

* mount: fix permission check with uid/gid mapping

The permission checks in createRegularFile() and Access() compared
the caller's local uid/gid against the entry's filer-side uid/gid
without applying the uid/gid mapper. With -map.uid 501:0, a directory
created as uid 0 on the filer would not match the local caller uid
501, causing hasAccess() to fall through to "other" permission bits
and reject write access (0755 → other has r-x, no w).

Fix: map entry uid/gid from filer-space to local-space before the
hasAccess() call so both sides are in the same namespace.

This fixes rsync -a failing with "Permission denied" on mkstempat
when using uid/gid mapping.

* mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel

Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to
restore local uid/gid, but `outputPbEntry` writes the kernel response
before the function returns — so the kernel received filer-side
uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent
child operations (mkdir, create) because the caller uid (501) doesn't
match the directory owner (0), and "other" bits (0755 → r-x) lack
write permission.

Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal
before outputPbEntry, so the kernel gets local uid/gid.

Also add nil guards for UidGidMapper in Access and createRegularFile
to prevent panics in tests that don't configure a mapper.

This fixes rsync -a "Permission denied" on mkpathat for nested
directories when using uid/gid mapping.

* mount: fix Link outputting filer-side uid/gid to kernel, add nil guards

Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink:
the kernel received filer-side uid/gid because the defer hadn't run
yet when outputPbEntry wrote the response.

Also add nil guards for UidGidMapper in Access and createRegularFile
so tests without a mapper don't panic.

Audit of all outputPbEntry/outputFilerEntry call sites:
- Mkdir: fixed in prior commit (explicit map before output)
- Symlink: fixed in prior commit (explicit map before output)
- Link: fixed here (explicit map before output)
- Create (existing file): entry from maybeLoadEntry (already mapped)
- Create (deferred): entry has local uid/gid (never mapped to filer)
- Create (non-deferred): createRegularFile defer runs before return
- Mknod: createRegularFile defer runs before return
- Lookup: entry from lookupEntry (already mapped)
- GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped)
- readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped)
- saveEntry: no kernel output
- flushMetadataToFiler: no kernel output
- flushFileMetadata: no kernel output

* test: fix git test for same-filesystem FUSE clone

When both the bare repo and working clone live on the same FUSE mount,
git's local transport uses hardlinks and cross-repo stat calls that
fail on FUSE. Fix:

- Use --no-local on clone to disable local transport optimizations
- Use reset --hard instead of checkout to stay on branch
- Use fetch + reset --hard origin/<branch> instead of git pull to
  avoid local transport stat failures during fetch

* adjust logging

* test: use plain git clone/pull to exercise real FUSE behavior

Remove --no-local and fetch+reset workarounds. The test should use
the same git commands users run (clone, reset --hard, pull) so it
reveals real FUSE issues rather than hiding them.

* test: enable V(1) logging for filer/mount and collect logs on failure

- Run filer and mount with -v=1 so hardlink lifecycle logs (V(0):
  create/delete/insert, V(1): read attempts) are captured
- On test failure, automatically dump last 16KB of all process logs
  (master, volume, filer, mount) to test output
- Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload
- Update CI workflow to upload SeaweedFS process logs alongside test output

* mount: clone entry for filer flush to prevent uid/gid race

flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which
returns the file handle's live proto entry pointer, then mutated it
in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a
concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable)
could observe filer-side uid/gid (e.g., 0:0) on the file handle entry
and return it to the kernel. The kernel caches these attributes, so
subsequent opens by the local user (uid 501) fail with EACCES.

Fix: proto.Clone the entry before mapping uid/gid for the filer request.
The file handle's live entry is never mutated, so concurrent Lookup
always sees local uid/gid.

This fixes the intermittent "Permission denied" on .git/FETCH_HEAD
after the first git pull on a mount with uid/gid mapping.

* mount: add debug logging for stale lock file investigation

Add V(0) logging to trace the HEAD.lock recreation issue:
- Create: log when O_EXCL fails (file already exists) with uid/gid/mode
- completeAsyncFlush: log resolved path, saved path, dirtyMetadata,
  isDeleted at entry to trace whether async flush fires after rename
- flushMetadataToFiler: log the dir/name/fullpath being flushed

This will show whether the async flush is recreating the lock file
after git renames HEAD.lock → HEAD.

* mount: prevent async flush from recreating renamed .lock files

When git renames HEAD.lock → HEAD, the async flush from the prior
close() can run AFTER the rename and re-insert HEAD.lock into the
meta cache via its CreateEntryRequest response event. The next git
pull then sees HEAD.lock and fails with "File exists".

Fix: add isRenamed flag on FileHandle, set by Rename before waiting
for the pending async flush. The async flush checks this flag and
skips the metadata flush for renamed files (same pattern as isDeleted
for unlinked files). The data pages still flush normally.

The Rename handler flushes deferred metadata synchronously (Case 1)
before setting isRenamed, ensuring the entry exists on the filer for
the rename to proceed. For already-released handles (Case 2), the
entry was created by a prior flush.

* mount: also mark renamed inodes via entry.Attributes.Inode fallback

When GetInode fails (Forget already removed the inode mapping), the
Rename handler couldn't find the pending async flush to set isRenamed.
The async flush then recreated the .lock file on the filer.

Fix: fall back to oldEntry.Attributes.Inode to find the pending async
flush when the inode-to-path mapping is gone. Also extract
MarkInodeRenamed into a method on FileHandleToInode for clarity.

* mount: skip async metadata flush when saved path no longer maps to inode

The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock
because neither GetInode nor oldEntry.Attributes.Inode could find the
inode (Forget already evicted the mapping, and the entry's stored
inode was 0).

Add a direct check in completeAsyncFlush: before flushing metadata,
verify that the saved path still maps to this inode in the inode-to-path
table. If the path was renamed or removed (inode mismatch or not found),
skip the metadata flush to avoid recreating a stale entry.

This catches all rename cases regardless of whether the Rename handler
could set the isRenamed flag.

* mount: wait for pending async flush in Unlink before filer delete

Unlink was deleting the filer entry first, then marking the draining
async-flush handle as deleted. The async flush worker could race between
these two operations and recreate the just-unlinked entry on the filer.
This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to
persist after git pull, breaking subsequent git operations.

Move the isDeleted marking and add waitForPendingAsyncFlush() before
the filer delete so any in-flight flush completes first. Even if the
worker raced past the isDeleted check, the wait ensures it finishes
before the filer delete cleans up any recreated entry.

* mount: reduce async flush and metadata flush log verbosity

Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and
flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for
every file close with writebackCache and are too noisy for normal use.

* filer: reduce hardlink debug log verbosity from V(0) to V(4)

HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and
filer_grpc_server fire on every hardlinked file operation (git pack
files use hardlinks extensively) and produce excessive noise.

* mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check

- weedfs_link.go: hardlink creation logs V(0) → V(4)
- weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1)
- empty_folder_cleaner.go: "not empty" check log V(0) → V(4)

* filer: handle missing hardlink KV as expected, not error

A "kv: not found" on hardlink read is normal when the link blob was
already cleaned up but a stale entry still references it. Log at V(1)
for not-found; keep Error level for actual KV failures.

* test: add waitForDir before git pull in FUSE git operations test

After git reset --hard, the FUSE mount's metadata cache may need a
moment to settle on slow CI. The git pull subprocess (unpack-objects)
could fail to stat the working directory. Poll for up to 5s.

* Update git_operations_test.go

* wait

* test: simplify FUSE test framework to use weed mini

Replace the 4-process setup (master + volume + filer + mount) with
2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies
startup, reduces port allocation, and is faster on CI.

* test: fix mini flag -admin → -admin.ui
2026-03-25 20:06:34 -07:00
Chris Lu
805625d06e Add FUSE integration tests for POSIX file locking (#8752)
* Add FUSE integration tests for POSIX file locking

Test flock() and fcntl() advisory locks over the FUSE mount:
- Exclusive and shared flock with conflict detection
- flock upgrade (shared to exclusive) and release on close
- fcntl F_SETLK write lock conflicts and shared read locks
- fcntl F_GETLK conflict reporting on overlapping byte ranges
- Non-overlapping byte-range locks held independently
- F_SETLKW blocking until conflicting lock is released
- Lock release on file descriptor close
- Concurrent lock contention with multiple workers

* Fix review feedback in POSIX lock integration tests

- Assert specific EAGAIN error on fcntl lock conflicts instead of generic Error
- Use O_APPEND in concurrent contention test so workers append rather than overwrite
- Verify exact line count (numWorkers * writesPerWorker) after concurrent test
- Check unlock error in F_SETLKW blocking test goroutine

* Refactor fcntl tests to use subprocesses for inter-process semantics

POSIX fcntl locks use the process's files_struct as lock owner, so all
fds in the same process share the same owner and never conflict. This
caused the fcntl tests to silently pass without exercising lock conflicts.

Changes:
- Add TestFcntlLockHelper subprocess entry point with hold/try/getlk actions
- Add lockHolder with channel-based coordination (no scanner race)
- Rewrite all fcntl tests to run contenders in separate subprocesses
- Fix F_UNLCK int16 cast in GetLk assertion for type-safe comparison
- Fix concurrent test: use non-blocking flock with retry to avoid
  exhausting go-fuse server reader goroutines (blocking FUSE SETLKW
  can starve unlock request processing, causing deadlock)

flock tests remain same-process since flock uses per-struct-file owners.

* Fix misleading comment and error handling in lock test subprocess

- Fix comment: tryLockInSubprocess tests a subprocess, not the test process
- Distinguish EAGAIN/EACCES from unexpected errors in subprocess try mode
  so real failures aren't silently masked as lock conflicts

* Fix CI race in FcntlReleaseOnClose and increase flock retry budget

- FcntlReleaseOnClose: retry lock acquisition after subprocess exits
  since the FUSE server may not process Release immediately
- ConcurrentLockContention: increase retry limit from 500 to 3000
  (5s → 30s budget) to handle CI load

* separating flock and fcntl in the in-memory lock table and cleaning them up through the right release path: PID for POSIX locks, lock owner for flock

* ReleasePosixOwner

* weed/mount: flush before releasing posix close owner

* weed/mount: keep woken lock waiters from losing inode state

* test/fuse: make blocking fcntl helper state explicit

* test/fuse: assert flock contention never overlaps

* test/fuse: stabilize concurrent lock contention check

* test/fuse: make concurrent contention writes deterministic

* weed/mount: retry synchronous metadata flushes
2026-03-24 11:43:25 -07:00
Chris Lu
2844c70ecf fix tests 2026-03-23 19:36:14 -07:00
Chris Lu
e5f72077ee fix: resolve CORS cache race condition causing stale 404 responses (#8748)
The metadata subscription handler (updateBucketConfigCacheFromEntry) was
making a separate RPC call via loadCORSFromBucketContent to load CORS
configuration. This created a race window where a slow CreateBucket
subscription event could re-cache stale data after PutBucketCors had
already cleared the cache, causing subsequent GetBucketCors to return
404 NoSuchCORSConfiguration.

Parse CORS directly from the subscription entry's Content field instead
of making a separate RPC. Also fix getBucketConfig to parse CORS from
the already-fetched entry, eliminating a redundant RPC call.

Fix TestCORSCaching to use require.NoError to prevent nil pointer
dereference panics when GetBucketCors fails.
2026-03-23 19:33:20 -07:00
Chris Lu
6bf654c25c fix: keep metadata subscriptions progressing (#8730) (#8746)
* fix: keep metadata subscriptions progressing (#8730)

* test: cancel slow metadata writers with parent context

* filer: ignore missing persisted log chunks
2026-03-23 15:26:54 -07:00
Chris Lu
d5ee35c8df Fix S3 delete for non-empty directory markers (#8740)
* Fix S3 delete for non-empty directory markers

* Address review feedback on directory marker deletes

* Stabilize FUSE concurrent directory operations
2026-03-23 13:35:16 -07:00
Chris Lu
9434d3733d mount: async flush on close() when writebackCache is enabled (#8727)
* mount: async flush on close() when writebackCache is enabled

When -writebackCache is enabled, defer data upload and metadata flush
from Flush() (triggered by close()) to a background goroutine in
Release(). This allows processes like rsync that write many small files
to proceed to the next file immediately instead of blocking on two
network round-trips (volume upload + filer metadata) per file.

Fixes #8718

* mount: add retry with backoff for async metadata flush

The metadata flush in completeAsyncFlush now retries up to 3 times
with exponential backoff (1s, 2s, 4s) on transient gRPC errors.
Since the chunk data is already safely on volume servers at this point,
only the filer metadata reference needs persisting — retrying is both
safe and effective.

Data flush (FlushData) is not retried externally because
UploadWithRetry already handles transient HTTP/gRPC errors internally;
if it still fails, the chunk memory has been freed.

* test: add integration tests for writebackCache async flush

Add comprehensive FUSE integration tests for the writebackCache
async flush feature (issue #8718):

- Basic operations: write/read, sequential files, large files, empty
  files, overwrites
- Fsync correctness: fsync forces synchronous flush even in writeback
  mode, immediate read-after-fsync
- Concurrent small files: multi-worker parallel writes (rsync-like
  workload), multi-directory, rapid create/close
- Data integrity: append after close, partial writes, file size
  correctness, binary data preservation
- Performance comparison: writeback vs synchronous flush throughput
- Stress test: 16 workers x 100 files with content verification
- Mixed concurrent operations: reads, writes, creates running together

Also fix pre-existing test infrastructure issues:
- Rename framework.go to framework_test.go (fixes Go package conflict)
- Fix undefined totalSize variable in concurrent_operations_test.go

* ci: update fuse-integration workflow to run full test suite

The workflow previously only ran placeholder tests (simple_test.go,
working_demo_test.go) in a temp directory due to a Go module conflict.
Now that framework.go is renamed to framework_test.go, the full test
suite compiles and runs correctly from test/fuse_integration/.

Changes:
- Run go test directly in test/fuse_integration/ (no temp dir copy)
- Install weed binary to /usr/local/bin for test framework discovery
- Configure /etc/fuse.conf with user_allow_other for FUSE mounts
- Install fuse3 for modern FUSE support
- Stream test output to log file for artifact upload

* mount: fix three P1 races in async flush

P1-1: Reopen overwrites data still flushing in background
ReleaseByHandle removes the old handle from fhMap before the deferred
flush finishes. A reopen of the same inode during that window would
build from stale filer metadata, overwriting the async flush.

Fix: Track in-flight async flushes per inode via pendingAsyncFlush map.
AcquireHandle now calls waitForPendingAsyncFlush(inode) to block until
any pending flush completes before reading filer metadata.

P1-2: Deferred flush races rename and unlink after close
completeAsyncFlush captured the path once at entry, but rename or
unlink after close() could cause metadata to be written under the
wrong name or recreate a deleted file.

Fix: Re-resolve path from inode via GetPath right before metadata
flush. GetPath returns the current path (reflecting renames) or
ENOENT (if unlinked), in which case we skip the metadata flush.

P1-3: SIGINT/SIGTERM bypasses the async-flush drain
grace.OnInterrupt runs hooks then calls os.Exit(0), so
WaitForAsyncFlush after server.Serve() never executes on signal.

Fix: Add WaitForAsyncFlush (with 10s timeout) to the WFS interrupt
handler, before cache cleanup. The timeout prevents hanging on Ctrl-C
when the filer is unreachable.

* mount: fix P1 races — draining handle stays in fhMap

P1-1: Reopen TOCTOU
The gap between ReleaseByHandle removing from fhMap and
submitAsyncFlush registering in pendingAsyncFlush allowed a
concurrent AcquireHandle to slip through with stale metadata.

Fix: Hold pendingAsyncFlushMu across both the counter decrement
(ReleaseByHandle) and the pending registration. The handle is
registered as pending before the lock is released, so
waitForPendingAsyncFlush always sees it.

P1-2: Rename/unlink can't find draining handle
ReleaseByHandle deleted from fhMap immediately. Rename's
FindFileHandle(inode) at line 251 could not find the handle to
update entry.Name. Unlink could not coordinate either.

Fix: When asyncFlushPending is true, ReleaseByHandle/ReleaseByInode
leave the handle in fhMap (counter=0 but maps intact). The handle
stays visible to FindFileHandle so rename can update entry.Name.
completeAsyncFlush re-resolves the path from the inode (GetPath)
right before metadata flush for correctness after rename/unlink.
After drain, RemoveFileHandle cleans up the maps.

Double-return prevention: ReleaseByHandle/ReleaseByInode return nil
if counter is already <= 0, so Forget after Release doesn't start a
second drain goroutine.

P1-3: SIGINT deletes swap files under running goroutines
After the 10s timeout, os.RemoveAll deleted the write cache dir
(containing swap files) while FlushData goroutines were still
reading from them.

Fix: Increase timeout to 30s. If timeout expires, skip write cache
dir removal so in-flight goroutines can finish reading swap files.
The OS (or next mount) cleans them up. Read cache is always removed.

* mount: never skip metadata flush when Forget drops inode mapping

Forget removes the inode→path mapping when the kernel's lookup count
reaches zero, but this does NOT mean the file was unlinked — it only
means the kernel evicted its cache entry.  completeAsyncFlush was
treating GetPath failure as "file unlinked" and skipping the metadata
flush, which orphaned the just-uploaded chunks for live files.

Fix: Save dir and name at doFlush defer time.  In completeAsyncFlush,
try GetPath first to pick up renames; if the mapping is gone, fall
back to the saved dir/name.  Always attempt the metadata flush — the
filer is the authority on whether the file exists, not the local
inode cache.

* mount: distinguish Forget from Unlink in async flush path fallback

The saved-path fallback (from the previous fix) always flushed
metadata when GetPath failed, which recreated files that were
explicitly unlinked after close().  The same stale fallback could
recreate the pre-rename path if Forget dropped the inode mapping
after a rename.

Root cause: GetPath failure has two meanings:
  1. Forget — kernel evicted the cache entry (file still exists)
  2. Unlink — file was explicitly deleted (should not recreate)

Fix (three coordinated changes):

Unlink (weedfs_file_mkrm.go): Before RemovePath, look up the inode
and find any draining handle via FindFileHandle.  Set fh.isDeleted =
true so the async flush knows the file was explicitly removed.

Rename (weedfs_rename.go): When renaming a file with a draining
handle, update asyncFlushDir/asyncFlushName to the post-rename
location.  This keeps the saved-path fallback current so Forget
after rename doesn't flush to the old (pre-rename) path.

completeAsyncFlush (weedfs_async_flush.go): Check fh.isDeleted
first — if true, skip metadata flush (file was unlinked, chunks
become orphans for volume.fsck).  Otherwise, try GetPath for the
current path (renames); fall back to saved path if Forget dropped
the mapping (file is live, just evicted from kernel cache).

* test/ci: address PR review nitpicks

concurrent_operations_test.go:
- Restore precise totalSize assertion instead of info.Size() > 0

writeback_cache_test.go:
- Check rand.Read errors in all 3 locations (lines 310, 512, 757)
- Check os.MkdirAll error in stress test (line 752)
- Remove dead verifyErrors variable (line 332)
- Replace both time.Sleep(5s) with polling via waitForFileContent
  to avoid flaky tests under CI load (lines 638, 700)

fuse-integration.yml:
- Add set -o pipefail so go test failures propagate through tee

* ci: fix fuse3/fuse package conflict on ubuntu-22.04 runner

fuse3 is pre-installed on ubuntu-22.04 runners and conflicts with
the legacy fuse package. Only install libfuse3-dev for the headers.

* mount/page_writer: remove debug println statements

Remove leftover debug println("read new data1/2") from
ReadDataAt in MemChunk and SwapFileChunk.

* test: fix findWeedBinary matching source directory instead of binary

findWeedBinary() matched ../../weed (the source directory) via
os.Stat before checking PATH, then tried to exec a directory
which fails with "permission denied" on the CI runner.

Fix: Check PATH first (reliable in CI where the binary is installed
to /usr/local/bin). For relative paths, verify the candidate is a
regular file (!info.IsDir()). Add ../../weed/weed as a candidate
for in-tree builds.

* test: fix framework — dynamic ports, output capture, data dirs

The integration test framework was failing in CI because:

1. All tests used hardcoded ports (19333/18080/18888), so sequential
   tests could conflict when prior processes hadn't fully released
   their ports yet.

2. Data subdirectories (data/master, data/volume) were not created
   before starting processes.

3. Master was started with -peers=none which is not a valid address.

4. Process stdout/stderr was not captured, making failures opaque
   ("service not ready within timeout" with no diagnostics).

5. The unmount fallback used 'umount' instead of 'fusermount -u'.

6. The mount used -cacheSizeMB (nonexistent) instead of
   -cacheCapacityMB and was missing -allowOthers=false for
   unprivileged CI runners.

Fixes:
- Dynamic port allocation via freePort() (net.Listen ":0")
- Explicit gRPC ports via -port.grpc to avoid default port conflicts
- Create data/master and data/volume directories in Setup()
- Remove invalid -peers=none and -raftBootstrap flags
- Capture process output to logDir/*.log via startProcess() helper
- dumpLog() prints tail of log file on service startup failure
- Use fusermount3/fusermount -u for unmount
- Fix mount flag names (-cacheCapacityMB, -allowOthers=false)

* test: remove explicit -port.grpc flags from test framework

SeaweedFS convention: gRPC port = HTTP port + 10000.  Volume and
filer discover the master gRPC port by this convention.  Setting
explicit -port.grpc on master/volume/filer broke inter-service
communication because the volume server computed master gRPC as
HTTP+10000 but the actual gRPC was on a different port.

Remove all -port.grpc flags and let the default convention work.
Dynamic HTTP ports already ensure uniqueness; the derived gRPC
ports (HTTP+10000) will also be unique.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-22 15:24:08 -07:00
Chris Lu
d6a872c4b9 Preserve explicit directory markers with octet-stream MIME (#8726)
* Preserve octet-stream MIME on explicit directory markers

* Run empty directory marker regression in CI

* Run S3 Spark workflow for filer changes
2026-03-21 19:31:56 -07:00
Chris Lu
80f3079d2a fix(s3): include directory markers in ListObjects without delimiter (#8704)
* fix(s3): include directory markers in ListObjects without delimiter (#8698)

Directory key objects (zero-byte objects with keys ending in "/") created
via PutObject were omitted from ListObjects/ListObjectsV2 results when no
delimiter was specified. AWS S3 includes these as regular keys in Contents.

The issue was in doListFilerEntries: when recursing into directories in
non-delimiter mode, directory key objects were only emitted when
prefixEndsOnDelimiter was true. Added an else branch to emit them in the
general recursive case as well.

* remove issue reference from inline comment

* test: add child-under-marker and paginated listing coverage

Extend test 6 to place a child object under the directory marker
and paginate with MaxKeys=1 so the emit-then-recurse truncation
path is exercised.

* fix(test): skip directory markers in Spark temporary artifacts check

The listing check now correctly shows directory markers (keys ending
in "/") after the ListObjects fix. These 0-byte metadata objects are
not data artifacts — filter them from the listing check since the
HeadObject-based check already verifies their cleanup with a timeout.
2026-03-19 15:36:11 -07:00
dependabot[bot]
55bc363228 build(deps): bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 in /test/kafka (#8703)
build(deps): bump github.com/buger/jsonparser in /test/kafka

Bumps [github.com/buger/jsonparser](https://github.com/buger/jsonparser) from 1.1.1 to 1.1.2.
- [Release notes](https://github.com/buger/jsonparser/releases)
- [Commits](https://github.com/buger/jsonparser/compare/v1.1.1...v1.1.2)

---
updated-dependencies:
- dependency-name: github.com/buger/jsonparser
  dependency-version: 1.1.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-19 12:47:50 -07:00
dependabot[bot]
0b13ab04b7 build(deps): bump google.golang.org/grpc from 1.78.0 to 1.79.3 in /test/kafka (#8691)
build(deps): bump google.golang.org/grpc in /test/kafka

Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.78.0 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.78.0...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-19 10:54:54 -07:00
Chris Lu
15f4a97029 fix: improve raft leader election reliability and failover speed (#8692)
* fix: clear raft vote state file on non-resume startup

The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.

Fixes #8690

* fix: recover TopologyId from snapshot before clearing raft state

When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.

Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.

* fix: stagger multi-master bootstrap delay by peer index

Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.

Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.

* fix: skip unreachable masters during leader reconnection

When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.

Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.

* fix: take snapshot after TopologyId generation for recovery

After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.

* test: add multi-master raft failover integration tests

Integration test framework and 5 test scenarios for 3-node master
clusters:

- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
  TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
  old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
  continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
  when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
  all nodes agree on TopologyId

* fix: address PR review comments

- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them

* fix: address review comments on failover tests

- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
  ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
2026-03-18 23:28:07 -07:00
dependabot[bot]
558a83661e build(deps): bump org.apache.spark:spark-core_2.12 from 3.5.0 to 3.5.7 in /test/java/spark (#8674)
build(deps): bump org.apache.spark:spark-core_2.12 in /test/java/spark

Bumps org.apache.spark:spark-core_2.12 from 3.5.0 to 3.5.7.

---
updated-dependencies:
- dependency-name: org.apache.spark:spark-core_2.12
  dependency-version: 3.5.7
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-17 13:27:38 -07:00
Chris Lu
9984ce7dcb fix(s3): omit NotResource:null from bucket policy JSON response (#8658)
* fix(s3): omit NotResource:null from bucket policy JSON response (#8657)

Change NotResource from value type to pointer (*StringOrStringSlice) so
that omitempty properly omits it when unset, matching the existing
Principal field pattern. This prevents IaC tools (Terraform, Ansible)
from detecting false configuration drift.

Add bucket policy round-trip idempotency integration tests.

* simplify JSON comparison in bucket policy idempotency test

Use require.JSONEq directly on the raw JSON strings instead of
round-tripping through unmarshal/marshal, since JSONEq already
handles normalization internally.

* fix bucket policy test cases that locked out the admin user

The Deny+NotResource test cases used Action:"s3:*" which denied the
admin's own GetBucketPolicy call. Scope deny to s3:GetObject only,
and add an Allow+NotResource variant instead.

* fix(s3): also make Resource a pointer to fix empty string in JSON

Apply the same omitempty pointer fix to the Resource field, which was
emitting "Resource":"" when only NotResource was set. Add
NewStringOrStringSlicePtr helper, make Strings() nil-safe, and handle
*StringOrStringSlice in normalizeToStringSliceWithError.

* improve bucket policy integration tests per review feedback

- Replace time.Sleep with waitForClusterReady using ListBuckets
- Use structural hasKey check instead of brittle substring NotContains
- Assert specific NoSuchBucketPolicy error code after delete
- Handle single-statement policies in hasKey helper
2026-03-16 12:58:26 -07:00
Chris Lu
a00eddb525 Iceberg table maintenance Phase 3: multi-spec compaction, delete handling, and metrics (#8643)
* Add multi-partition-spec compaction and delete-aware compaction (Phase 3)

Multi-partition-spec compaction:
- Add SpecID to compactionBin struct and group by spec+partition key
- Remove the len(specIDs) > 1 skip that blocked spec-evolved tables
- Write per-spec manifests in compaction commit using specByID map
- Use per-bin PartitionSpec when calling NewDataFileBuilder

Delete-aware compaction:
- Add ApplyDeletes config (default: true) with readBoolConfig helper
- Implement position delete collection (file_path + pos Parquet columns)
- Implement equality delete collection (field ID to column mapping)
- Update mergeParquetFiles to filter rows via position deletes (binary
  search) and equality deletes (hash set lookup)
- Smart delete manifest carry-forward: drop when all data files compacted
- Fix EXISTING/DELETED entries to include sequence numbers

Tests for multi-spec bins, delete collection, merge filtering, and
end-to-end compaction with position/equality/mixed deletes.

* Add structured metrics and per-bin progress to iceberg maintenance

- Change return type of all four operations from (string, error) to
  (string, map[string]int64, error) with structured metric counts
  (files_merged, snapshots_expired, orphans_removed, duration_ms, etc.)
- Add onProgress callback to compactDataFiles for per-bin progress
- In Execute, pass progress callback that sends JobProgressUpdate with
  per-bin stage messages
- Accumulate per-operation metrics with dot-prefixed keys
  (e.g. compact.files_merged) into OutputValues on completion
- Update testing_api.go wrappers and integration test call sites
- Add tests: TestCompactDataFilesMetrics, TestExpireSnapshotsMetrics,
  TestExecuteCompletionOutputValues

* Address review feedback: group equality deletes by field IDs, use metric constants

- Group equality deletes by distinct equality_ids sets so different
  delete files with different equality columns are handled correctly
- Use length-prefixed type-aware encoding in buildEqualityKey to avoid
  ambiguity between types and collisions from null bytes
- Extract metric key strings into package-level constants

* Fix buildEqualityKey to use length-prefixed type-aware encoding

The previous implementation used plain String() concatenation with null
byte separators, which caused type ambiguity (int 123 vs string "123")
and separator collisions when values contain null bytes. Now each value
is serialized as "kind:length:value" for unambiguous composite keys.

This fix was missed in the prior cherry-pick due to a merge conflict.

* Address nitpick review comments

- Document patchManifestContentToDeletes workaround: explain that
  iceberg-go WriteManifest cannot create delete manifests, and note
  the fail-fast validation on pattern match
- Document makeTestEntries: note that specID field is ignored and
  callers should use makeTestEntriesWithSpec for multi-spec testing

* fmt

* Fix path normalization, manifest threshold, and artifact filename collisions

- Normalize file paths in position delete collection and lookup so that
  absolute S3 URLs and relative paths match correctly
- Fix rewriteManifests threshold check to count only data manifests
  (was including delete manifests in the count and metric)
- Add random suffix to artifact filenames in compactDataFiles and
  rewriteManifests to prevent collisions between concurrent runs
- Sort compaction bins by SpecID then PartitionKey for deterministic
  ordering across specs

* Fix pos delete read, deduplicate column resolution, minor cleanups

- Remove broken Column() guard in position delete reading that silently
  defaulted pos to 0; unconditionally extract Int64() instead
- Deduplicate column resolution in readEqualityDeleteFile by calling
  resolveEqualityColIndices instead of inlining the same logic
- Add warning log in readBoolConfig for unrecognized string values
- Fix CompactDataFiles call site in integration test to capture 3 return
  values

* Advance progress on all bins, deterministic manifest order, assert metrics

- Call onProgress for every bin iteration including skipped/failed bins
  so progress reporting never appears stalled
- Sort spec IDs before iterating specEntriesMap to produce deterministic
  manifest list ordering across runs
- Assert expected metric keys in CompactDataFiles integration test

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-15 19:18:14 -07:00
Chris Lu
e24630251c iceberg: handle filer-backed compaction inputs (#8638)
* iceberg: handle filer-backed compaction inputs

* iceberg: preserve upsert creation times

* iceberg: align compaction test schema

* iceberg: tighten compact output assertion

* iceberg: document compact output match

* iceberg: clear stale chunks in upsert helper

* iceberg: strengthen compaction integration coverage
2026-03-15 17:46:06 -07:00
Chris Lu
8cde3d4486 Add data file compaction to iceberg maintenance (Phase 2) (#8503)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add data file compaction to iceberg maintenance handler (Phase 2)

Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata

Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix memory exhaustion in mergeParquetFiles by processing files sequentially

Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.

* Validate bucket/namespace/table names against path traversal

Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.

* Add filer address failover in iceberg maintenance handler

Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.

* Add separate MinManifestsToRewrite config for manifest rewrite threshold

The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.

* Fix risky mtime fallback in orphan removal that could delete new files

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.

* Fix undefined function references in iceberg_maintenance_handler.go

Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.

* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage

The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.

Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.

* Fix MinManifestsToRewrite clamping to match UI minimum of 2

The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.

* Sort entries by size descending in splitOversizedBin for better packing

Entries were processed in insertion order which is non-deterministic
from map iteration. Sorting largest-first before the splitting loop
improves bin packing efficiency by filling bins more evenly.

* Add context cancellation check to drainReader loop

The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.

* Fix splitOversizedBin to always respect targetSize limit

The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.

* Add integration tests for iceberg table maintenance plugin worker

Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.

Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.

* Fix splitOversizedBin dropping files and add source path to drainReader errors

The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.

Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.

* Harden integration test error handling

- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
  continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
  errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
  CreateEntryResponse.Error in addition to the RPC error

* go fmt

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 11:27:42 -07:00
Chris Lu
3d9f7f6f81 go 1.25 2026-03-09 23:10:27 -07:00
Chris Lu
1bd7a98a4a simplify plugin scheduler: remove configurable IdleSleepSeconds, use constant 61s
The SchedulerConfig struct and its persistence/API were unnecessary
indirection. Replace with a simple constant (reduced from 613s to 61s)
so the scheduler re-checks for detectable job types promptly after
going idle, improving the clean-install experience.
2026-03-09 22:41:03 -07:00
Chris Lu
f220328ae4 test: assert ReadFileStatusCount in batch execution test
Verify that pre-delete verification called ReadVolumeFileStatus on
both source and target for each volume move.
2026-03-09 19:33:06 -07:00
Chris Lu
5f85bf5e8a Batch volume balance: run multiple moves per job (#8561)
* proto: add BalanceMoveSpec and batch fields to BalanceTaskParams

Add BalanceMoveSpec message for encoding individual volume moves,
and max_concurrent_moves + repeated moves fields to BalanceTaskParams
to support batching multiple volume moves in a single job.

* balance handler: add batch execution with concurrent volume moves

Refactor Execute() into executeSingleMove() (backward compatible) and
executeBatchMoves() which runs multiple volume moves concurrently using
a semaphore-bounded goroutine pool. When BalanceTaskParams.Moves is
populated, the batch path is taken; otherwise the single-move path.

Includes aggregate progress reporting across concurrent moves,
per-move error collection, and partial failure support.

* balance handler: add batch config fields to Descriptor and worker config

Add max_concurrent_moves and batch_size fields to the worker config
form and deriveBalanceWorkerConfig(). These control how many volume
moves run concurrently within a batch job and the maximum batch size.

* balance handler: group detection proposals into batch jobs

When batch_size > 1, the Detect method groups detection results into
batch proposals where each proposal encodes multiple BalanceMoveSpec
entries in BalanceTaskParams.Moves. Single-result batches fall back
to the existing single-move proposal format for backward compatibility.

* admin UI: add volume balance execution plan and batch badge

Add renderBalanceExecutionPlan() for rich rendering of volume balance
jobs in the job detail modal. Single-move jobs show source/target/volume
info; batch jobs show a moves table with all volume moves.

Add batch badge (e.g., "5 moves") next to job type in the execution
jobs table when the job has batch=true label.

* Update plugin_templ.go

* fix: detection algorithm uses greedy target instead of divergent topology scores

The detection loop tracked effective volume counts via an adjustments map,
but createBalanceTask independently called planBalanceDestination which used
the topology's LoadCount — a separate, unadjusted source of truth. This
divergence caused multiple moves to pile onto the same server.

Changes:
- Add resolveBalanceDestination to resolve the detection loop's greedy
  target (minServer) rather than independently picking a destination
- Add oscillation guard: stop when max-min <= 1 since no single move
  can improve the balance beyond that point
- Track unseeded destinations: if a target server wasn't in the initial
  serverVolumeCounts, add it so subsequent iterations include it
- Add TestDetection_UnseededDestinationDoesNotOverload

* fix: handler force_move propagation, partial failure, deterministic dedupe

- Propagate ForceMove from outer BalanceTaskParams to individual move
  TaskParams so batch moves respect the force_move flag
- Fix partial failure: mark job successful if at least one move
  succeeded (succeeded > 0 || failed == 0) to avoid re-running
  already-completed moves on retry
- Use SHA-256 hash for deterministic dedupe key fallback instead of
  time.Now().UnixNano() which is non-deterministic
- Remove unused successDetails variable
- Extract maxProposalStringLength constant to replace magic number 200

* admin UI: use template literals in balance execution plan rendering

* fix: integration test handles batch proposals from batched detection

With batch_size=20, all moves are grouped into a single proposal
containing BalanceParams.Moves instead of top-level Sources/Targets.
Update assertions to handle both batch and single-move proposal formats.

* fix: verify volume size on target before deleting source during balance

Add a pre-delete safety check that reads the volume file status on both
source and target, then compares .dat file size and file count. If they
don't match, the move is aborted — leaving the source intact rather than
risking irreversible data loss.

Also removes the redundant mountVolume call since VolumeCopy already
mounts the volume on the target server.

* fix: clamp maxConcurrent, serialize progress sends, validate config as int64

- Clamp maxConcurrentMoves to defaultMaxConcurrentMoves before creating
  the semaphore so a stale or malicious job cannot request unbounded
  concurrent volume moves
- Extend progressMu to cover sender.SendProgress calls since the
  underlying gRPC stream is not safe for concurrent writes
- Perform bounds checks on max_concurrent_moves and batch_size in int64
  space before casting to int, avoiding potential overflow on 32-bit

* fix: check disk capacity in resolveBalanceDestination

Skip disks where VolumeCount >= MaxVolumeCount so the detection loop
does not propose moves to a full disk that would fail at execution time.

* test: rename unseeded destination test to match actual behavior

The test exercises a server with 0 volumes that IS seeded from topology
(matching disk type), not an unseeded destination. Rename to
TestDetection_ZeroVolumeServerIncludedInBalance and fix comments.

* test: tighten integration test to assert exactly one batch proposal

With default batch_size=20, all moves should be grouped into a single
batch proposal. Assert len(proposals)==1 and require BalanceParams with
Moves, removing the legacy single-move else branch.

* fix: propagate ctx to RPCs and restore source writability on abort

- All helper methods (markVolumeReadonly, copyVolume, tailVolume,
  readVolumeFileStatus, deleteVolume) now accept a context parameter
  instead of using context.Background(), so Execute's ctx propagates
  cancellation and timeouts into every volume server RPC
- Add deferred cleanup that restores the source volume to writable if
  any step after markVolumeReadonly fails, preventing the source from
  being left permanently readonly on abort
- Add markVolumeWritable helper using VolumeMarkWritableRequest

* fix: deep-copy protobuf messages in test recording sender

Use proto.Clone in recordingExecutionSender to store immutable snapshots
of JobProgressUpdate and JobCompleted, preventing assertions from
observing mutations if the handler reuses message pointers.

* fix: add VolumeMarkWritable and ReadVolumeFileStatus to fake volume server

The balance task now calls ReadVolumeFileStatus for pre-delete
verification and VolumeMarkWritable to restore writability on abort.
Add both RPCs to the test fake, and drop the mountCalls assertion since
BalanceTask no longer calls VolumeMount directly (VolumeCopy handles it).

* fix: use maxConcurrentMovesLimit (50) for clamp, not defaultMaxConcurrentMoves

defaultMaxConcurrentMoves (5) is the fallback when the field is unset,
not an upper bound. Clamping to it silently overrides valid config
values like 10/20/50. Introduce maxConcurrentMovesLimit (50) matching
the descriptor's MaxValue and clamp to that instead.

* fix: cancel batch moves on progress stream failure

Derive a cancellable batchCtx from the caller's ctx. If
sender.SendProgress returns an error (client disconnect, context
cancelled), capture it, skip further sends, and cancel batchCtx so
in-flight moves abort via their propagated context rather than running
blind to completion.

* fix: bound cleanup timeout and validate batch move fields

- Use a 30-second timeout for the deferred markVolumeWritable cleanup
  instead of context.Background() which can block indefinitely if the
  volume server is unreachable
- Validate required fields (VolumeID, SourceNode, TargetNode) before
  appending moves to a batch proposal, skipping invalid entries
- Fall back to a single-move proposal when filtering leaves only one
  valid move in a batch

* fix: cancel task execution on SendProgress stream failure

All handler progress callbacks previously ignored SendProgress errors,
allowing tasks to continue executing after the client disconnected.
Now each handler creates a derived cancellable context and cancels it
on the first SendProgress error, stopping the in-flight task promptly.

Handlers fixed: erasure_coding, vacuum, volume_balance (single-move),
and admin_script (breaks command loop on send failure).

* fix: validate batch moves before scheduling in executeBatchMoves

Reject empty batches, enforce a hard upper bound (100 moves), and
filter out nil or incomplete move specs (missing source/target/volume)
before allocating progress tracking and launching goroutines.

* test: add batch balance execution integration test

Tests the batch move path with 3 volumes, max concurrency 2, using
fake volume servers. Verifies all moves complete with correct readonly,
copy, tail, and delete RPC counts.

* test: add MarkWritableCount and ReadFileStatusCount accessors

Expose the markWritableCalls and readFileStatusCalls counters on the
fake volume server, following the existing MarkReadonlyCount pattern.

* fix: oscillation guard uses global effective counts for heterogeneous capacity

The oscillation guard (max-min <= 1) previously used maxServer/minServer
which are determined by utilization ratio. With heterogeneous capacity,
maxServer by utilization can have fewer raw volumes than minServer,
producing a negative diff and incorrectly triggering the guard.

Now scans all servers' effective counts to find the true global max/min
volume counts, so the guard works correctly regardless of whether
utilization-based or raw-count balancing is used.

* fix: admin script handler breaks outer loop on SendProgress failure

The break on SendProgress error inside the shell.Commands scan only
exited the inner loop, letting the outer command loop continue
executing commands on a broken stream. Use a sendBroken flag to
propagate the break to the outer execCommands loop.
2026-03-09 19:30:08 -07:00
dependabot[bot]
e1c4faba38 build(deps): bump org.apache.zookeeper:zookeeper from 3.9.4 to 3.9.5 in /test/java/spark (#8580)
* build(deps): bump org.apache.zookeeper:zookeeper in /test/java/spark

Bumps org.apache.zookeeper:zookeeper from 3.9.4 to 3.9.5.

---
updated-dependencies:
- dependency-name: org.apache.zookeeper:zookeeper
  dependency-version: 3.9.5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: use go-version-file instead of hardcoded Go version in CI workflows

The hardcoded go-version '1.24' is too old for go.mod which requires
go >= 1.25.0, causing build failures in Spark integration tests.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-09 17:29:53 -07:00
Chris Lu
992db11d2b iam: add IAM group management (#8560)
* iam: add Group message to protobuf schema

Add Group message (name, members, policy_names, disabled) and
add groups field to S3ApiConfiguration for IAM group management
support (issue #7742).

* iam: add group CRUD to CredentialStore interface and all backends

Add group management methods (CreateGroup, GetGroup, DeleteGroup,
ListGroups, UpdateGroup) to the CredentialStore interface with
implementations for memory, filer_etc, postgres, and grpc stores.
Wire group loading/saving into filer_etc LoadConfiguration and
SaveConfiguration.

* iam: add group IAM response types

Add XML response types for group management IAM actions:
CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser.

* iam: add group management handlers to embedded IAM API

Add CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, and ListGroupsForUser handlers with
dispatch in ExecuteAction.

* iam: add group management handlers to standalone IAM API

Add group handlers (CreateGroup, DeleteGroup, GetGroup, ListGroups,
AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser) and wire into DoActions
dispatch. Also add helper functions for user/policy side effects.

* iam: integrate group policies into authorization

Add groups and userGroups reverse index to IdentityAccessManagement.
Populate both maps during ReplaceS3ApiConfiguration and
MergeS3ApiConfiguration. Modify evaluateIAMPolicies to evaluate
policies from user's enabled groups in addition to user policies.
Update VerifyActionPermission to consider group policies when
checking hasAttachedPolicies.

* iam: add group side effects on user deletion and rename

When a user is deleted, remove them from all groups they belong to.
When a user is renamed, update group membership references. Applied
to both embedded and standalone IAM handlers.

* iam: watch /etc/iam/groups directory for config changes

Add groups directory to the filer subscription watcher so group
file changes trigger IAM configuration reloads.

* admin: add group management page to admin UI

Add groups page with CRUD operations, member management, policy
attachment, and enable/disable toggle. Register routes in admin
handlers and add Groups entry to sidebar navigation.

* test: add IAM group management integration tests

Add comprehensive integration tests for group CRUD, membership,
policy attachment, policy enforcement, disabled group behavior,
user deletion side effects, and multi-group membership. Add
"group" test type to CI matrix in s3-iam-tests workflow.

* iam: address PR review comments for group management

- Fix XSS vulnerability in groups.templ: replace innerHTML string
  concatenation with DOM APIs (createElement/textContent) for rendering
  member and policy lists
- Use userGroups reverse index in embedded IAM ListGroupsForUser for
  O(1) lookup instead of iterating all groups
- Add buildUserGroupsIndex helper in standalone IAM handlers; use it
  in ListGroupsForUser and removeUserFromAllGroups for efficient lookup
- Add note about gRPC store load-modify-save race condition limitation

* iam: add defensive copies, validation, and XSS fixes for group management

- Memory store: clone groups on store/retrieve to prevent mutation
- Admin dash: deep copy groups before mutation, validate user/policy exists
- HTTP handlers: translate credential errors to proper HTTP status codes,
  use *bool for Enabled field to distinguish missing vs false
- Groups templ: use data attributes + event delegation instead of inline
  onclick for XSS safety, prevent stale async responses

* iam: add explicit group methods to PropagatingCredentialStore

Add CreateGroup, GetGroup, DeleteGroup, ListGroups, and UpdateGroup
methods instead of relying on embedded interface fallthrough. Group
changes propagate via filer subscription so no RPC propagation needed.

* iam: detect postgres unique constraint violation and add groups index

Return ErrGroupAlreadyExists when INSERT hits SQLState 23505 instead of
a generic error. Add index on groups(disabled) for filtered queries.

* iam: add Marker field to group list response types

Add Marker string field to GetGroupResult, ListGroupsResult,
ListAttachedGroupPoliciesResult, and ListGroupsForUserResult to
match AWS IAM pagination response format.

* iam: check group attachment before policy deletion

Reject DeletePolicy if the policy is attached to any group, matching
AWS IAM behavior. Add PolicyArn to ListAttachedGroupPolicies response.

* iam: include group policies in IAM authorization

Merge policy names from user's enabled groups into the IAMIdentity
used for authorization, so group-attached policies are evaluated
alongside user-attached policies.

* iam: check for name collision before renaming user in UpdateUser

Scan identities and inline policies for newUserName before mutating,
returning EntityAlreadyExists if a collision is found. Reuse the
already-loaded policies instead of loading them again inside the loop.

* test: use t.Cleanup for bucket cleanup in group policy test

* iam: wrap ErrUserNotInGroup sentinel in RemoveGroupMember error

Wrap credential.ErrUserNotInGroup so errors.Is works in
groupErrorToHTTPStatus, returning proper 400 instead of 500.

* admin: regenerate groups_templ.go with XSS-safe data attributes

Regenerated from groups.templ which uses data-group-name attributes
instead of inline onclick with string interpolation.

* iam: add input validation and persist groups during migration

- Validate nil/empty group name in CreateGroup and UpdateGroup
- Save groups in migrateToMultiFile so they survive legacy migration

* admin: use groupErrorToHTTPStatus in GetGroupMembers and GetGroupPolicies

* iam: short-circuit UpdateUser when newUserName equals current name

* iam: require empty PolicyNames before group deletion

Reject DeleteGroup when group has attached policies, matching the
existing members check. Also fix GetGroup error handling in
DeletePolicy to only skip ErrGroupNotFound, not all errors.

* ci: add weed/pb/** to S3 IAM test trigger paths

* test: replace time.Sleep with require.Eventually for propagation waits

Use polling with timeout instead of fixed sleeps to reduce flakiness
in integration tests waiting for IAM policy propagation.

* fix: use credentialManager.GetPolicy for AttachGroupPolicy validation

Policies created via CreatePolicy through credentialManager are stored
in the credential store, not in s3cfg.Policies (which only has static
config policies). Change AttachGroupPolicy to use credentialManager.GetPolicy()
for policy existence validation.

* feat: add UpdateGroup handler to embedded IAM API

Add UpdateGroup action to enable/disable groups and rename groups
via the IAM API. This is a SeaweedFS extension (not in AWS SDK) used
by tests to toggle group disabled status.

* fix: authenticate raw IAM API calls in group tests

The embedded IAM endpoint rejects anonymous requests. Replace
callIAMAPI with callIAMAPIAuthenticated that uses JWT bearer token
authentication via the test framework.

* feat: add UpdateGroup handler to standalone IAM API

Mirror the embedded IAM UpdateGroup handler in the standalone IAM API
for parity.

* fix: add omitempty to Marker XML tags in group responses

Non-truncated responses should not emit an empty <Marker/> element.

* fix: distinguish backend errors from missing policies in AttachGroupPolicy

Return ServiceFailure for credential manager errors instead of masking
them as NoSuchEntity. Also switch ListGroupsForUser to use s3cfg.Groups
instead of in-memory reverse index to avoid stale data. Add duplicate
name check to UpdateGroup rename.

* fix: standalone IAM AttachGroupPolicy uses persisted policy store

Check managed policies from GetPolicies() instead of s3cfg.Policies
so dynamically created policies are found. Also add duplicate name
check to UpdateGroup rename.

* fix: rollback inline policies on UpdateUser PutPolicies failure

If PutPolicies fails after moving inline policies to the new username,
restore both the identity name and the inline policies map to their
original state to avoid a partial-write window.

* fix: correct test cleanup ordering for group tests

Replace scattered defers with single ordered t.Cleanup in each test
to ensure resources are torn down in reverse-creation order:
remove membership, detach policies, delete access keys, delete users,
delete groups, delete policies. Move bucket cleanup to parent test
scope and delete objects before bucket.

* fix: move identity nil check before map lookup and refine hasAttachedPolicies

Move the nil check on identity before accessing identity.Name to
prevent panic. Also refine hasAttachedPolicies to only consider groups
that are enabled and have actual policies attached, so membership in
a no-policy group doesn't incorrectly trigger IAM authorization.

* fix: fail group reload on unreadable or corrupt group files

Return errors instead of logging and continuing when group files
cannot be read or unmarshaled. This prevents silently applying a
partial IAM config with missing group memberships or policies.

* fix: use errors.Is for sql.ErrNoRows comparison in postgres group store

* docs: explain why group methods skip propagateChange

Group changes propagate to S3 servers via filer subscription
(watching /etc/iam/groups/) rather than gRPC RPCs, since there
are no group-specific RPCs in the S3 cache protocol.

* fix: remove unused policyNameFromArn and strings import

* fix: update service account ParentUser on user rename

When renaming a user via UpdateUser, also update ParentUser references
in service accounts to prevent them from becoming orphaned after the
next configuration reload.

* fix: wrap DetachGroupPolicy error with ErrPolicyNotAttached sentinel

Use credential.ErrPolicyNotAttached so groupErrorToHTTPStatus maps
it to 400 instead of falling back to 500.

* fix: use admin S3 client for bucket cleanup in enforcement test

The user S3 client may lack permissions by cleanup time since the
user is removed from the group in an earlier subtest. Use the admin
S3 client to ensure bucket and object cleanup always succeeds.

* fix: add nil guard for group param in propagating store log calls

Prevent potential nil dereference when logging group.Name in
CreateGroup and UpdateGroup of PropagatingCredentialStore.

* fix: validate Disabled field in UpdateGroup handlers

Reject values other than "true" or "false" with InvalidInputException
instead of silently treating them as false.

* fix: seed mergedGroups from existing groups in MergeS3ApiConfiguration

Previously the merge started with empty group maps, dropping any
static-file groups. Now seeds from existing iam.groups before
overlaying dynamic config, and builds the reverse index after
merging to avoid stale entries from overridden groups.

* fix: use errors.Is for filer_pb.ErrNotFound comparison in group loading

Replace direct equality (==) with errors.Is() to correctly match
wrapped errors, consistent with the rest of the codebase.

* fix: add ErrUserNotFound and ErrPolicyNotFound to groupErrorToHTTPStatus

Map these sentinel errors to 404 so AddGroupMember and
AttachGroupPolicy return proper HTTP status codes.

* fix: log cleanup errors in group integration tests

Replace fire-and-forget cleanup calls with error-checked versions
that log failures via t.Logf for debugging visibility.

* fix: prevent duplicate group test runs in CI matrix

The basic lane's -run "TestIAM" regex also matched TestIAMGroup*
tests, causing them to run in both the basic and group lanes.
Replace with explicit test function names.

* fix: add GIN index on groups.members JSONB for membership lookups

Without this index, ListGroupsForUser and membership queries
require full table scans on the groups table.

* fix: handle cross-directory moves in IAM config subscription

When a file is moved out of an IAM directory (e.g., /etc/iam/groups),
the dir variable was overwritten with NewParentPath, causing the
source directory change to be missed. Now also notifies handlers
about the source directory for cross-directory moves.

* fix: validate members/policies before deleting group in admin handler

AdminServer.DeleteGroup now checks for attached members and policies
before delegating to credentialManager, matching the IAM handler guards.

* fix: merge groups by name instead of blind append during filer load

Match the identity loader's merge behavior: find existing group
by name and replace, only append when no match exists. Prevents
duplicates when legacy and multi-file configs overlap.

* fix: check DeleteEntry response error when cleaning obsolete group files

Capture and log resp.Error from filer DeleteEntry calls during
group file cleanup, matching the pattern used in deleteGroupFile.

* fix: verify source user exists before no-op check in UpdateUser

Reorder UpdateUser to find the source identity first and return
NoSuchEntityException if not found, before checking if the rename
is a no-op. Previously a non-existent user renamed to itself
would incorrectly return success.

* fix: update service account parent refs on user rename in embedded IAM

The embedded IAM UpdateUser handler updated group membership but
not service account ParentUser fields, unlike the standalone handler.

* fix: replay source-side events for all handlers on cross-dir moves

Pass nil newEntry to bucket, IAM, and circuit-breaker handlers for
the source directory during cross-directory moves, so all watchers
can clear caches for the moved-away resource.

* fix: don't seed mergedGroups from existing iam.groups in merge

Groups are always dynamic (from filer), never static (from s3.config).
Seeding from iam.groups caused stale deleted groups to persist.
Now only uses config.Groups from the dynamic filer config.

* fix: add deferred user cleanup in TestIAMGroupUserDeletionSideEffect

Register t.Cleanup for the created user so it gets cleaned up
even if the test fails before the inline DeleteUser call.

* fix: assert UpdateGroup HTTP status in disabled group tests

Add require.Equal checks for 200 status after UpdateGroup calls
so the test fails immediately on API errors rather than relying
on the subsequent Eventually timeout.

* fix: trim whitespace from group name in filer store operations

Trim leading/trailing whitespace from group.Name before validation
in CreateGroup and UpdateGroup to prevent whitespace-only filenames.
Also merge groups by name during multi-file load to prevent duplicates.

* fix: add nil/empty group validation in gRPC store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics and invalid persistence.

* fix: add nil/empty group validation in postgres store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil member access and empty-name row inserts.

* fix: add name collision check in embedded IAM UpdateUser

The embedded IAM handler renamed users without checking if the
target name already existed, unlike the standalone handler.

* fix: add ErrGroupNotEmpty sentinel and map to HTTP 409

AdminServer.DeleteGroup now wraps conflict errors with
ErrGroupNotEmpty, and groupErrorToHTTPStatus maps it to
409 Conflict instead of 500.

* fix: use appropriate error message in GetGroupDetails based on status

Return "Group not found" only for 404, use "Failed to retrieve group"
for other error statuses instead of always saying "Group not found".

* fix: use backend-normalized group.Name in CreateGroup response

After credentialManager.CreateGroup may normalize the name (e.g.,
trim whitespace), use group.Name instead of the raw input for
the returned GroupData to ensure consistency.

* fix: add nil/empty group validation in memory store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil pointer dereference on map access.

* fix: reorder embedded IAM UpdateUser to verify source first

Find the source identity before checking for collisions, matching
the standalone handler's logic. Previously a non-existent user
renamed to an existing name would get EntityAlreadyExists instead
of NoSuchEntity.

* fix: handle same-directory renames in metadata subscription

Replay a delete event for the old entry name during same-directory
renames so handlers like onBucketMetadataChange can clean up stale
state for the old name.

* fix: abort GetGroups on non-ErrGroupNotFound errors

Only skip groups that return ErrGroupNotFound. Other errors (e.g.,
transient backend failures) now abort the handler and return the
error to the caller instead of silently producing partial results.

* fix: add aria-label and title to icon-only group action buttons

Add accessible labels to View and Delete buttons so screen readers
and tooltips provide meaningful context.

* fix: validate group name in saveGroup to prevent invalid filenames

Trim whitespace and reject empty names before writing group JSON
files, preventing creation of files like ".json".

* fix: add /etc/iam/groups to filer subscription watched directories

The groups directory was missing from the watched directories list,
so S3 servers in a cluster would not detect group changes made by
other servers via filer. The onIamConfigChange handler already had
code to handle group directory changes but it was never triggered.

* add direct gRPC propagation for group changes to S3 servers

Groups now have the same dual propagation as identities and policies:
direct gRPC push via propagateChange + async filer subscription.

- Add PutGroup/RemoveGroup proto messages and RPCs
- Add PutGroup/RemoveGroup in-memory cache methods on IAM
- Add PutGroup/RemoveGroup gRPC server handlers
- Update PropagatingCredentialStore to call propagateChange on group mutations

* reduce log verbosity for config load summary

Change ReplaceS3ApiConfiguration log from Infof to V(1).Infof
to avoid noisy output on every config reload.

* admin: show user groups in view and edit user modals

- Add Groups field to UserDetails and populate from credential manager
- Show groups as badges in user details view modal
- Add group management to edit user modal: display current groups,
  add to group via dropdown, remove from group via badge x button

* fix: remove duplicate showAlert that broke modal-alerts.js

admin.js defined showAlert(type, message) which overwrote the
modal-alerts.js version showAlert(message, type), causing broken
unstyled alert boxes. Remove the duplicate and swap all callers
in admin.js to use the correct (message, type) argument order.

* fix: unwrap groups API response in edit user modal

The /api/groups endpoint returns {"groups": [...]}, not a bare array.

* Update object_store_users_templ.go

* test: assert AccessDenied error code in group denial tests

Replace plain assert.Error checks with awserr.Error type assertion
and AccessDenied code verification, matching the pattern used in
other IAM integration tests.

* fix: propagate GetGroups errors in ShowGroups handler

getGroupsPageData was swallowing errors and returning an empty page
with 200 status. Now returns the error so ShowGroups can respond
with a proper error status.

* fix: reject AttachGroupPolicy when credential manager is nil

Previously skipped policy existence validation when credentialManager
was nil, allowing attachment of nonexistent policies. Now returns
a ServiceFailureException error.

* fix: preserve groups during partial MergeS3ApiConfiguration updates

UpsertIdentity calls MergeS3ApiConfiguration with a partial config
containing only the updated identity (nil Groups). This was wiping
all in-memory group state. Now only replaces groups when
config.Groups is non-nil (full config reload).

* fix: propagate errors from group lookup in GetObjectStoreUserDetails

ListGroups and GetGroup errors were silently ignored, potentially
showing incomplete group data in the UI.

* fix: use DOM APIs for group badge remove button to prevent XSS

Replace innerHTML with onclick string interpolation with DOM
createElement + addEventListener pattern. Also add aria-label
and title to the add-to-group button.

* fix: snapshot group policies under RLock to prevent concurrent map access

evaluateIAMPolicies was copying the map reference via groupMap :=
iam.groups under RLock then iterating after RUnlock, while PutGroup
mutates the map in-place. Now copies the needed policy names into
a slice while holding the lock.

* fix: add nil IAM check to PutGroup and RemoveGroup gRPC handlers

Match the nil guard pattern used by PutPolicy/DeletePolicy to
prevent nil pointer dereference when IAM is not initialized.
2026-03-09 11:54:32 -07:00
Chris Lu
bff084ff6a update go version 2026-03-09 11:12:05 -07:00
Chris Lu
78a3441b30 fix: volume balance detection returns multiple tasks per run (#8559)
* fix: volume balance detection now returns multiple tasks per run (#8551)

Previously, detectForDiskType() returned at most 1 balance task per disk
type, making the MaxJobsPerDetection setting ineffective. The detection
loop now iterates within each disk type, planning multiple moves until
the imbalance drops below threshold or maxResults is reached. Effective
volume counts are adjusted after each planned move so the algorithm
correctly re-evaluates which server is overloaded.

* fix: factor pending tasks into destination scoring and use UnixNano for task IDs

- Use UnixNano instead of Unix for task IDs to avoid collisions when
  multiple tasks are created within the same second
- Adjust calculateBalanceScore to include LoadCount (pending + assigned
  tasks) in the utilization estimate, so the destination picker avoids
  stacking multiple planned moves onto the same target disk

* test: add comprehensive balance detection tests for complex scenarios

Cover multi-server convergence, max-server shifting, destination
spreading, pre-existing pending task skipping, no-duplicate-volume
invariant, and parameterized convergence verification across different
cluster shapes and thresholds.

* fix: address PR review findings in balance detection

- hasMore flag: compute from len(results) >= maxResults so the scheduler
  knows more pages may exist, matching vacuum/EC handler pattern
- Exhausted server fallthrough: when no eligible volumes remain on the
  current maxServer (all have pending tasks) or destination planning
  fails, mark the server as exhausted and continue to the next
  overloaded server instead of stopping the entire detection loop
- Return canonical destination server ID directly from createBalanceTask
  instead of resolving via findServerIDByAddress, eliminating the
  fragile address→ID lookup for adjustment tracking
- Fix bestScore sentinel: use math.Inf(-1) instead of -1.0 so disks
  with negative scores (high pending load, same rack/DC) are still
  selected as the best available destination
- Add TestDetection_ExhaustedServerFallsThrough covering the scenario
  where the top server's volumes are all blocked by pre-existing tasks

* test: fix computeEffectiveCounts and add len guard in no-duplicate test

- computeEffectiveCounts now takes a servers slice to seed counts for all
  known servers (including empty ones) and uses an address→ID map from
  the topology spec instead of scanning metrics, so destination servers
  with zero initial volumes are tracked correctly
- TestDetection_NoDuplicateVolumesAcrossIterations now asserts len > 1
  before checking duplicates, so the test actually fails if Detection
  regresses to returning a single task

* fix: remove redundant HasAnyTask check in createBalanceTask

The HasAnyTask check in createBalanceTask duplicated the same check
already performed in detectForDiskType's volume selection loop.
Since detection runs single-threaded (MaxDetectionConcurrency: 1),
no race can occur between the two points.

* fix: consistent hasMore pattern and remove double-counted LoadCount in scoring

- Adopt vacuum_handler's hasMore pattern: over-fetch by 1, check
  len > maxResults, and truncate — consistent truncation semantics
- Remove direct LoadCount penalty in calculateBalanceScore since
  LoadCount is already factored into effectiveVolumeCount for
  utilization scoring; bump utilization weight from 40 to 50 to
  compensate for the removed 10-point load penalty

* fix: handle zero maxResults as no-cap, emit trace after trim, seed empty servers

- When MaxResults is 0 (omitted), treat as no explicit cap instead of
  defaulting to 1; only apply the +1 over-fetch probe when caller
  supplies a positive limit
- Move decision trace emission after hasMore/trim so the trace
  accurately reflects the returned proposals
- Seed serverVolumeCounts from ActiveTopology so servers that have a
  matching disk type but zero volumes are included in the imbalance
  calculation and MinServerCount check

* fix: nil-guard clusterInfo, uncap legacy DetectionFunc, deterministic disk type order

- Add early nil guard for clusterInfo in Detection to prevent panics
  in downstream helpers (detectForDiskType, createBalanceTask)
- Change register.go DetectionFunc wrapper from maxResults=1 to 0
  (no cap) so the legacy code path returns all detected tasks
- Sort disk type keys before iteration so results are deterministic
  when maxResults spans multiple disk types (HDD/SSD)

* fix: don't over-fetch in stateful detection to avoid orphaned pending tasks

Detection registers planned moves in ActiveTopology via AddPendingTask,
so requesting maxResults+1 would create an extra pending task that gets
discarded during trim. Use len(results) >= maxResults as the hasMore
signal instead, which is correct since Detection already caps internally.

* fix: return explicit truncated flag from Detection instead of approximating

Detection now returns (results, truncated, error) where truncated is true
only when the loop stopped because it hit maxResults, not when it ran out
of work naturally. This eliminates false hasMore signals when detection
happens to produce exactly maxResults results by resolving the imbalance.

* cleanup: simplify detection logic and remove redundancies

- Remove redundant clusterInfo nil check in detectForDiskType since
  Detection already guards against nil clusterInfo
- Remove adjustments loop for destination servers not in
  serverVolumeCounts — topology seeding ensures all servers with
  matching disk type are already present
- Merge two-loop min/max calculation into a single loop: min across
  all servers, max only among non-exhausted servers
- Replace magic number 100 with len(metrics) for minC initialization
  in convergence test

* fix: accurate truncation flag, deterministic server order, indexed volume lookup

- Track balanced flag to distinguish "hit maxResults cap" from "cluster
  balanced at exactly maxResults" — truncated is only true when there's
  genuinely more work to do
- Sort servers for deterministic iteration and tie-breaking when
  multiple servers have equal volume counts
- Pre-index volumes by server with per-server cursors to avoid
  O(maxResults * volumes) rescanning on each iteration
- Add truncation flag assertions to RespectsMaxResults test: true when
  capped, false when detection finishes naturally

* fix: seed trace server counts from ActiveTopology to match detection logic

The decision trace was building serverVolumeCounts only from metrics,
missing zero-volume servers seeded from ActiveTopology by Detection.
This could cause the trace to report wrong server counts, incorrect
imbalance ratios, or spurious "too few servers" messages. Pass
activeTopology into the trace function and seed server counts the
same way Detection does.

* fix: don't exhaust server on per-volume planning failure, sort volumes by ID

- When createBalanceTask returns nil, continue to the next volume on
  the same server instead of marking the entire server as exhausted.
  The failure may be volume-specific (not found in topology, pending
  task registration failed) and other volumes on the server may still
  be viable candidates.
- Sort each server's volume slice by VolumeID after pre-indexing so
  volume selection is fully deterministic regardless of input order.

* fix: use require instead of assert to prevent nil dereference panic in CORS test

The test used assert.NoError (non-fatal) for GetBucketCors, then
immediately accessed getResp.CORSRules. When the API returns an error,
getResp is nil causing a panic. Switch to require.NoError/NotNil/Len
so the test stops before dereferencing a nil response.

* fix: deterministic disk tie-breaking and stronger pre-existing task test

- Sort available disks by NodeID then DiskID before scoring so
  destination selection is deterministic when two disks score equally
- Add task count bounds assertion to SkipsPreExistingPendingTasks test:
  with 15 of 20 volumes already having pending tasks, at most 5 new
  tasks should be created and at least 1 (imbalance still exists)

* fix: seed adjustments from existing pending/assigned tasks to prevent over-scheduling

Detection now calls ActiveTopology.GetTaskServerAdjustments() to
initialize the adjustments map with source/destination deltas from
existing pending and assigned balance tasks. This ensures
effectiveCounts reflects in-flight moves, preventing the algorithm
from planning additional moves in the same direction when prior
moves already address the imbalance.

Added GetTaskServerAdjustments(taskType) to ActiveTopology which
iterates pending and assigned tasks, decrementing source servers
and incrementing destination servers for the given task type.
2026-03-08 21:34:03 -07:00
Chris Lu
540fc97e00 s3/iam: reuse one request id per request (#8538)
* request_id: add shared request middleware

* s3err: preserve request ids in responses and logs

* iam: reuse request ids in XML responses

* sts: reuse request ids in XML responses

* request_id: drop legacy header fallback

* request_id: use AWS-style request id format

* iam: fix AWS-compatible XML format for ErrorResponse and field ordering

- ErrorResponse uses bare <RequestId> at root level instead of
  <ResponseMetadata> wrapper, matching the AWS IAM error response spec
- Move CommonResponse to last field in success response structs so
  <ResponseMetadata> serializes after result elements
- Add randomness to request ID generation to avoid collisions
- Add tests for XML ordering and ErrorResponse format

* iam: remove duplicate error_response_test.go

Test is already covered by responses_test.go.

* address PR review comments

- Guard against typed nil pointers in SetResponseRequestID before
  interface assertion (CodeRabbit)
- Use regexp instead of strings.Index in test helpers for extracting
  request IDs (Gemini)

* request_id: prevent spoofing, fix nil-error branch, thread reqID to error writers

- Ensure() now always generates a server-side ID, ignoring client-sent
  x-amz-request-id headers to prevent request ID spoofing. Uses a
  private context key (contextKey{}) instead of the header string.
- writeIamErrorResponse in both iamapi and embedded IAM now accepts
  reqID as a parameter instead of calling Ensure() internally, ensuring
  a single request ID per request lifecycle.
- The nil-iamError branch in writeIamErrorResponse now writes a 500
  Internal Server Error response instead of returning silently.
- Updated tests to set request IDs via context (not headers) and added
  tests for spoofing prevention and context reuse.

* sts: add request-id consistency assertions to ActionInBody tests

* test: update admin test to expect server-generated request IDs

The test previously sent a client x-amz-request-id header and expected
it echoed back. Since Ensure() now ignores client headers to prevent
spoofing, update the test to verify the server returns a non-empty
server-generated request ID instead.

* iam: add generic WithRequestID helper alongside reflection-based fallback

Add WithRequestID[T] that uses generics to take the address of a value
type, satisfying the pointer receiver on SetRequestId without reflection.

The existing SetResponseRequestID is kept for the two call sites that
operate on interface{} (from large action switches where the concrete
type varies at runtime). Generics cannot replace reflection there since
Go cannot infer type parameters from interface{}.

* Remove reflection and generics from request ID setting

Call SetRequestId directly on concrete response types in each switch
branch before boxing into interface{}, eliminating the need for
WithRequestID (generics) and SetResponseRequestID (reflection).

* iam: return pointer responses in action dispatch

* Fix IAM error handling consistency and ensure request IDs on all responses

- UpdateUser/CreatePolicy error branches: use writeIamErrorResponse instead
  of s3err.WriteErrorResponse to preserve IAM formatting and request ID
- ExecuteAction: accept reqID parameter and generate one if empty, ensuring
  every response carries a RequestId regardless of caller

* Clean up inline policies on DeleteUser and UpdateUser rename

DeleteUser: remove InlinePolicies[userName] from policy storage before
removing the identity, so policies are not orphaned.

UpdateUser: move InlinePolicies[userName] to InlinePolicies[newUserName]
when renaming, so GetUserPolicy/DeleteUserPolicy work under the new name.

Both operations persist the updated policies and return an error if
the storage write fails, preventing partial state.
2026-03-06 15:22:39 -08:00
Chris Lu
df5e8210df Implement IAM managed policy operations (#8507)
* feat: Implement IAM managed policy operations (GetPolicy, ListPolicies, DeletePolicy, AttachUserPolicy, DetachUserPolicy)

- Add response type aliases in iamapi_response.go for managed policy operations
- Implement 6 handler methods in iamapi_management_handlers.go:
  - GetPolicy: Lookup managed policy by ARN
  - DeletePolicy: Remove managed policy
  - ListPolicies: List all managed policies
  - AttachUserPolicy: Attach managed policy to user, aggregating inline + managed actions
  - DetachUserPolicy: Detach managed policy from user
  - ListAttachedUserPolicies: List user's attached managed policies
- Add computeAllActionsForUser() to aggregate actions from both inline and managed policies
- Wire 6 new DoActions switch cases for policy operations
- Add comprehensive tests for all new handlers
- Fixes #8506

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback for IAM managed policy operations

- Add parsePolicyArn() helper with proper ARN prefix validation, replacing
  fragile strings.Split parsing in GetPolicy, DeletePolicy, AttachUserPolicy,
  and DetachUserPolicy
- DeletePolicy now detaches the policy from all users and recomputes their
  aggregated actions, preventing stale permissions after deletion
- Set changed=true for DeletePolicy DoActions case so identity updates persist
- Make PolicyId consistent: CreatePolicy now uses Hash(&policyName) matching
  GetPolicy and ListPolicies
- Remove redundant nil map checks (Go handles nil map lookups safely)
- DRY up action deduplication in computeAllActionsForUser with addUniqueActions
  closure
- Add tests for invalid/empty ARN rejection and DeletePolicy identity cleanup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add integration tests for managed policy lifecycle (#8506)

Add two integration tests covering the user-reported use case where
managed policy operations returned 500 errors:

- TestS3IAMManagedPolicyLifecycle: end-to-end workflow matching the
  issue report — CreatePolicy, ListPolicies, GetPolicy, AttachUserPolicy,
  ListAttachedUserPolicies, idempotent re-attach, DeletePolicy while
  attached (expects DeleteConflict), DetachUserPolicy, DeletePolicy,
  and verification that deleted policy is gone

- TestS3IAMManagedPolicyErrorCases: covers error paths — nonexistent
  policy/user for GetPolicy, DeletePolicy, AttachUserPolicy,
  DetachUserPolicy, and ListAttachedUserPolicies

Also fixes DeletePolicy to reject deletion when policy is still attached
to a user (AWS-compatible DeleteConflictException), and adds the 409
status code mapping for DeleteConflictException in the error response
handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: nil map panic in CreatePolicy, add PolicyId test assertions

- Initialize policies.Policies map in CreatePolicy if nil (prevents panic
  when no policies exist yet); also handle filer_pb.ErrNotFound like other
  callers
- Add PolicyId assertions in TestGetPolicy and TestListPolicies to lock in
  the consistent Hash(&policyName) behavior
- Remove redundant time.Sleep calls from new integration tests (startMiniCluster
  already blocks on waitForS3Ready)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: PutUserPolicy and DeleteUserPolicy now preserve managed policy actions

PutUserPolicy and DeleteUserPolicy were calling computeAggregatedActionsForUser
(inline-only), overwriting ident.Actions and dropping managed policy actions.
Both now call computeAllActionsForUser which unions inline + managed actions.

Add TestManagedPolicyActionsPreservedAcrossInlineMutations regression test:
attaches a managed policy, adds an inline policy (verifies both actions present),
deletes the inline policy, then asserts managed policy actions still persist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: PutUserPolicy verifies user exists before persisting inline policy

Previously the inline policy was written to storage before checking if the
target user exists in s3cfg.Identities, leaving orphaned policy data when
the user was absent. Now validates the user first, returning
NoSuchEntityException immediately if not found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent stale/lost actions on computeAllActionsForUser failure

- PutUserPolicy: on recomputation failure, preserve existing ident.Actions
  instead of falling back to only the current inline policy's actions
- DeleteUserPolicy: on recomputation failure, preserve existing ident.Actions
  instead of assigning nil (which wiped all permissions)
- AttachUserPolicy: roll back ident.PolicyNames and return error if
  action recomputation fails, keeping identity consistent
- DetachUserPolicy: roll back ident.PolicyNames and return error if
  GetPolicies or action recomputation fails
- Add doc comment on newTestIamApiServer noting it only sets s3ApiConfig

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 14:18:07 -08:00
Chris Lu
10a30a83e1 s3api: add GetObjectAttributes API support (#8504)
* s3api: add error code and header constants for GetObjectAttributes

Add ErrInvalidAttributeName error code and header constants
(X-Amz-Object-Attributes, X-Amz-Max-Parts, X-Amz-Part-Number-Marker,
X-Amz-Delete-Marker) needed by the S3 GetObjectAttributes API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: implement GetObjectAttributes handler

Add GetObjectAttributesHandler that returns selected object metadata
(ETag, Checksum, StorageClass, ObjectSize, ObjectParts) without
returning the object body. Follows the same versioning and conditional
header patterns as HeadObjectHandler.

The handler parses the X-Amz-Object-Attributes header to determine
which attributes to include in the XML response, and supports
ObjectParts pagination via X-Amz-Max-Parts and X-Amz-Part-Number-Marker.

Ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObjectAttributes.html

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: register GetObjectAttributes route

Register the GET /{object}?attributes route for the
GetObjectAttributes API, placed before other object query
routes to ensure proper matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add integration tests for GetObjectAttributes

Test coverage:
- Basic: simple object with all attribute types
- MultipartObject: multipart upload with parts pagination
- SelectiveAttributes: requesting only specific attributes
- InvalidAttribute: server rejects invalid attribute names
- NonExistentObject: returns NoSuchKey for missing objects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add versioned object test for GetObjectAttributes

Test puts two versions of the same object and verifies that:
- GetObjectAttributes returns the latest version by default
- GetObjectAttributes with versionId returns the specific version
- ObjectSize and VersionId are correct for each version

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: fix combined conditional header evaluation per RFC 7232

Per RFC 7232:
- Section 3.4: If-Unmodified-Since MUST be ignored when If-Match is
  present (If-Match is the more accurate replacement)
- Section 3.3: If-Modified-Since MUST be ignored when If-None-Match is
  present (If-None-Match is the more accurate replacement)

Previously, all four conditional headers were evaluated independently.
This caused incorrect 412 responses when If-Match succeeded but
If-Unmodified-Since failed (should return 200 per AWS S3 behavior).

Fix applied to both validateConditionalHeadersForReads (GET/HEAD) and
validateConditionalHeaders (PUT) paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add conditional header combination tests for GetObjectAttributes

Test the RFC 7232 combined conditional header semantics:
- If-Match=true + If-Unmodified-Since=false => 200 (If-Unmodified-Since ignored)
- If-None-Match=false + If-Modified-Since=true => 304 (If-Modified-Since ignored)
- If-None-Match=true + If-Modified-Since=false => 200 (If-Modified-Since ignored)
- If-Match=true + If-Unmodified-Since=true => 200
- If-Match=false => 412 regardless

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: document Checksum attribute as not yet populated

Checksum is accepted in validation (so clients requesting it don't get
a 400 error, matching AWS behavior for objects without checksums) but
SeaweedFS does not yet store S3 checksums. Add a comment explaining
this and noting where to populate it when checksum storage is added.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add s3:GetObjectAttributes IAM action for ?attributes query

Previously, GET /{object}?attributes resolved to s3:GetObject via the
fallback path since resolveFromQueryParameters had no case for the
"attributes" query parameter.

Add S3_ACTION_GET_OBJECT_ATTRIBUTES constant ("s3:GetObjectAttributes")
and a branch in resolveFromQueryParameters to return it for GET requests
with the "attributes" query parameter, so IAM policies can distinguish
GetObjectAttributes from GetObject.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: evaluate conditional headers after version resolution

Move conditional header evaluation (If-Match, If-None-Match, etc.) to
after the version resolution step in GetObjectAttributesHandler. This
ensures that when a specific versionId is requested, conditions are
checked against the correct version entry rather than always against
the latest version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: use bounded HTTP client in GetObjectAttributes tests

Replace http.DefaultClient with a timeout-aware http.Client (10s) in
the signedGetObjectAttributes helper and testGetObjectAttributesInvalid
to prevent tests from hanging indefinitely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: check attributes query before versionId in action resolver

Move the GetObjectAttributes action check before the versionId check
in resolveFromQueryParameters. This fixes GET /bucket/key?attributes&versionId=xyz
being incorrectly classified as s3:GetObjectVersion instead of
s3:GetObjectAttributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add tests for versioned conditional headers and action resolver

Add integration test that verifies conditional headers (If-Match,
If-None-Match) are evaluated against the requested version entry, not
the latest version. This covers the fix in 55c409dec.

Add unit test for ResolveS3Action verifying that the attributes query
parameter takes precedence over versionId, so GET ?attributes&versionId
resolves to s3:GetObjectAttributes. This covers the fix in b92c61c95.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: guard negative chunk indices and rename PartsCount field

Add bounds checks for b.StartChunk >= 0 and b.EndChunk >= 0 in
buildObjectAttributesParts to prevent panics from corrupted metadata
with negative index values.

Rename ObjectAttributesParts.PartsCount to TotalPartsCount to match
the AWS SDK v2 Go field naming convention, while preserving the XML
element name "PartsCount" via the struct tag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: reject malformed max-parts and part-number-marker headers

Return ErrInvalidMaxParts and ErrInvalidPartNumberMarker when the
X-Amz-Max-Parts or X-Amz-Part-Number-Marker headers contain
non-integer or negative values, matching ListObjectPartsHandler
behavior. Previously these were silently ignored with defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 12:52:09 -08:00
Chris Lu
18ccc9b773 Plugin scheduler: sequential iterations with max runtime (#8496)
* pb: add job type max runtime setting

* plugin: default job type max runtime

* plugin: redesign scheduler loop

* admin ui: update scheduler settings

* plugin: fix scheduler loop state name

* plugin scheduler: restore backlog skip

* plugin scheduler: drop legacy detection helper

* admin api: require scheduler config body

* admin ui: preserve detection interval on save

* plugin scheduler: use job context and drain cancels

* plugin scheduler: respect detection intervals

* plugin scheduler: gate runs and drain queue

* ec test: reuse req/resp vars

* ec test: add scheduler debug logs

* Adjust scheduler idle sleep and initial run delay

* Clear pending job queue before scheduler runs

* Log next detection time in EC integration test

* Improve plugin scheduler debug logging in EC test

* Expose scheduler next detection time

* Log scheduler next detection time in EC test

* Wake scheduler on config or worker updates

* Expose scheduler sleep interval in UI

* Fix scheduler sleep save value selection

* Set scheduler idle sleep default to 613s

* Show scheduler next run time in plugin UI

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-03 23:09:49 -08:00
Chris Lu
fb944f0071 test: add Polaris S3 tables integration tests (#8489)
* test: add polaris integration test harness

* test: add polaris integration coverage

* ci: run polaris s3 tables tests

* test: harden polaris harness

* test: DRY polaris integration tests

* ci: pre-pull Polaris image

* test: extend Polaris pull timeout

* test: refine polaris credentials selection

* test: keep Polaris tables inside allowed location

* test: use fresh context for polaris cleanup

* test: prefer specific Polaris storage credential

* test: tolerate Polaris credential variants

* test: request Polaris vended credentials

* test: load Polaris table credentials

* test: allow polaris vended access via bucket policy

* test: align Polaris object keys with table location

* test: rename Polaris vended role references

* test: simplify Polaris vended credential extraction

* test: marshal Polaris bucket policy
2026-03-02 12:10:03 -08:00
Chris Lu
340339f678 Add Apache Polaris integration tests (#8478)
* test: add polaris integration test harness

* test: add polaris integration coverage

* ci: run polaris s3 tables tests

* test: harden polaris harness

* test: DRY polaris integration tests

* ci: pre-pull Polaris image

* test: extend Polaris pull timeout

* test: refine polaris credentials selection

* test: keep Polaris tables inside allowed location

* test: use fresh context for polaris cleanup

* test: prefer specific Polaris storage credential

* test: tolerate Polaris credential variants

* test: request Polaris vended credentials

* test: load Polaris table credentials

* test: allow polaris vended access via bucket policy

* test: align Polaris object keys with table location
2026-03-01 23:08:50 -08:00
dependabot[bot]
2fc47a48ec build(deps): bump com.fasterxml.jackson.core:jackson-core from 2.18.2 to 2.18.6 in /test/java/spark (#8476) 2026-03-01 13:14:04 -08:00
Chris Lu
c5d5b517f6 Add lakekeeper table bucket integration test (#8470)
* Add lakekeeper table bucket integration test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Convert lakekeeper table bucket test to Go

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add multipart upload to lakekeeper table bucket test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove lakekeeper test skips

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Convert lakekeeper repro to Go

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-02-28 12:06:41 -08:00