Commit Graph

650 Commits

Author SHA1 Message Date
Chris Lu
75a6a34528 dlm: resilient distributed locks via consistent hashing + backup replication (#8860)
* dlm: replace modulo hashing with consistent hash ring

Introduce HashRing with virtual nodes (CRC32-based consistent hashing)
to replace the modulo-based hashKeyToServer. When a filer node is
removed, only keys that hashed to that node are remapped to the next
server on the ring, leaving all other mappings stable. This is the
foundation for backup replication — the successor on the ring is
always the natural takeover node.

* dlm: add Generation and IsBackup fields to Lock

Lock now carries IsBackup (whether this node holds the lock as a backup
replica) and Generation (a monotonic fencing token that increments on
each fresh acquisition, stays the same on renewal). Add helper methods:
AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock.

* dlm: add ReplicateLock RPC and generation/is_backup proto fields

Add generation field to LockResponse for fencing tokens.
Add generation and is_backup fields to Lock message.
Add ReplicateLock RPC for primary-to-backup lock replication.
Add ReplicateLockRequest/ReplicateLockResponse messages.

* dlm: add async backup replication to DistributedLockManager

Route lock/unlock via consistent hash ring's GetPrimaryAndBackup().
After a successful lock or unlock on the primary, asynchronously
replicate the operation to the backup server via ReplicateFunc
callback. Single-server deployments skip replication.

* dlm: add ReplicateLock handler and backup-aware topology changes

Add ReplicateLock gRPC handler for primary-to-backup replication.
Revise OnDlmChangeSnapshot to handle three cases on topology change:
- Promote backup locks when this node becomes primary
- Demote primary locks when this node becomes backup
- Transfer locks when this node is neither primary nor backup
Wire up SetupDlmReplication during filer server initialization.

* dlm: expose generation fencing token in lock client

LiveLock now captures the generation from LockResponse and exposes it
via Generation() method. Consumers can use this as a fencing token to
detect stale lock holders.

* dlm: update empty folder cleaner to use consistent hash ring

Replace local modulo-based hashKeyToServer with LockRing.GetPrimary()
which uses the shared consistent hash ring for folder ownership.

* dlm: add unit tests for consistent hash ring

Test basic operations, consistency on server removal (only keys from
removed server move), backup-is-successor property (backup becomes
new primary when primary is removed), and key distribution balance.

* dlm: add integration tests for lock replication failure scenarios

Test cases:
- Primary crash with backup promotion (backup has valid token)
- Backup crash with primary continuing
- Both primary and backup crash (lock lost, re-acquirable)
- Rolling restart across all nodes
- Generation fencing token increments on new acquisition
- Replication failure (primary still works independently)
- Unlock replicates deletion to backup
- Lock survives server addition (topology change)
- Consistent hashing minimal disruption (only removed server's keys move)

* dlm: address PR review findings

1. Causal replication ordering: Add per-lock sequence number (Seq) that
   increments on every mutation. Backup rejects incoming mutations with
   seq <= current seq, preventing stale async replications from
   overwriting newer state. Unlock replication also carries seq and is
   rejected if stale.

2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to
   the new primary first and only demotes to backup after a successful
   TransferLocks RPC. If the transfer fails, the lock stays as primary
   on this node.

3. SetSnapshot candidateServers leak: Replace the candidateServers map
   entirely instead of appending, so removed servers don't linger.

4. TransferLocks preserves Generation and Seq: InsertLock now accepts
   generation and seq parameters. After accepting a transferred lock,
   the receiving node re-replicates to its backup.

5. Rolling restart test: Add re-replication step after promotion and
   assert survivedCount > 0. Add TestDLM_StaleReplicationRejected.

6. Mixed-version upgrade note: Add comment on HashRing documenting that
   all filer nodes must be upgraded together.

* dlm: serve renewals locally during transfer window on node join

When a new node joins and steals hash ranges from surviving nodes,
there's a window between ring update and lock transfer where the
client gets redirected to a node that doesn't have the lock yet.

Fix: if the ring says primary != self but we still hold the lock
locally (non-backup, matching token), serve the renewal/unlock here
rather than redirecting. The lock will be transferred by
OnDlmChangeSnapshot, and subsequent requests will go to the new
primary once the transfer completes.

Add tests:
- TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption
  when a node drops and a new one joins (14/100 surviving-node locks
  disrupted, all handled by transfer logic)
- TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on
  old primary during the transfer window

* dlm: master-managed lock ring with stabilization batching

The master now owns the lock ring membership. Instead of filers
independently reacting to individual ClusterNodeUpdate add/remove
events, the master:

1. Tracks filer membership in LockRingManager
2. Batches rapid changes with a 1-second stabilization timer
   (e.g., a node drop + join within 1 second → single ring update)
3. Broadcasts the complete ring snapshot atomically via the new
   LockRingUpdate message in KeepConnectedResponse

Filers receive the ring as a complete snapshot and apply it via
SetSnapshot, ensuring all filers converge to the same ring state
without intermediate churn.

This eliminates the double-churn problem where a rapid drop+join
would fire two separate ring mutations, each triggering lock
transfers and disrupting ownership on surviving nodes.

* dlm: track ring version, reject stale updates, remove dead code

SetSnapshot now takes a version parameter from the master. Stale
updates (version < current) are rejected, preventing reordered
messages from overwriting a newer ring state. Version 0 is always
accepted for bootstrap.

Remove AddServer/RemoveServer from LockRing — the ring is now
exclusively managed by the master via SetSnapshot. Remove the
candidateServers map that was only used by those methods.

* dlm: fix SelectLocks data race, advance generation on backup insert

- SelectLocks: change RLock to Lock since the function deletes map
  entries, which is a write operation and causes a data race under RLock.
- InsertBackupLock: advance nextGeneration to at least the incoming
  generation so that after failover promotion, new lock acquisitions
  get a generation strictly greater than any replicated lock.
- Bump replication failure log from V(1) to Warningf for production
  visibility.

* dlm: fix SetSnapshot race, test reliability, timer edge cases

- SetSnapshot: hold LockRing lock through both version update and
  Ring.SetServers() so they're atomic. Prevents a concurrent caller
  from seeing the new version but applying stale servers.
- Transfer window test: search for a key that actually moves primary
  when filer4 joins, instead of relying on a fixed key that may not.
- renewLock redirect: pass the existing token to the new primary
  instead of empty string, so redirected renewals work correctly.
- scheduleBroadcast: check timer.Stop() return value. If the timer
  already fired, the callback picks up latest state.
- FlushPending: only broadcast if timer.Stop() returns true (timer
  was still pending). If false, the callback is already running.
- Fix test comment: "idempotent" → "accepted, state-changing".

* dlm: use wall-clock nanoseconds for lock ring version

The lock ring version was an in-memory counter that reset to 0 on
master restart. A filer that had seen version 5 would reject version 1
from the restarted master.

Fix: use time.Now().UnixNano() as the version. This survives master
restarts without persistence — the restarted master produces a
version greater than any pre-restart value.

* dlm: treat expired lock owners as missing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: reject stale lock transfers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: order replication by generation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: bootstrap lock ring on reconnect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 23:29:56 -07:00
Chris Lu
ced2236cc6 Adjust rename events metadata format (#8854)
* rename metadata events

* fix subscription filter to use NewEntry.Name for rename path matching

The server-side subscription filter constructed the new path using
OldEntry.Name instead of NewEntry.Name when checking if a rename
event's destination matches the subscriber's path prefix. This could
cause events to be incorrectly filtered when a rename changes the
file name.

* fix bucket events to handle rename of bucket directories

onBucketEvents only checked IsCreate and IsDelete. A bucket directory
rename via AtomicRenameEntry now emits a single rename event (both
OldEntry and NewEntry non-nil), which matched neither check. Handle
IsRename by deleting the old bucket and creating the new one.

* fix replicator to handle rename events across directory boundaries

Two issues fixed:

1. The replicator filtered events by checking if the key (old path)
   was under the source directory. Rename events now use the old path
   as key, so renames from outside into the watched directory were
   silently dropped. Now both old and new paths are checked, and
   cross-boundary renames are converted to create or delete.

2. NewParentPath was passed to the sink without remapping to the
   sink's target directory structure, causing the sink to write
   entries at the wrong location. Now NewParentPath is remapped
   alongside the key.

* fix filer sync to handle rename events crossing directory boundaries

The early directory-prefix filter only checked resp.Directory (old
parent). Rename events now carry the old parent as Directory, so
renames from outside the source path into it were dropped before
reaching the existing cross-boundary handling logic. Check both old
and new directories against sourcePath and excludePaths so the
downstream old-key/new-key logic can properly convert these to
create or delete operations.

* fix metadata event path matching

* fix metadata event consumers for rename targets

* Fix replication rename target keys

Logical rename events now reach replication sinks with distinct source and target paths.\n\nHandle non-filer sinks as delete-plus-create on the translated target key, and make the rename fallback path create at the translated target key too.\n\nAdd focused tests covering non-filer renames, filer rename updates, and the fallback path.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix filer sync rename path scoping

Use directory-boundary matching instead of raw prefix checks when classifying source and target paths during filer sync.\n\nAlso apply excludePaths per side so renames across excluded boundaries downgrade cleanly to create/delete instead of being misclassified as in-scope updates.\n\nAdd focused tests for boundary matching and rename classification.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix replicator directory boundary checks

Use directory-boundary matching instead of raw prefix checks when deciding whether a source or target path is inside the watched tree or an excluded subtree.\n\nThis prevents sibling paths such as /foo and /foobar from being misclassified during rename handling, and preserves the earlier rename-target-key fix.\n\nAdd focused tests for boundary matching and rename classification across sibling/excluded directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix etc-remote rename-out handling

Use boundary-safe source/target directory membership when classifying metadata events under DirectoryEtcRemote.\n\nThis prevents rename-out events from being processed as config updates, while still treating them as removals where appropriate for the remote sync and remote gateway command paths.\n\nAdd focused tests for update/removal classification and sibling-prefix handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Defer rename events until commit

Queue logical rename metadata events during atomic and streaming renames and publish them only after the transaction commits successfully.\n\nThis prevents subscribers from seeing delete or logical rename events for operations that later fail during delete or commit.\n\nAlso serialize notification.Queue swaps in rename tests and add failure-path coverage.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Skip descendant rename target lookups

Avoid redundant target lookups during recursive directory renames once the destination subtree is known absent.\n\nThe recursive move path now inserts known-absent descendants directly, and the test harness exercises prefixed directory listing so the optimization is covered by a directory rename regression test.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Tighten rename review tests

Return filer_pb.ErrNotFound from the bucket tracking store test stub so it follows the FilerStore contract, and add a webhook filter case for same-name renames across parent directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix HardLinkId format verb in InsertEntryKnownAbsent error

HardLinkId is a byte slice. %d prints each byte as a decimal number
which is not useful for an identifier. Use %x to match the log line
two lines above.

* only skip descendant target lookup when source and dest use same store

moveFolderSubEntries unconditionally passed skipTargetLookup=true for
every descendant. This is safe when all paths resolve to the same
underlying store, but with path-specific store configuration a child's
destination may map to a different backend that already holds an entry
at that path. Use FilerStoreWrapper.SameActualStore to check per-child
and fall back to the full CreateEntry path when stores differ.

* add nil and create edge-case tests for metadata event scope helpers

* extract pathIsEqualOrUnder into util.IsEqualOrUnder

Identical implementations existed in both replication/replicator.go and
command/filer_sync.go. Move to util.IsEqualOrUnder (alongside the
existing FullPath.IsUnder) and remove the duplicates.

* use MetadataEventTargetDirectory for new-side directory in filer sync

The new-side directory checks and sourceNewKey computation used
message.NewParentPath directly. If NewParentPath were empty (legacy
events, older filer versions during rolling upgrades), sourceNewKey
would be wrong (/filename instead of /dir/filename) and the
UpdateEntry parent path rewrite would panic on slice bounds.

Derive targetDir once from MetadataEventTargetDirectory, which falls
back to resp.Directory when NewParentPath is empty, and use it
consistently for all new-side checks and the sink parent path.
2026-03-30 18:25:11 -07:00
Chris Lu
2eaf98a7a2 Use Unix sockets for gRPC in mini mode (#8856)
* Use Unix sockets for gRPC between co-located services in mini mode

In `weed mini`, all services run in one process. Previously, inter-service
gRPC traffic (volume↔master, filer↔master, S3↔filer, worker↔admin, etc.)
went through TCP loopback. This adds a gRPC Unix socket registry in the pb
package: mini mode registers a socket path per gRPC port at startup, each
gRPC server additionally listens on its socket, and GrpcDial transparently
routes to the socket via WithContextDialer when a match is found.

Standalone commands (weed master, weed filer, etc.) are unaffected since
no sockets are registered. TCP listeners are kept for external clients.

* Handle Serve error and clean up socket file in ServeGrpcOnLocalSocket

Log non-expected errors from grpcServer.Serve (ignoring
grpc.ErrServerStopped) and always remove the Unix socket file
when Serve returns, ensuring cleanup on Stop/GracefulStop.
2026-03-30 18:18:52 -07:00
Chris Lu
c2c58419b8 filer.sync: send log file chunk fids to clients for direct volume server reads (#8792)
* filer.sync: send log file chunk fids to clients for direct volume server reads

Instead of the server reading persisted log files from volume servers, parsing
entries, and streaming them over gRPC (serial bottleneck), clients that opt in
via client_supports_metadata_chunks receive log file chunk references (fids)
and read directly from volume servers in parallel.

New proto messages:
- LogFileChunkRef: chunk fids + timestamp + filer ID for one log file
- SubscribeMetadataRequest.client_supports_metadata_chunks: client opt-in
- SubscribeMetadataResponse.log_file_refs: server sends refs during backlog

Server changes:
- CollectLogFileRefs: lists log files and returns chunk refs without any
  volume server I/O (metadata-only operation)
- SubscribeMetadata/SubscribeLocalMetadata: when client opts in, sends refs
  during persisted log phase, then falls back to normal streaming for
  in-memory events

Client changes:
- ReadLogFileRefs: reads log files from volume servers, parses entries,
  filters by path prefix, invokes processEventFn
- MetadataFollowOption.LogFileReaderFn: factory for chunk readers,
  enables metadata chunks when non-nil
- Both filer_pb_tail.go and meta_aggregator.go recv loops accumulate
  refs then process them at the disk→memory transition

Backward compatible: old clients don't set the flag, get existing behavior.

Ref: #8771

* filer.sync: merge entries across filers in timestamp order on client side

ReadLogFileRefs now groups refs by filer ID and merges entries from
multiple filers using a min-heap priority queue — the same algorithm
the server uses in OrderedLogVisitor + LogEntryItemPriorityQueue.

This ensures events are processed in correct timestamp order even when
log files from different filers have interleaved timestamps. Single-filer
case takes the fast path (no heap allocation).

* filer.sync: integration tests for direct-read metadata chunks

Three test categories:

1. Merge correctness (TestReadLogFileRefsMergeOrder):
   Verifies entries from 3 filers are delivered in strict timestamp order,
   matching the server-side OrderedLogVisitor guarantee.

2. Path filtering (TestReadLogFileRefsPathFilter):
   Verifies client-side path prefix filtering works correctly.

3. Throughput comparison (TestDirectReadVsServerSideThroughput):
   3 filers × 7 files × 300 events = 6300 events, 2ms per file read:

   server-side:  6300 events  218ms   28,873 events/sec
   direct-read:  6300 events   51ms  123,566 events/sec  (4.3x)
   parallel:     6300 events   17ms  378,628 events/sec  (13.1x)

   Direct-read eliminates gRPC send overhead per event (4.3x).
   Parallel per-filer reading eliminates serial file I/O (13.1x).

* filer.sync: parallel per-filer reads with prefetching in ReadLogFileRefs

ReadLogFileRefs now has two levels of I/O overlap:

1. Cross-filer parallelism: one goroutine per filer reads its files
   concurrently. Entries feed into per-filer channels, merged by the
   main goroutine via min-heap (same ordering guarantee as the server's
   OrderedLogVisitor).

2. Within-filer prefetching: while the current file's entries are being
   consumed by the merge heap, the next file is already being read from
   the volume server in a background goroutine.

Single-filer fast path avoids the heap and channels.

Test results (3 filers × 7 files × 300 events, 2ms per file read):

  server-side sequential:  6300 events  212ms   29,760 events/sec
  parallel + prefetch:     6300 events   36ms  177,443 events/sec
  Speedup: 6.0x

* filer.sync: address all review comments on metadata chunks PR

Critical fixes:
- sendLogFileRefs: bypass pipelinedSender, send directly on gRPC stream.
  Ref messages have TsNs=0 and were being incorrectly batched into the
  Events field by the adaptive batching logic, corrupting ref delivery.
- readLogFileEntries: use io.ReadFull instead of reader.Read to prevent
  partial reads from corrupting size values or protobuf data.
- Error handling: only skip chunk-not-found errors (matching server-side
  isChunkNotFoundError). Other I/O or decode failures are propagated so
  the follower can retry.

High-priority fixes:
- CollectLogFileRefs: remove incorrect +24h padding from stopTime. The
  extra day caused unnecessary log file refs to be collected.
- Path filtering: ReadLogFileRefs now accepts PathFilter struct with
  PathPrefix, AdditionalPathPrefixes, and DirectoriesToWatch. Uses
  util.Join for path construction (avoids "//foo" on root). Excludes
  /.system/log/ internal entries. Matches server-side
  eachEventNotificationFn filtering logic.

Medium-priority fixes:
- CollectLogFileRefs: accept context.Context, propagate to
  ListDirectoryEntries calls for cancellation support.
- NewChunkStreamReaderFromLookup: accept context.Context, propagate to
  doNewChunkStreamReader.

Test fixes:
- Check error returns from ReadLogFileRefs in all test call sites.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-27 11:01:29 -07:00
Chris Lu
d97660d0cd filer.sync: pipelined subscription with adaptive batching for faster catch-up (#8791)
* filer.sync: pipelined subscription with adaptive batching for faster catch-up

The SubscribeMetadata pipeline was fully serial: reading a log entry from a
volume server, unmarshaling, filtering, and calling stream.Send() all happened
one-at-a-time. stream.Send() blocked the entire pipeline until the client
acknowledged each event, limiting throughput to ~80 events/sec regardless of
the -concurrency setting.

Three server-side optimizations that stack:

1. Pipelined sender: decouple stream.Send() from the read loop via a buffered
   channel (1024 messages). A dedicated goroutine handles gRPC delivery while
   the reader continues processing the next events.

2. Adaptive batching: when event timestamps are >2min behind wall clock
   (backlog catch-up), drain multiple events from the channel and pack them
   into a single stream.Send() using a new `repeated events` field on
   SubscribeMetadataResponse. When events are recent (real-time), send
   one-by-one for low latency. Old clients ignore the new field (backward
   compatible).

3. Persisted log readahead: run the OrderedLogVisitor in a background
   goroutine so volume server I/O for the next log file overlaps with event
   processing and gRPC delivery.

4. Event-driven aggregated subscription: replace time.Sleep(1127ms) polling
   in SubscribeMetadata with notification-driven wake-up using the
   MetaLogBuffer subscriber mechanism, reducing real-time latency from
   ~1127ms to sub-millisecond.

Combined, these create a 3-stage pipeline:
  [Volume I/O → readahead buffer] → [Filter → send buffer] → [gRPC Send]

Test results (simulated backlog with 50µs gRPC latency per Send):
  direct (old):        2100 events  2100 sends  168ms   12,512 events/sec
  pipelined+batched:   2100 events    14 sends   40ms   52,856 events/sec
  Speedup: 4.2x single-stream throughput

Ref: #8771

* filer.sync: require client opt-in for batch event delivery

Add ClientSupportsBatching field to SubscribeMetadataRequest. The server
only packs events into the Events batch field when the client explicitly
sets this flag to true. Old clients (Java SDK, third-party) that don't
set the flag get one-event-per-Send, preserving backward compatibility.

All Go callers (FollowMetadata, MetaAggregator) set the flag to true
since their recv loops already unpack batched events.

* filer.sync: clear batch Events field after Send to release references

Prevents the envelope message from holding references to the rest of the
batch after gRPC serialization, allowing the GC to collect them sooner.

* filer.sync: fix Send deadlock, add error propagation test, event-driven local subscribe

- pipelinedSender.Send: add case <-s.done to unblock when sender goroutine
  exits (fixes deadlock when errCh was already consumed by a prior Send).
- pipelinedSender.reportErr: remove for-range drain on sendCh that could
  block indefinitely. Send() now detects exit via s.done instead.
- SubscribeLocalMetadata: replace remaining time.Sleep(1127ms) in the
  gap-detected-no-memory-data path with event-driven listenersCond.Wait(),
  consistent with the rest of the subscription paths.
- Add TestPipelinedSenderErrorPropagation: verifies error surfaces via
  Send and Close when the underlying stream fails.
- Replace goto with labeled break in test simulatePipeline.

* filer.sync: check error returns in test code

- direct_send: check slowStream.Send error return
- pipelined_batched_send: check sender.Close error return
- simulatePipeline: return error from sender.Close, propagate to callers

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-26 23:55:42 -07:00
Chris Lu
ba624f1f34 Rust volume server implementation with CI (#8539)
* Match Go gRPC client transport defaults

* Honor Go HTTP idle timeout

* Honor maintenanceMBps during volume copy

* Honor images.fix.orientation on uploads

* Honor cpuprofile when pprof is disabled

* Match Go memory status payloads

* Propagate request IDs across gRPC calls

* Format pending Rust source updates

* Match Go stats endpoint payloads

* Serve Go volume server UI assets

* Enforce Go HTTP whitelist guards

* Align Rust metrics admin-port test with Go behavior

* Format pending Rust server updates

* Honor access.ui without per-request JWT checks

* Honor keepLocalDatFile in tier upload shortcut

* Honor Go remote volume write mode

* Load tier backends from master config

* Check master config before loading volumes

* Remove vif files on volume destroy

* Delete remote tier data on volume destroy

* Honor vif version defaults and overrides

* Reject mismatched vif bytes offsets

* Load remote-only tiered volumes

* Report Go tail offsets in sync status

* Stream remote dat in incremental copy

* Honor collection vif for EC shard config

* Persist EC expireAtSec in vif metadata

* Stream remote volume reads through HTTP

* Serve HTTP ranges from backend source

* Match Go ReadAllNeedles scan order

* Match Go CopyFile zero-stop metadata

* Delete EC volumes with collection cleanup

* Drop deleted collection metrics

* Match Go tombstone ReadNeedleMeta

* Match Go TTL parsing: all-digit default to minutes, two-pass fit algorithm

* Match Go needle ID/cookie formatting and name size computation

* Match Go image ext checks: webp resize only, no crop; empty healthz body

* Match Go Prometheus metric names and add missing handler counter constants

* Match Go ReplicaPlacement short string parsing with zero-padding

* Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS

* Add walk_ecx_stats for accurate EC volume file counts and size

* Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision

* Match Go needle map: unconditional delete counter, fix redb idx walk offset

* Add CompactMapSegment overflow panic guard matching Go

* Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback

* Match Go 304 Not Modified: return bare status with no headers

* Match Go JWT error message: use "wrong jwt" instead of detailed error

* Match Go read handler bare 400, delete error prefix, download throttle timeout

* Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix

* Match Go heartbeat: keep is_heartbeating on error, add EC shard identification

* Match Go needle ReadBytes V2: tolerate EOF on truncated body

* Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard

* Match Go DeleteCollection: propagate destroy errors

* Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code

* Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding

* Match Go Guard isWriteActive: combine whitelist and signing key check

* Match Go DeleteCollectionMetrics: use partial label matching

* Match Go heartbeat: send state-only delta on volume state changes

* Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking

* Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs

* Add read_ec_shard_needle for full needle reconstruction from local EC shards

* Make heartbeat master config helpers pub for VolumeCopy preallocation

* Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording

* Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error

* Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling

* Add rebuild_ecx_file for EC index reconstruction from data shards

* Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation

* Add EC volume read and delete support in HTTP handlers

* Add per-shard EC mount/unmount, location predicate search, idx directory for EC

* Add CheckVolumeDataIntegrity on volume load matching Go

* Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type

* Match Go heartbeat metric accounting

* Match Go duplicate UUID heartbeat retries

* Delete expired EC volumes during heartbeat

* Match Go volume heartbeat pruning

* Honor master preallocate in volume max

* Report remote storage info in heartbeats

* Emit EC heartbeat deltas on shard changes

* Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space

* Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns

* Match Go VolumeUnmount: idempotent success when volume not found

* Match Go TTL Display: return empty string when unit is Empty

Go checks `t.Unit == Empty` separately and returns "" for TTLs
with nonzero count but Empty unit. Rust only checked is_empty()
(count==0 && unit==0), so count>0 with unit=0 would format as
"5 " instead of "".

* Match Go error behavior for truncated needle data in read_body_v2

Go's readNeedleDataVersion2 returns "index out of range %d" errors
(indices 1-7) when needle body or metadata fields are truncated.
Rust was silently tolerating truncation and returning Ok. Now returns
NeedleError::IndexOutOfRange with the matching index for each field.

* Match Go download throttle: return JSON error instead of plain text

* Match Go crop params: default x1/y1 to 0 when not provided

* Match Go ScrubEcVolume: accumulate total_files from EC shards

* Match Go ScrubVolume: count total_files even on scrub error

* Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif

* Match Go VolumeTailSender: send needle_header on every chunk

* Match Go read_super_block: apply replication override from .vif

* Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption

* Match Go WriteNeedleBlob: dedup check before writing during replication

* handlers: use meta-only reads for HEAD

* handlers: align range parsing and responses with Go

* handlers: align upload parsing with Go

* deps: enable webp support

* Make 5bytes the default feature for idx entry compatibility

* Match Go TTL: preserve original unit when count fits in byte

* Fix EC locate_needle: use get_actual_size for full needle size

* Fix raw body POST: only parse multipart when Content-Type contains form-data

* Match Go ReceiveFile: return protocol errors in response body, not gRPC status

* add docs

* Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating

* Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing

* Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection

* Match Go privateStoreHandler: use proper JSON error for unsupported methods

* Match Go Destroy: add only_empty parameter to reject non-empty volume deletion

* Fix compilation: set_read_only_persist and set_writable return ()

These methods fire-and-forget save_vif internally, so gRPC callers
should not try to chain .map_err() on the unit return type.

* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif

* Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth

The gRPC VolumeDelete handler had a pre-check for only_empty but then
passed false to store.delete_volume(), bypassing the store-level check.
Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same
for defense in depth against TOCTOU races (though the store write lock
makes this unlikely).

* Match Go ProcessRangeRequest: return full content for empty/oversized ranges

Go returns nil from ProcessRangeRequest when ranges are empty or total
range size exceeds content length, causing the caller to serve the full
content as a normal 200 response. Rust was returning an empty 200 body.

* Match Go Query: quote JSON keys in output records

Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}.
Rust was producing invalid JSON with unquoted keys like {name:"Alice"}.

* Match Go VolumeCopy: reject when no suitable disk location exists

Go returns ErrVolumeNoSpaceLeft when no location matches the disk type
and has sufficient space. Rust had an unsafe fallback that silently
picked the first location regardless of type or available space.

* Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete

Go checks v.noWriteOrDelete before proceeding with needle deletion,
returning "volume is read only" if true. Rust was skipping this check.

* Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors

Two fixes: (1) Go prefers HardDriveType disk location for EC volumes,
falling back to first location. Returns "no storage location available"
when no locations exist. (2) Write failures are now response-level
errors (in response body) instead of gRPC status errors, matching Go.

* Match Go CopyFile: sync EC volume journal to disk before copying

Go calls ecVolume.Sync() before copying EC volume files to ensure the
.ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and
call it in the CopyFile EC branch.

* Match Go readSuperBlock: propagate replication parse errors

Go returns an error when parsing the replication string from the .vif
file fails. Rust was silently ignoring the parse failure and using the
super block's replication as-is.

* Match Go TTL expiry: remove append_at_ns > 0 guard

Go computes TTL expiry from AppendAtNs without guarding against zero.
When append_at_ns is 0, the expiry is epoch + TTL which is in the past,
correctly returning NotFound. Rust's extra guard skipped the check,
incorrectly returning success for such needles.

* Match Go delete_collection: skip volumes with compaction in progress

Go checks !v.isCompactionInProgress.Load() before destroying a volume
during collection deletion, skipping compacting volumes. Also changed
destroy errors to log instead of aborting the entire collection delete.

* Match Go MarkReadonly/MarkWritable: always notify master even on local error

Go always notifies the master regardless of whether the local
set_read_only_persist or set_writable step fails. The Rust code was
using `?` which short-circuited on error, skipping the final master
notification. Save the result and defer the `?` until after the
notify call.

* Match Go PostHandler: return 500 for all write errors

Go returns 500 (InternalServerError) for all write failures. Rust was
returning 404 for volume-not-found and 403 for read-only volumes.

* Match Go makeupDiff: validate .cpd compaction revision is old + 1

Go reads the new .cpd file's super block and verifies the compaction
revision is exactly old + 1. Rust only validated the old revision.

* Match Go VolumeStatus: check data backend before returning status

Go checks v.DataBackend != nil before building the status response,
returning an error if missing. Rust was silently returning size 0.

* Match Go PostHandler: always include mime field in upload response JSON

Go always serializes the mime field even when empty ("mime":""). Rust was
omitting it when empty due to Option<String> with skip_serializing_if.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go subtracts EC shard equivalents when computing available volume slots.
Rust was only comparing volume count, potentially over-counting free
slots on locations with many EC shards.

* Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods

Go records the method as INVALID in metrics for unsupported HTTP methods.
Rust was using the actual method name.

* Match Go volume: add commit_compact guard and scrub data size validation

Two fixes: (1) commit_compact now checks/sets is_compacting flag to
prevent concurrent commits, matching Go's CompareAndSwap guard.
(2) scrub now validates total needle sizes against .dat file size.

* Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check

Three fixes: (1) VolumeTailSender now propagates binary search errors
instead of silently falling back to start. (2) VolumeEcShardsInfo
returns entries for all shard slots including unmounted. (3)
VolumeEcShardRead checks .ecx index for deletions instead of .ecj.

* Match Go metrics: add BuildInfo gauge and connection tracking functions

Go exposes a BuildInfo Prometheus metric with version labels, and tracks
open connections via stats.ConnectionOpen/Close. Added both to Rust.

* Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid()

Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles
with size==0 (live but anomalous) can still be deleted. The Rust code
was using is_valid() which returns false for size==0, preventing
deletion of such needles.

* Match Go fitTtlCount: always normalize TTL to coarsest unit

Go's fitTtlCount always converts to seconds first, then finds the
coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an
early return for count<=255 that skipped normalization, producing
different binary encodings for the same duration.

* Match Go BuildInfo metric: correct name and add missing labels

Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build,
Name=info) with labels [version, commit, sizelimit, goos, goarch].
Rust had SeaweedFS_volumeServer_buildInfo with only [version].

* Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag

- UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty)
- UploadResult.contentMd5: only include when request provided Content-MD5 header
- Content-MD5 response header: only set when request provided it
- DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType)
  to match Go's protobuf JSON marshaling
- Chunk manifest: preserve needle ETag in expanded response headers

* Match Go volume: fix version(), integrity check, scrub, and commit_compact

- version(): use self.version() instead of self.super_block.version in
  read_all_needles, check_volume_data_integrity, scan_raw_needles_from
  to respect volumeInfo.version override
- check_volume_data_integrity: initialize healthy_index_size to idx_size
  (matching Go) and continue on EOF instead of returning error
- scrub(): count deleted needles in total_read since they still occupy
  space in the .dat file (matches Go's totalRead += actualSize for deleted)
- commit_compact: clean up .cpd/.cpx files on makeup_diff failure
  (matches Go's error path cleanup)

* Match Go write queue: add 4MB batch byte limit

Go's startWorker breaks the batch at either 128 requests or 4MB of
accumulated write data. Rust only had the 128-request limit, allowing
large writes to accumulate unbounded latency.

* Add TTL normalization tests for Go parity verification

Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even
when count fits in a byte, matching Go's fitTtlCount behavior.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go's free volume count subtracts both regular volumes and EC volumes
from max_volume_count. Rust was only counting regular volumes, which
could over-report available slots when EC shards are mounted.

* Match Go EC volume: mark deletions in .ecx and replay .ecj at startup

Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index
in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition
to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj
entries into .ecx on startup, then removes the .ecj file.

Rust was only appending to .ecj without marking .ecx, which meant
deleted EC needles remained readable via .ecx binary search. This
fix:
- Opens .ecx in read/write mode (was read-only)
- Adds mark_needle_deleted_in_ecx: binary search + in-place write
- Calls it from journal_delete before appending to .ecj
- Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup

* Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14

Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by
default, but custom EC configs via .vif can have more shards (up to
MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files
are checked regardless of EC configuration.

* Match Go EC locate: subtract 1 from shard size and use datFileSize override

Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to
LocateData (shards are padded, -1 avoids overcounting large block
rows). When datFileSize is known, Go uses datFileSize/DataShards
instead. Rust was passing the raw shard file size without adjustment.

* Fix TTL parsing and DiskStatus field names to match Go exactly

TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d,
not 1w) and errors on count > 255. The previous normalization change
was incorrect — Go only normalizes internally via fitTtlCount, not
during string parsing.

DiskStatus: Go uses encoding/json on protobuf structs, which reads
the json struct tags (snake_case: percent_free, percent_used,
disk_type), not the protobuf JSON names (camelCase). Revert to
snake_case to match Go's actual output.

* Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first

Match Go's volume_grpc_client_to_master.go behavior:
1. Only trigger leader redirect when the leader address differs from the
   current master (prevents unnecessary reconnect loops when master confirms
   its own address).
2. Process duplicated_uuids before leader redirect check, matching Go's
   ordering where duplicate UUID detection takes priority.

* Remove SetState version check to match Go behavior

Go's SetState unconditionally applies the state without any version
mismatch check. The Rust version had an extra optimistic concurrency
check that would reject valid requests from Go clients that don't
track versions.

* Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL

Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes
to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w,
24h->1d). The Rust version was preserving the original unit, producing
different binary encodings on disk and in heartbeat messages.

* Always return Content-MD5 header and JSON field on successful writes

Go always sets Content-MD5 in the response regardless of whether the
request included it. The Rust version was conditionally including it
only when the request provided Content-MD5.

* Include name and size in UploadResult JSON even when empty/zero

Go's encoding/json always includes empty strings and zero values in
the upload response. The Rust version was using skip_serializing_if
to omit them, causing JSON structure differences.

* Include deleted needles in scan_raw_needles_from to match Go

Go's ScanVolumeFileFrom visits ALL needles including deleted ones.
Skipping deleted entries during incremental copy would cause tombstones
to not be propagated, making deleted files reappear on the receiving side.

* Match Go NeedleMap.Delete: always write tombstone to idx file

Go's NeedleMap.Delete unconditionally writes a tombstone entry to the
idx file and updates metrics, even if the needle doesn't exist or is
already deleted. This is important for replication where every delete
operation must produce an idx write. The Rust version was skipping the
tombstone write for non-existent or already-deleted needles.

* Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest

* Title-case Seaweed-* pair keys to match Go HTTP header canonicalization

* Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType

* Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw()

* Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount

* Include disk-space-low in Volume.is_read_only() matching Go

* Log error on CIDR parse failure in whitelist matching Go's glog.Errorf

* Log cookie mismatch in gRPC Query matching Go's V(0).Infof

* Fix is_expired volume_size comparison to use < matching Go

Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than),
but Rust used `<=`. This meant Rust would fail to expire a volume that
is exactly SUPER_BLOCK_SIZE bytes.

* Apply Go's JWT expiry defaults: 10s write, 60s read

Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and
v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust
defaulted to 0 for both, which meant tokens would never expire when
security.toml has a signing key but omits expires_after_seconds.

* Stop [grpc.volume].ca from overriding [grpc].ca matching Go

Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e.
the [grpc] section. The [grpc.volume] section only provides cert and
key. Rust was also reading ca from [grpc.volume] which would silently
override the [grpc].ca value when both were present.

* Fix free_volume_count to use EC shard count matching Go

Was counting EC volumes instead of EC shards, which underestimates EC
space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1.
Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount.

* Include preallocate in compaction space check matching Go

Go uses max(preallocate, estimatedCompactSize) for the free space check.
Rust was only using the estimated volume size, which could start a
compaction that fails mid-way if preallocate exceeds the volume size.

* Check gzip magic bytes before setting Content-Encoding matching Go

Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent
(data starts with 0x1f 0x8b) before setting Content-Encoding: gzip.
Rust only checked Accept-Encoding, which could incorrectly declare
gzip encoding for non-gzip compressed data.

* Only set upload response name when needle HasName matching Go

Go checks reqNeedle.HasName() before setting ret.Name. Rust always set
the name from the filename variable, which could return the fid portion
of the path as the name for raw PUT requests without a filename.

* Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation

Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount
is 0, treating it as unlimited. Rust was computing effective_free as
<= 0 for max==0, rejecting the location. This could fail volume
creation during early startup before the first heartbeat adjusts max.

* Read lastAppendAtNs from deleted V3 entries in integrity check

Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries
(verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity).
Rust was skipping deleted entries, which could result in a stale
last_append_at_ns if the last index entry is a deletion.

* Return empty body for empty/oversized range requests matching Go

Go's ProcessRangeRequest returns nil (empty body, 200 OK) when
parsed ranges are empty or combined range size exceeds total content
size. The Rust buffered path incorrectly returned the full file data
for both cases. The streaming path already handled this correctly.

* Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL

Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex()
(ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls
vs.store.ScrubEcVolume(). Rust was ignoring the mode and always
running verify_ec_shards. Now INDEX mode checks ecx index integrity
(sorted overlap detection + file size validation) without shard I/O,
while LOCAL/FULL modes run the existing shard verification.

* Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount

Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit
that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week}
which displays as "1w". Both Go and Rust normalize identically.

* Add version mismatch check to SetState matching Go's State.Update

Go's State.Update compares the incoming version with the stored
version and returns "version mismatch" error if they differ. This
provides optimistic concurrency control. The Rust implementation
was accepting any version unconditionally.

* Use unquoted keys in Query JSON output matching Go's json.ToJson

Go's json.ToJson produces records with unquoted keys like
{score:12} not {"score":12}. This is a custom format used
internally by SeaweedFS for query results.

* Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w

Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount
which converts 7 days to 1 week.

* Include ETag header in 304 Not Modified responses matching Go behavior

Go sets ETag on the response writer (via SetEtag) before the
If-Modified-Since and If-None-Match conditional checks, so both
304 response paths include the ETag header. The Rust implementation
was only adding ETag to 200 responses.

* Remove needle-name fallback in chunk manifest filename resolution

Go's tryHandleChunkedFile only falls back from URL filename to
manifest name. Rust had an extra fallback to needle.name that
Go does not perform, which could produce different
Content-Disposition filenames for chunk manifests.

* Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5

Go's jwt.ParseWithClaims validates the nbf claim when present,
rejecting tokens whose nbf is in the future. The Rust jsonwebtoken
crate defaults validate_nbf to false, so tokens with future nbf
were incorrectly accepted.

* Set isHeartbeating to true at startup matching Go's VolumeServer init

Go unconditionally sets isHeartbeating: true in the VolumeServer
struct literal. Rust was starting with false when masters are
configured, causing /healthz to return 503 until the first
heartbeat succeeds.

* Call store.close() on shutdown matching Go's Shutdown()

Go's Shutdown() calls vs.store.Close() which closes all volumes
and flushes file handles. The Rust server was relying on process
exit for cleanup, which could leave data unflushed.

* Include server ID in maintenance mode error matching Go's format

Go returns "volume server %s is in maintenance mode" with the
store ID. Rust was returning a generic "maintenance mode" message.

* Fix DiskType test: use HardDrive variant matching Go's HddType=""

Go maps both "" and "hdd" to HardDriveType (empty string). The
Rust enum variant is HardDrive, not Hdd. The test referenced a
nonexistent Hdd variant causing compilation failure.

* Do not include ETag in 304 responses matching Go's GetOrHeadHandler

Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match
304 return paths, so Go's 304 responses do not include the ETag header.
The Rust code was incorrectly including ETag in both 304 response paths.

* Return 400 on malformed query strings in PostHandler matching Go's ParseForm

Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when
the query string is malformed. Rust was silently falling back to empty
query params via unwrap_or_default().

* Load EC volume version from .vif matching Go's NewEcVolume

Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif
file. Rust was always using Version::current() (V3), which would produce
wrong needle actual size calculations for volumes created with V1 or V2.

* Sync .ecx file before close matching Go's EcVolume.Close

Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion
marks are flushed to disk. Without this, deletion marks written via
MarkNeedleDeleted could be lost on crash.

* Validate SuperBlock extra data size matching Go's Bytes() guard

Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent
corrupt super block headers. Rust was silently truncating via u16 cast,
which would write an incorrect extra_size field.

* Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98

Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable
to unauthenticated remote DoS via panic in QUIC transport parameter
parsing.

* Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server

Go's r.FormValue() cannot read multipart text fields after
r.MultipartReader() consumes the body, so ts/ttl sent as multipart
form fields only work with the Rust volume server. Skip this test
when VOLUME_SERVER_IMPL != "rust" to fix CI failure.

* Flush .ecx in EC volume sync_to_disk matching Go's Sync()

Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index
to disk. The Rust version only flushed .ecj, leaving in-place deletion
marks in .ecx unpersisted until close(). This could cause data
inconsistency if the server crashes after marking a needle deleted in
.ecx but before close().

* Remove .vif file in EC volume destroy matching Go's Destroy()

Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust
version only removed .ecx and .ecj, leaving orphaned .vif files on
disk after EC volume destruction (e.g., after TTL expiry).

* Fix is_expired to use <= for SuperBlockSize check matching Go

Go checks contentSize <= SuperBlockSize to detect empty volumes (no
needles). Rust used < which would incorrectly allow a volume with
exactly SuperBlockSize bytes (header only, no data) to proceed to
the TTL expiry check and potentially be marked as expired.

* Fix read_append_at_ns to read timestamps from tombstone entries

Go reads the full needle body for all entries including tombstones
(deleted needles with size=0) to extract the actual AppendAtNs
timestamp. The Rust version returned 0 early for size <= 0 entries,
which would cause the binary search in incremental copy to produce
incorrect results for positions containing deleted needles.

Now uses get_actual_size to compute the on-disk size (which handles
tombstones correctly) and only returns 0 when the actual size is 0.

* Add X-Request-Id response header matching Go's requestIDMiddleware

Go sets both X-Request-Id and x-amz-request-id response headers.
The Rust server only set x-amz-request-id, missing X-Request-Id.

* Add skip_serializing_if for UploadResult name and size fields

Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty",
omitting these fields from JSON when they are zero values (empty
string / 0). The Rust struct always serialized them, producing
"name":"" and "size":0 where Go would omit them.

* Support JSONP/pretty-print for write success responses

Go's writeJsonQuiet checks for callback (JSONP) and pretty query
parameters on all JSON responses including write success. The Rust
write success path used axum::Json directly, bypassing JSONP and
pretty-print support. Now uses json_result_with_query to match Go.

* Include actual limit in file size limit error message

Go returns "file over the limited %d bytes" with the actual limit
value included. Rust returned a generic "file size limit exceeded"
without the limit value, making it harder to debug.

* Extract extension from 2-segment URL paths for image operations

Go's parseURLPath extracts the file extension from all URL formats
including 2-segment paths like /vid,fid.jpg. The Rust version only
handled 3-segment paths (/vid/fid/filename.ext), so extensions in
2-segment paths were lost. This caused image resize/crop operations
requested via query params to be silently skipped for those paths.

* Add size_hint to TrackedBody so throttled downloads get Content-Length

TrackedBody (used for download throttling) did not implement
size_hint(), causing HTTP/1.1 to fall back to chunked transfer
encoding instead of setting Content-Length. Go always sets
Content-Length explicitly for non-range responses.

* Add Last-Modified, pairs, and S3 headers to chunk manifest responses

Go sets Last-Modified, needle pairs, and S3 pass-through headers on
the response writer BEFORE calling tryHandleChunkedFile. Since the
Rust chunk manifest handler created fresh response headers and
returned early, these headers were missing from chunk manifest
responses. Now passes last_modified_str into the chunk manifest
handler and applies pairs and S3 pass-through query params
(response-cache-control, response-content-encoding, etc.) to the
chunk manifest response headers.

* Fix multipart fallback to use first part data when no filename

Go reads the first part's data unconditionally, then looks for a
part with a filename. If none found, Go uses the first part's data
(with empty filename). Rust only captured parts with filenames, so
when no part had a filename it fell back to the raw multipart body
bytes (including boundary delimiters), producing corrupt needle data.

* Set HasName and HasMime flags for empty values matching Go

Go's CreateNeedleFromRequest sets HasName and HasMime flags even when
the filename or MIME type is empty (len < 256 is true for len 0).
Rust skipped empty values, causing the on-disk needle format to
differ: Go-written needles include extra bytes for the empty name/mime
size fields, changing the serialized needle size in the idx entry.
This ensures binary format compatibility between Go and Rust servers.

* Add is_stopping guard to vacuum_volume_commit matching Go

Go's CommitCompactVolume (store_vacuum.go L53-54) checks
s.isStopping before committing compaction to prevent file
swaps during shutdown. The Rust handler was missing this
check, which could allow compaction commits while the
server is stopping.

* Remove disk_type from required status fields since Go omits it

Go's default DiskType is "" (HardDriveType), and protobuf's omitempty
tag causes empty strings to be dropped from JSON output.

* test: honor rust env in dual volume harness

* grpc: notify master after volume lifecycle changes

* http: proxy to replicas before download-limit timeout

* test: pass readMode to rust volume harnesses

* fix store free-location predicate selection

* fix volume copy disk placement and heartbeat notification

* fix chunk manifest delete replication

* fix write replication to survive client disconnects

* fix download limit proxy and wait flow

* fix crop gating for streamed reads

* fix upload limit wait counter behavior

* fix chunk manifest image transforms

* fix has_resize_ops to check width/height > 0 instead of is_some()

Go's shouldResizeImages condition is `width > 0 || height > 0`, so
`?width=0` correctly evaluates to false. Rust was using `is_some()`
which made `?width=0` evaluate to true, unnecessarily disabling
streaming reads for those requests.

* fix Content-MD5 to only compute and return when provided by client

Go only computes the MD5 of uncompressed data when a Content-MD5
header or multipart field is provided. Rust was always computing and
returning it. Also fix the mismatch error message to include size,
matching Go's format.

* fix save_vif to compute ExpireAtSec from TTL

Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds
when the volume has a TTL. The save_vif path (used by set_read_only
and set_writable) was missing this computation, causing .vif files
to be written without the correct expiration timestamp for TTL volumes.

* fix set_writable to not modify no_write_can_delete

Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists.
Rust was additionally setting no_write_can_delete=has_remote_file,
which could incorrectly change the write mode for remote-file volumes
when the master explicitly asks to make the volume writable.

* fix write_needle_blob_and_index to error on too-small V3 blob

Go returns an error when the needle blob is too small for timestamp
patching. Rust was silently skipping the patch and writing the blob
with a stale/zero timestamp, which could cause data integrity issues
during incremental replication that relies on AppendAtNs ordering.

* fix VolumeEcShardsToVolume to validate dataShards range

Go validates that dataShards is > 0 and <= MaxShardCount before
proceeding with EC-to-volume reconstruction. Without this check,
a zero or excessively large data_shards value could cause confusing
downstream failures.

* fix destroy to use VolumeError::NotEmpty instead of generic Io error

The dedicated NotEmpty variant exists in the enum but was not being
used. This makes error matching consistent with Go's ErrVolumeNotEmpty.

* fix SetState to persist state to disk with rollback on failure

Go's State.Update saves VolumeServerState to a state.pb file after
each SetState call, and rolls back the in-memory state if persistence
fails. Rust was only updating in-memory atomics, so maintenance mode
would be lost on server restart. Now saves protobuf-encoded state.pb
and loads it on startup.

* fix VolumeTierMoveDatToRemote to close local dat backend after upload

Go calls v.LoadRemoteFile() after saving volume info, which closes
the local DataBackend before transitioning to remote storage. Without
this, the volume holds a stale file handle to the deleted local .dat
file, causing reads to fail until server restart.

* fix VolumeTierMoveDatFromRemote to close remote dat backend after download

Go calls v.DataBackend.Close() and sets DataBackend=nil after removing
the remote file reference. Without this, the stale remote backend
state lingers and reads may not discover the newly downloaded local
.dat file until server restart.

* fix redirect to use internal url instead of public_url

Go's proxyReqToTargetServer builds the redirect Location header from
loc.Url (the internal URL), not publicUrl. Using public_url could
cause redirect failures when internal and external URLs differ.

* fix redirect test and add state_file_path to integration test

Update redirect unit test to expect internal url (matching the
previous fix). Add missing state_file_path field to the integration
test VolumeServerState constructor.

* fix FetchAndWriteNeedle to await all writes before checking errors

Go uses a WaitGroup to await all writes (local + replicas) before
checking errors. Rust was short-circuiting on local write failure,
which could leave replica writes in-flight without waiting for
completion.

* fix shutdown to send deregister heartbeat before pre_stop delay

Go's StopHeartbeat() closes stopChan immediately on interrupt, causing
the heartbeat goroutine to send the deregister heartbeat right away,
before the preStopSeconds delay. Rust was only setting is_stopping=true
without waking the heartbeat loop, so the deregister was delayed until
after the pre_stop sleep. Now we call volume_state_notify.notify_one()
to wake the heartbeat immediately.

* fix heartbeat response ordering to check duplicate UUIDs first

Go processes heartbeat responses in this order: DuplicatedUuids first,
then volume options (prealloc/size limit), then leader redirect. Rust
was applying volume options before checking for duplicate UUIDs, which
meant volume option changes would take effect even when the response
contained a duplicate UUID error that should cause an immediate return.

* the test thread was blocked

* fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories

Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which
fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode
bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check
logic error).

* fix: match Go Content-MD5 mismatch error message format

Go uses "Content-MD5 did not match md5 of file data expected [X]
received [Y] size Z" while Rust had a shorter format. Match the
exact Go error string so clients see identical messages.

* fix: match Go Bearer token length check (> 7, not >= 7)

Go requires len(bearer) > 7 ensuring at least one char after
"Bearer ". Rust used >= 7 which would accept an empty token.

* fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4

aws-sdk-s3's default "rustls" feature enables tls-rustls in
aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21
→ rustls-webpki 0.101.7, moderate CRL advisory). Replace with
explicit default-https-client which uses only rustls 0.23 /
rustls-webpki 0.103.9.

* fix: use uploaded filename for auto-compression extension detection

Go extracts the file extension from pu.FileName (the uploaded
filename) for auto-compression decisions. Rust was using the URL
path, which typically has no extension for SeaweedFS file IDs.

* fix: add CRC legacy Value() backward-compat check on needle read

Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) !=
crc.Value(). The Value() path is a deprecated transform for compat
with seaweed versions prior to commit 056c480eb. Rust had the
legacy_value() method but wasn't using it in validation.

* fix: remove /stats/* endpoints to match Go (commented out since L130)

Go's volume_server.go has the /stats/counter, /stats/memory, and
/stats/disk endpoints commented out (lines 130-134). Remove them
from the Rust router along with the now-unused whitelist_guard
middleware.

* fix: filter application/octet-stream MIME for chunk manifests

Go's tryHandleChunkedFile (L334) filters out application/octet-stream
from chunk manifest MIME types, falling back to extension-based
detection. Rust was returning the stored MIME as-is for manifests.

* fix: VolumeMarkWritable returns error before notifying master

Go returns early at L200 if MarkVolumeWritable fails, before
reaching the master notification at L206. Rust was notifying master
even on failure, creating inconsistent state where master thinks
the volume is writable but local marking failed.

* fix: check volume existence before maintenance in MarkReadonly/Writable

Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255)
look up the volume first, then call makeVolumeReadonly/Writable which
checks maintenance. Rust was checking maintenance first, returning
"maintenance mode" instead of "not found" for missing volumes.

* feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360)

Add the mark_broken_volumes_readonly flag from PR #8360:
- Sync proto field (tag 3) to local volume_server.proto
- After scrubbing, if flag is set, call makeVolumeReadonly on each
  broken volume (notify master, mark local readonly, notify again)
- Collect errors via joined error semantics matching Go's errors.Join
- Factor out make_volume_readonly helper reused by both
  VolumeMarkReadonly and ScrubVolume

Also refactors VolumeMarkReadonly to use the shared helper.

* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4)

CRL Distribution Point matching logic fix for moderate severity
advisory about CRLs not considered authoritative.

* test: update integration tests for removed /stats/* endpoints

Replace tests that expected /stats/* routes to return 200/401 with
tests confirming they now fall through to the store handler (400),
matching Go's commented-out stats endpoints.

* docs: fix misleading comment about default offset feature

The comment said "4-byte offsets unless explicitly built with 5-byte
support" but the default feature enables 5bytes. This is intentional
for production parity with Go -tags 5BytesOffset builds. Fix the
comment to match reality.
2026-03-26 17:24:35 -07:00
Lisandro Pin
e5cf2d2a19 Give the ScrubVolume() RPC an option to flag found broken volumes as read-only. (#8360)
* Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only.

Also exposes this option in the shell `volume.scrub` command.

* Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`.

417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`,
so calling them for invalid volume IDs will actually yield that error, instead of checking
maintnenance mode first.
2026-03-26 10:20:57 -07:00
Chris Lu
94bfa2b340 mount: stream all filer mutations over single ordered gRPC stream (#8770)
* filer: add StreamMutateEntry bidi streaming RPC

Add a bidirectional streaming RPC that carries all filer mutation
types (create, update, delete, rename) over a single ordered stream.
This eliminates per-request connection overhead for pipelined
operations and guarantees mutation ordering within a stream.

The server handler delegates each request to the existing unary
handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy
stream adapter for rename operations to reuse StreamRenameEntry logic.

The is_last field signals completion for multi-response operations
(rename sends multiple events per request; create/update/delete
always send exactly one response with is_last=true).

* mount: add streaming mutation multiplexer (streamMutateMux)

Implement a client-side multiplexer that routes all filer mutation RPCs
(create, update, delete, rename) over a single bidirectional gRPC
stream. Multiple goroutines submit requests through a send channel;
a dedicated sendLoop serializes them on the stream; a recvLoop
dispatches responses to waiting callers via per-request channels.

Key features:
- Lazy stream opening on first use
- Automatic reconnection on stream failure
- Permanent fallback to unary RPCs if filer returns Unimplemented
- Monotonic request_id for response correlation
- Multi-response support for rename operations (is_last signaling)

The mux is initialized on WFS and closed during unmount cleanup.
No call sites use it yet — wiring comes in subsequent commits.

* mount: route CreateEntry and UpdateEntry through streaming mux

Wire all CreateEntry call sites to use wfs.streamCreateEntry() which
routes through the StreamMutateEntry stream when available, falling
back to unary RPCs otherwise. Also wire Link's UpdateEntry calls
through wfs.streamUpdateEntry().

Updated call sites:
- flushMetadataToFiler (file flush after write)
- Mkdir (directory creation)
- Symlink (symbolic link creation)
- createRegularFile non-deferred path (Mknod)
- flushFileMetadata (periodic metadata flush)
- Link (hard link: update source + create link + rollback)

* mount: route UpdateEntry and DeleteEntry through streaming mux

Wire remaining mutation call sites through the streaming mux:

- saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry
- Unlink → streamDeleteEntry (replaces RemoveWithResponse)
- Rmdir → streamDeleteEntry (replaces RemoveWithResponse)

All filer mutations except Rename now go through StreamMutateEntry
when the filer supports it, with automatic unary RPC fallback.

* mount: route Rename through streaming mux

Wire Rename to use streamMutate.Rename() when available, with
fallback to the existing StreamRenameEntry unary stream.

The streaming mux sends rename as a StreamRenameEntryRequest oneof
variant. The server processes it through the existing rename logic
and sends multiple StreamRenameEntryResponse events (one per moved
entry), with is_last=true on the final response.

All filer mutations now go through a single ordered stream.

* mount: fix stream mux connection ownership

WithGrpcClient(streamingMode=true) closes the gRPC connection when
the callback returns, destroying the stream. Own the connection
directly via pb.GrpcDial so it stays alive for the stream's lifetime.
Close it explicitly in recvLoop on stream failure and in Close on
shutdown.

* mount: fix rename failure for deferred-create files

Three fixes for rename operations over the streaming mux:

1. lookupEntry: fall back to local metadata store when filer returns
   "not found" for entries in uncached directories. Files created with
   deferFilerCreate=true exist only in the local leveldb store until
   flushed; lookupEntry skipped the local store when the parent
   directory had never been readdir'd, causing rename to fail with
   ENOENT.

2. Rename: wait for pending async flushes and force synchronous flush
   of dirty metadata before sending rename to the filer. Covers the
   writebackCache case where close() defers the flush to a background
   worker that may not complete before rename fires.

3. StreamMutateEntry: propagate rename errors from server to client.
   Add error/errno fields to StreamMutateEntryResponse so the mount
   can map filer errors to correct FUSE status codes instead of
   silently returning OK. Also fix the existing Rename error handler
   which could return fuse.OK on unrecognized errors.

* mount: fix streaming mux error handling, sendLoop lifecycle, and fallback

Address PR review comments:

1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for
   create/update/delete errors, not just rename. Previously update errors
   were silently dropped and create/delete errors were only in nested
   response fields that the client didn't check.

2. Client: check nested error fields in CreateEntry (ErrorCode, Error)
   and DeleteEntry (Error) responses, matching CreateEntryWithResponse
   behavior.

3. Fix sendLoop lifecycle: give each stream generation a stopSend channel.
   recvLoop closes it on error to stop the paired sendLoop. Previously a
   reconnect left the old sendLoop draining sendCh, breaking ordering.

4. Transparent fallback: stream helpers and doRename fall back to unary
   RPCs on transport errors (ErrStreamTransport), including the first
   Unimplemented from ensureStream. Previously the first call failed
   instead of degrading.

5. Filer rotation in openStream: try all filer addresses on dial failure,
   matching WithFilerClient behavior. Stop early on Unimplemented.

6. Pass metadata-bearing context to StreamMutateEntry RPC call so
   sw-client-id header is actually sent.

7. Gate lookupEntry local-cache fallback on open dirty handle or pending
   async flush to avoid resurrecting deleted/renamed entries.

8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil).

9. Use string matching for rename error-to-errno mapping in the mount to
   stay portable across Linux/macOS (numeric errno values differ).

* mount: make failAllPending idempotent with delete-before-close

Change failAllPending to collect pending entries into a local slice
(deleting from the sync.Map first) before closing channels. This
prevents double-close panics if called concurrently. Also remove the
unused err parameter.

* mount: add stream generation tracking and teardownStream

Introduce a generation counter on streamMutateMux that increments each
time a new stream is created. Requests carry the generation they were
enqueued for so sendLoop can reject stale requests after reconnect.

Add teardownStream(gen) which is idempotent (only acts when gen matches
current generation and stream is non-nil). Both sendLoop and recvLoop
call it on error, replacing the inline cleanup in recvLoop. sendLoop now
actively triggers teardown on send errors instead of silently exiting.

ensureStream waits for the prior generation's recvDone before creating a
new stream, ensuring all old pending waiters are failed before reconnect.
recvLoop now takes the stream, generation, and recvDone channel as
parameters to avoid accessing shared fields without the lock.

* mount: harden Close to prevent races with teardownStream

Nil out stream, cancel, and grpcConn under the lock so that any
concurrent teardownStream call from recvLoop/sendLoop becomes a no-op.
Call failAllPending before closing sendCh to unblock waiters promptly.
Guard recvDone with a nil check for the case where Close is called
before any stream was ever opened.

* mount: make errCh receive ctx-aware in doUnary and Rename

Replace the blocking <-sendReq.errCh with a select that also observes
ctx.Done(). If sendLoop exits via stopSend without consuming a buffered
request, the caller now returns ctx.Err() instead of blocking forever.
The buffered errCh (capacity 1) ensures late acknowledgements from
sendLoop don't block the sender.

* mount: fix sendLoop/Close race and recvLoop/teardown pending channel race

Three related fixes:

1. Stop closing sendCh in Close(). Closing the shared producer channel
   races with callers who passed ensureStream() but haven't sent yet,
   causing send-on-closed-channel panics. sendCh is now left open;
   ensureStream checks m.closed to reject new callers.

2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh()
   on exit so buffered requests get an ErrStreamTransport on their errCh
   instead of blocking forever. Close() drains again for any stragglers
   enqueued between sendLoop's drain and the final shutdown.

3. Move failAllPending from teardownStream into recvLoop's defer.
   teardownStream (called from sendLoop on send error) was closing
   pending response channels while recvLoop could be between
   pending.Load and the channel send — a send-on-closed-channel panic.
   recvLoop is now the sole closer of pending channels, eliminating
   the race. Close() waits on recvDone (with cancel() to guarantee
   Recv unblocks) so pending cleanup always completes.

* filer/mount: add debug logging for hardlink lifecycle

Add V(0) logging at every point where a HardLinkId is created, stored,
read, or deleted to trace orphaned hardlink references. Logging covers:

- gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId
- FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId
- handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count
- setHardLink: KvPut with blob size
- maybeReadHardLink: V(1) on read attempt and successful decode
- DeleteHardLink: counter decrement/deletion events
- Mount Link(): when NewHardLinkId is generated and link is created

This helps diagnose how a git pack .rev file ended up with a
HardLinkId during a clone (no hard links should be involved).

* test: add git clone/pull integration test for FUSE mount

Shell script that exercises git operations on a SeaweedFS mount:
1. Creates a bare repo on the mount
2. Clones locally, makes 3 commits, pushes to mount
3. Clones from mount bare repo into an on-mount working dir
4. Verifies clone integrity (files, content, commit hashes)
5. Pushes 2 more commits with renames and deletes
6. Checks out an older revision on the mount clone
7. Returns to branch and pulls with real changes
8. Verifies file content, renames, deletes after pull
9. Checks git log integrity and clean status

27 assertions covering file existence, content, commit hashes,
file counts, renames, deletes, and git status. Run against any
existing mount: bash test-git-on-mount.sh /path/to/mount

* test: add git clone/pull FUSE integration test to CI suite

Add TestGitOperations to the existing fuse_integration test framework.
The test exercises git's full file operation surface on the mount:

  1. Creates a bare repo on the mount (acts as remote)
  2. Clones locally, makes 3 commits (files, bulk data, renames), pushes
  3. Clones from mount bare repo into an on-mount working dir
  4. Verifies clone integrity (content, commit hash, file count)
  5. Pushes 2 more commits with new files, renames, and deletes
  6. Checks out an older revision on the mount clone
  7. Returns to branch and pulls with real fast-forward changes
  8. Verifies post-pull state: content, renames, deletes, file counts
  9. Checks git log integrity (5 commits) and clean status

Runs automatically in the existing fuse-integration.yml CI workflow.

* mount: fix permission check with uid/gid mapping

The permission checks in createRegularFile() and Access() compared
the caller's local uid/gid against the entry's filer-side uid/gid
without applying the uid/gid mapper. With -map.uid 501:0, a directory
created as uid 0 on the filer would not match the local caller uid
501, causing hasAccess() to fall through to "other" permission bits
and reject write access (0755 → other has r-x, no w).

Fix: map entry uid/gid from filer-space to local-space before the
hasAccess() call so both sides are in the same namespace.

This fixes rsync -a failing with "Permission denied" on mkstempat
when using uid/gid mapping.

* mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel

Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to
restore local uid/gid, but `outputPbEntry` writes the kernel response
before the function returns — so the kernel received filer-side
uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent
child operations (mkdir, create) because the caller uid (501) doesn't
match the directory owner (0), and "other" bits (0755 → r-x) lack
write permission.

Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal
before outputPbEntry, so the kernel gets local uid/gid.

Also add nil guards for UidGidMapper in Access and createRegularFile
to prevent panics in tests that don't configure a mapper.

This fixes rsync -a "Permission denied" on mkpathat for nested
directories when using uid/gid mapping.

* mount: fix Link outputting filer-side uid/gid to kernel, add nil guards

Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink:
the kernel received filer-side uid/gid because the defer hadn't run
yet when outputPbEntry wrote the response.

Also add nil guards for UidGidMapper in Access and createRegularFile
so tests without a mapper don't panic.

Audit of all outputPbEntry/outputFilerEntry call sites:
- Mkdir: fixed in prior commit (explicit map before output)
- Symlink: fixed in prior commit (explicit map before output)
- Link: fixed here (explicit map before output)
- Create (existing file): entry from maybeLoadEntry (already mapped)
- Create (deferred): entry has local uid/gid (never mapped to filer)
- Create (non-deferred): createRegularFile defer runs before return
- Mknod: createRegularFile defer runs before return
- Lookup: entry from lookupEntry (already mapped)
- GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped)
- readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped)
- saveEntry: no kernel output
- flushMetadataToFiler: no kernel output
- flushFileMetadata: no kernel output

* test: fix git test for same-filesystem FUSE clone

When both the bare repo and working clone live on the same FUSE mount,
git's local transport uses hardlinks and cross-repo stat calls that
fail on FUSE. Fix:

- Use --no-local on clone to disable local transport optimizations
- Use reset --hard instead of checkout to stay on branch
- Use fetch + reset --hard origin/<branch> instead of git pull to
  avoid local transport stat failures during fetch

* adjust logging

* test: use plain git clone/pull to exercise real FUSE behavior

Remove --no-local and fetch+reset workarounds. The test should use
the same git commands users run (clone, reset --hard, pull) so it
reveals real FUSE issues rather than hiding them.

* test: enable V(1) logging for filer/mount and collect logs on failure

- Run filer and mount with -v=1 so hardlink lifecycle logs (V(0):
  create/delete/insert, V(1): read attempts) are captured
- On test failure, automatically dump last 16KB of all process logs
  (master, volume, filer, mount) to test output
- Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload
- Update CI workflow to upload SeaweedFS process logs alongside test output

* mount: clone entry for filer flush to prevent uid/gid race

flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which
returns the file handle's live proto entry pointer, then mutated it
in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a
concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable)
could observe filer-side uid/gid (e.g., 0:0) on the file handle entry
and return it to the kernel. The kernel caches these attributes, so
subsequent opens by the local user (uid 501) fail with EACCES.

Fix: proto.Clone the entry before mapping uid/gid for the filer request.
The file handle's live entry is never mutated, so concurrent Lookup
always sees local uid/gid.

This fixes the intermittent "Permission denied" on .git/FETCH_HEAD
after the first git pull on a mount with uid/gid mapping.

* mount: add debug logging for stale lock file investigation

Add V(0) logging to trace the HEAD.lock recreation issue:
- Create: log when O_EXCL fails (file already exists) with uid/gid/mode
- completeAsyncFlush: log resolved path, saved path, dirtyMetadata,
  isDeleted at entry to trace whether async flush fires after rename
- flushMetadataToFiler: log the dir/name/fullpath being flushed

This will show whether the async flush is recreating the lock file
after git renames HEAD.lock → HEAD.

* mount: prevent async flush from recreating renamed .lock files

When git renames HEAD.lock → HEAD, the async flush from the prior
close() can run AFTER the rename and re-insert HEAD.lock into the
meta cache via its CreateEntryRequest response event. The next git
pull then sees HEAD.lock and fails with "File exists".

Fix: add isRenamed flag on FileHandle, set by Rename before waiting
for the pending async flush. The async flush checks this flag and
skips the metadata flush for renamed files (same pattern as isDeleted
for unlinked files). The data pages still flush normally.

The Rename handler flushes deferred metadata synchronously (Case 1)
before setting isRenamed, ensuring the entry exists on the filer for
the rename to proceed. For already-released handles (Case 2), the
entry was created by a prior flush.

* mount: also mark renamed inodes via entry.Attributes.Inode fallback

When GetInode fails (Forget already removed the inode mapping), the
Rename handler couldn't find the pending async flush to set isRenamed.
The async flush then recreated the .lock file on the filer.

Fix: fall back to oldEntry.Attributes.Inode to find the pending async
flush when the inode-to-path mapping is gone. Also extract
MarkInodeRenamed into a method on FileHandleToInode for clarity.

* mount: skip async metadata flush when saved path no longer maps to inode

The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock
because neither GetInode nor oldEntry.Attributes.Inode could find the
inode (Forget already evicted the mapping, and the entry's stored
inode was 0).

Add a direct check in completeAsyncFlush: before flushing metadata,
verify that the saved path still maps to this inode in the inode-to-path
table. If the path was renamed or removed (inode mismatch or not found),
skip the metadata flush to avoid recreating a stale entry.

This catches all rename cases regardless of whether the Rename handler
could set the isRenamed flag.

* mount: wait for pending async flush in Unlink before filer delete

Unlink was deleting the filer entry first, then marking the draining
async-flush handle as deleted. The async flush worker could race between
these two operations and recreate the just-unlinked entry on the filer.
This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to
persist after git pull, breaking subsequent git operations.

Move the isDeleted marking and add waitForPendingAsyncFlush() before
the filer delete so any in-flight flush completes first. Even if the
worker raced past the isDeleted check, the wait ensures it finishes
before the filer delete cleans up any recreated entry.

* mount: reduce async flush and metadata flush log verbosity

Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and
flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for
every file close with writebackCache and are too noisy for normal use.

* filer: reduce hardlink debug log verbosity from V(0) to V(4)

HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and
filer_grpc_server fire on every hardlinked file operation (git pack
files use hardlinks extensively) and produce excessive noise.

* mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check

- weedfs_link.go: hardlink creation logs V(0) → V(4)
- weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1)
- empty_folder_cleaner.go: "not empty" check log V(0) → V(4)

* filer: handle missing hardlink KV as expected, not error

A "kv: not found" on hardlink read is normal when the link blob was
already cleaned up but a stale entry still references it. Log at V(1)
for not-found; keep Error level for actual KV failures.

* test: add waitForDir before git pull in FUSE git operations test

After git reset --hard, the FUSE mount's metadata cache may need a
moment to settle on slow CI. The git pull subprocess (unpack-objects)
could fail to stat the working directory. Poll for up to 5s.

* Update git_operations_test.go

* wait

* test: simplify FUSE test framework to use weed mini

Replace the 4-process setup (master + volume + filer + mount) with
2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies
startup, reduces port allocation, and is faster on CI.

* test: fix mini flag -admin → -admin.ui
2026-03-25 20:06:34 -07:00
Chris Lu
0b3867dca3 filer: add structured error codes to CreateEntryResponse (#8767)
* filer: add FilerError enum and error_code field to CreateEntryResponse

Add a machine-readable error code alongside the existing string error
field. This follows the precedent set by PublishMessageResponse in the
MQ broker proto. The string field is kept for human readability and
backward compatibility.

Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE,
EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS.

* filer: add sentinel errors and error code mapping in filer_pb

Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in
the filer_pb package so both the filer and consumers can reference them
without circular imports.

Add FilerErrorToSentinel() to map proto error codes to sentinels, and
update CreateEntryWithResponse() to check error_code first, falling back
to the string-based path for backward compatibility with old servers.

* filer: return wrapped sentinel errors and set proto error codes

Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and
ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using
%w). This preserves errors.Is() traversal on the server side.

In the gRPC CreateEntry handler, map sentinel errors to the
corresponding FilerError proto codes using errors.Is(), setting both
resp.Error (string, for backward compat) and resp.ErrorCode (enum).

* S3: use errors.Is() with filer sentinels instead of string matching

Replace fragile string-based error matching in filerErrorToS3Error and
other S3 API consumers with errors.Is() checks against filer_pb sentinel
errors. This works because the updated CreateEntryWithResponse helper
reconstructs sentinel errors from the proto FilerError code.

Update iceberg stage_create and metadata_files to check resp.ErrorCode
instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is()
for the already-exists check.

String matching is retained only for non-filer errors (gRPC transport
errors, checksum validation) that don't go through CreateEntryResponse.

* filer: remove backward-compat string fallbacks for error codes

Clients and servers are always deployed together, so there is no need
for backward-compatibility fallback paths that parse resp.Error strings
when resp.ErrorCode is unset. Simplify all consumers to rely solely on
the structured error code.

* iceberg: ensure unknown non-OK error codes are not silently ignored

When FilerErrorToSentinel returns nil for an unrecognized error code,
return an error including the code and message rather than falling
through to return nil.

* filer: fix redundant error message and restore error wrapping in helper

Use request path instead of resp.Error in the sentinel error format
string to avoid duplicating the sentinel message (e.g. "entry already
exists: entry already exists"). Restore %w wrapping with errors.New()
in the fallback paths so callers can use errors.Is()/errors.As().

* filer: promote file to directory on path conflict instead of erroring

S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object)
to coexist because S3 has a flat key space. When ensureParentDirectoryEntry
finds a parent path that is a file instead of a directory, promote it to
a directory by setting ModeDir while preserving the original content and
chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry
type-change guard.

This fixes the S3 compatibility test failures where creating overlapping
keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.
2026-03-24 17:08:22 -07:00
Chris Lu
81369b8a83 improve: large file sync throughput for remote.cache and filer.sync (#8676)
* improve large file sync throughput for remote.cache and filer.sync

Three main throughput improvements:

1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file
   instead of always starting at 5MB. A 500MB file now uses ~16MB chunks
   (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk
   overhead (volume assign, gRPC call, needle write) by 3x.

2. Configurable concurrency at every layer:
   - remote.cache chunk concurrency: -chunkConcurrency flag (default 8)
   - remote.cache S3 download concurrency: -downloadConcurrency flag
     (default raised from 1 to 5 per chunk)
   - filer.sync chunk concurrency: -chunkConcurrency flag (default 32)

3. S3 multipart download concurrency raised from 1 to 5: the S3 manager
   downloader was using Concurrency=1, serializing all part downloads
   within each chunk. This alone can 5x per-chunk download speed.

The concurrency values flow through the gRPC request chain:
  shell command → CacheRemoteObjectToLocalClusterRequest →
  FetchAndWriteNeedleRequest → S3 downloader

Zero values in the request mean "use server defaults", maintaining
full backward compatibility with existing callers.

Ref #8481

* fix: use full maxMB for chunk size cap and remove loop guard

Address review feedback:
- Use full maxMB instead of maxMB/2 for maxChunkSize to avoid
  unnecessarily limiting chunk size for very large files.
- Remove chunkSize < maxChunkSize guard from the safety loop so it
  can always grow past maxChunkSize when needed to stay under 1000
  chunks (e.g., extremely large files with small maxMB).

* address review feedback: help text, validation, naming, docs

- Fix help text for -chunkConcurrency and -downloadConcurrency flags
  to say "0 = server default" instead of advertising specific numeric
  defaults that could drift from the server implementation.
- Validate chunkConcurrency and downloadConcurrency are within int32
  range before narrowing, returning a user-facing error if out of range.
- Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions.
- Add doc comment to SetChunkConcurrency noting it must be called
  during initialization before replication goroutines start.
- Replace doubling loop in chunk size safety check with direct
  ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap.

* address Copilot review: clamp concurrency, fix chunk count, clarify proto docs

- Use ceiling division for chunk count check to avoid overcounting
  when file size is an exact multiple of chunk size.
- Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024
  at filer, max 64 at volume server) to prevent excessive goroutines.
- Always use ReadFileWithConcurrency when the client supports it,
  falling back to the implementation's default when value is 0.
- Clarify proto comments that download_concurrency only applies when
  the remote storage client supports it (currently S3).
- Include specific server defaults in help text (e.g., "0 = server
  default 8") so users see the actual values in -h output.

* fix data race on executionErr and use %w for error wrapping

- Protect concurrent writes to executionErr in remote.cache worker
  goroutines with a sync.Mutex to eliminate the data race.
- Use %w instead of %v in volume_grpc_remote.go error formatting
  to preserve the error chain for errors.Is/errors.As callers.
2026-03-17 16:49:56 -07:00
Chris Lu
acea36a181 filer: add conditional update preconditions (#8647)
* filer: add conditional update preconditions

* iceberg: tighten metadata CAS preconditions
2026-03-16 12:33:32 -07:00
Chris Lu
a838661b83 feat(plugin): EC shard balance handler for plugin worker (#8629)
* feat(ec_balance): add TaskTypeECBalance constant and protobuf definitions

Add the ec_balance task type constant to both topology and worker type
systems. Define EcBalanceTaskParams, EcShardMoveSpec, and
EcBalanceTaskConfig protobuf messages for EC shard balance operations.

* feat(ec_balance): add configuration for EC shard balance task

Config includes imbalance threshold, min server count, collection
filter, disk type, and preferred tags for tag-aware placement.

* feat(ec_balance): add multi-phase EC shard balance detection algorithm

Implements four detection phases adapted from the ec.balance shell
command:
1. Duplicate shard detection and removal proposals
2. Cross-rack shard distribution balancing
3. Within-rack node-level shard balancing
4. Global shard count equalization across nodes

Detection is side-effect-free: it builds an EC topology view from
ActiveTopology and generates move proposals without executing them.

* feat(ec_balance): add EC shard move task execution

Implements the shard move sequence using the same VolumeEcShardsCopy,
VolumeEcShardsMount, VolumeEcShardsUnmount, and VolumeEcShardsDelete
RPCs as the shell ec.balance command. Supports both regular shard
moves and dedup-phase deletions (unmount+delete without copy).

* feat(ec_balance): add task registration and scheduling

Register EC balance task definition with auto-config update support.
Scheduling respects max concurrent limits and worker capabilities.

* feat(ec_balance): add plugin handler for EC shard balance

Implements the full plugin handler with detection, execution, admin
and worker config forms, proposal building, and decision trace
reporting. Supports collection/DC/disk type filtering, preferred tag
placement, and configurable detection intervals. Auto-registered via
init() with the handler registry.

* test(ec_balance): add tests for detection algorithm and plugin handler

Detection tests cover: duplicate shard detection, cross-rack imbalance,
within-rack imbalance, global rebalancing, topology building, collection
filtering, and edge cases. Handler tests cover: config derivation with
clamping, proposal building, protobuf encode/decode round-trip, fallback
parameter decoding, capability, and config policy round-trip.

* fix(ec_balance): address PR review feedback and fix CI test failure

- Update TestWorkerDefaultJobTypes to expect 6 handlers (was 5)
- Extract threshold constants (ecBalanceMinImbalanceThreshold, etc.)
  to eliminate magic numbers in Descriptor and config derivation
- Remove duplicate ShardIdsToUint32 helper (use erasure_coding package)
- Add bounds checks for int64→int/uint32 conversions to fix CodeQL
  integer conversion warnings

* fix(ec_balance): address code review findings

storage_impact.go:
- Add TaskTypeECBalance case returning shard-level reservation
  (ShardSlots: -1/+1) instead of falling through to default which
  incorrectly reserves a full volume slot on target.

detection.go:
- Use dc:rack composite key to avoid cross-DC rack name collisions.
  Only create rack entries after confirming node has matching disks.
- Add exceedsImbalanceThreshold check to cross-rack, within-rack,
  and global phases so trivial skews below the configured threshold
  are ignored. Dedup phase always runs since duplicates are errors.
- Reserve destination capacity after each planned move (decrement
  destNode.freeSlots, update rackShardCount/nodeShardCount) to
  prevent overbooking the same destination.
- Skip nodes with freeSlots <= 0 when selecting minNode in global
  balance to avoid proposing moves to full nodes.
- Include loop index and source/target node IDs in TaskID to
  guarantee uniqueness across moves with the same volumeID/shardID.

ec_balance_handler.go:
- Fail fast with error when shard_id is absent in fallback parameter
  decoding instead of silently defaulting to shard 0.

ec_balance_task.go:
- Delegate GetProgress() to BaseTask.GetProgress() so progress
  updates from ReportProgressWithStage are visible to callers.
- Add fail-fast guard rejecting multiple sources/targets until
  batch execution is implemented.

Findings verified but not changed (matches existing codebase pattern
in vacuum/balance/erasure_coding handlers):
- register.go globalTaskDef.Config race: same unsynchronized pattern
  in all 4 task packages.
- CreateTask using generated ID: same fmt.Sprintf pattern in all 4
  task packages.

* fix(ec_balance): harden parameter decoding, progress tracking, and validation

ec_balance_handler.go (decodeECBalanceTaskParams):
- Validate execution-critical fields (Sources[0].Node, ShardIds,
  Targets[0].Node, ShardIds) after protobuf deserialization.
- Require source_disk_id and target_disk_id in legacy fallback path
  so Targets[0].DiskId is populated for VolumeEcShardsCopyRequest.
- All error messages reference decodeECBalanceTaskParams and the
  specific missing field (TaskParams, shard_id, Targets[0].DiskId,
  EcBalanceTaskParams) for debuggability.

ec_balance_task.go:
- Track progress in ECBalanceTask.progress field, updated via
  reportProgress() helper called before ReportProgressWithStage(),
  so GetProgress() returns real stage progress instead of stale 0.
- Validate: require exactly 1 source and 1 target (mirrors Execute
  guard), require ShardIds on both, with error messages referencing
  ECBalanceTask.Validate and the specific field.

* fix(ec_balance): fix dedup execution path, stale topology, collection filter, timeout, and dedupeKey

detection.go:
- Dedup moves now set target=source so isDedupPhase() triggers the
  unmount+delete-only execution path instead of attempting a copy.
- Apply moves to in-memory topology between phases via
  applyMovesToTopology() so subsequent phases see updated shard
  placement and don't conflict with already-planned moves.
- detectGlobalImbalance now accepts allowedVids and filters both
  shard counting and shard selection to respect CollectionFilter.

ec_balance_task.go:
- Apply EcBalanceTaskParams.TimeoutSeconds to the context via
  context.WithTimeout so all RPC operations respect the configured
  timeout instead of hanging indefinitely.

ec_balance_handler.go:
- Include source node ID in dedupeKey so dedup deletions from
  different source nodes for the same shard aren't collapsed.
- Clamp minServerCountRaw and minIntervalRaw lower bounds on int64
  before narrowing to int, preventing undefined overflow on 32-bit.

* fix(ec_balance): log warning before cancelling on progress send failure

Log the error, job ID, job type, progress percentage, and stage
before calling execCancel() in the progress callback so failed
progress sends are diagnosable instead of silently cancelling.
2026-03-14 21:34:53 -07:00
Chris Lu
baae672b6f feat: auto-disable master vacuum when plugin worker is active (#8624)
* feat: auto-disable master vacuum when plugin vacuum worker is active

When a vacuum-capable plugin worker connects to the admin server, the
admin server calls DisableVacuum on the master to prevent the automatic
scheduled vacuum from conflicting with the plugin worker's vacuum. When
the worker disconnects, EnableVacuum is called to restore the default
behavior. A safety net in the topology refresh loop re-enables vacuum
if the admin server disconnects without cleanup.

* rename isAdminServerConnected to isAdminServerConnectedFunc

* add 5s timeout to DisableVacuum/EnableVacuum gRPC calls

Prevents the monitor goroutine from blocking indefinitely if the
master is unresponsive.

* track plugin ownership of vacuum disable to avoid overriding operator

- Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum
  is called while admin server is connected (i.e., by plugin monitor)
- Safety net only re-enables vacuum when it was disabled by plugin,
  not when an operator intentionally disabled it via shell command
- EnableVacuum clears the plugin flag

* extract syncVacuumState for testability, add fake toggler tests

Extract the single sync step into syncVacuumState() with a
vacuumToggler interface. Add TestSyncVacuumState with a fake
toggler that verifies disable/enable calls on state transitions.

* use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin

Both fields are written by gRPC handlers and read by the vacuum
goroutine, causing a data race. Use atomic.Bool with Store/Load
for thread-safe access.

* use explicit by_plugin field instead of connection heuristic

Add by_plugin bool to DisableVacuumRequest proto so the caller
declares intent explicitly. The admin server monitor sets it to
true; shell commands leave it false. This prevents an operator's
intentional disable from being auto-reversed by the safety net.

* use setter for admin server callback instead of function parameter

Move isAdminServerConnected from StartRefreshWritableVolumes
parameter to Topology.SetAdminServerConnectedFunc() setter.
Keeps the function signature stable and decouples the topology
layer from the admin server concept.

* suppress repeated log messages on persistent sync failures

Add retrying parameter to syncVacuumState so the initial
state transition is logged at V(0) but subsequent retries
of the same transition are silent until the call succeeds.

* clear plugin ownership flag on manual DisableVacuum

Prevents stale plugin flag from causing incorrect auto-enable
when an operator manually disables vacuum after a plugin had
previously disabled it.

* add by_plugin to EnableVacuumRequest for symmetric ownership tracking

Plugin-driven EnableVacuum now only re-enables if the plugin was
the one that disabled it. If an operator manually disabled vacuum
after the plugin, the plugin's EnableVacuum is a no-op. This
prevents the plugin monitor from overriding operator intent on
worker disconnect.

* use cancellable context for monitorVacuumWorker goroutine

Replace context.Background() with a cancellable context stored
as bgCancel on AdminServer. Shutdown() calls bgCancel() so
monitorVacuumWorker exits cleanly via ctx.Done().

* track operator and plugin vacuum disables independently

Replace single isDisableVacuum flag with two independent flags:
vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller
only flips its own flag. The effective disabled state is the OR
of both. This prevents a plugin connect/disconnect cycle from
overriding an operator's manual disable, and vice versa.

* fix safety net to clear plugin flag, not operator flag

The safety net should call EnableVacuumByPlugin() to clear only
the plugin disable flag when the admin server disconnects. The
previous call to EnableVacuum() incorrectly cleared the operator
flag instead.
2026-03-13 22:49:12 -07:00
Chris Lu
f3c5ba3cd6 feat(filer): add lazy directory listing for remote mounts (#8615)
* feat(filer): add lazy directory listing for remote mounts

Directory listings on remote mounts previously only queried the local
filer store. With lazy mounts the listing was empty; with eager mounts
it went stale over time.

Add on-demand directory listing that fetches from remote and caches
results with a 5-minute TTL:

- Add `ListDirectory` to `RemoteStorageClient` interface (delimiter-based,
  single-level listing, separate from recursive `Traverse`)
- Implement in S3, GCS, and Azure backends using each platform's
  hierarchical listing API
- Add `maybeLazyListFromRemote` to filer: before each directory listing,
  check if the directory is under a remote mount with an expired cache,
  fetch from remote, persist entries to the local store, then let existing
  listing logic run on the populated store
- Use singleflight to deduplicate concurrent requests for the same directory
- Skip local-only entries (no RemoteEntry) to avoid overwriting unsynced uploads
- Errors are logged and swallowed (availability over consistency)

* refactor: extract xattr key to constant xattrRemoteListingSyncedAt

* feat: make listing cache TTL configurable per mount via listing_cache_ttl_seconds

Add listing_cache_ttl_seconds field to RemoteStorageLocation protobuf.
When 0 (default), lazy directory listing is disabled for that mount.
When >0, enables on-demand directory listing with the specified TTL.

Expose as -listingCacheTTL flag on remote.mount command.

* refactor: address review feedback for lazy directory listing

- Add context.Context to ListDirectory interface and all implementations
- Capture startTime before remote call for accurate TTL tracking
- Simplify S3 ListDirectory using ListObjectsV2PagesWithContext
- Make maybeLazyListFromRemote return void (errors always swallowed)
- Remove redundant trailing-slash path manipulation in caller
- Update tests to match new signatures

* When an existing entry has Remote != nil, we should merge remote metadata   into it rather than replacing it.

* fix(gcs): wrap ListDirectory iterator error with context

The raw iterator error was returned without bucket/path context,
making it harder to debug. Wrap it consistently with the S3 pattern.

* fix(s3): guard against nil pointer dereference in Traverse and ListDirectory

Some S3-compatible backends may return nil for LastModified, Size, or
ETag fields. Check for nil before dereferencing to prevent panics.

* fix(filer): remove blanket 2-minute timeout from lazy listing context

Individual SDK operations (S3, GCS, Azure) already have per-request
timeouts and retry policies. The blanket timeout could cut off large
directory listings mid-operation even though individual pages were
succeeding.

* fix(filer): preserve trace context in lazy listing with WithoutCancel

Use context.WithoutCancel(ctx) instead of context.Background() so
trace/span values from the incoming request are retained for
distributed tracing, while still decoupling cancellation.

* fix(filer): use Store.FindEntry for internal lookups, add Uid/Gid to files, fix updateDirectoryListingSyncedAt

- Use f.Store.FindEntry instead of f.FindEntry for staleness check and
  child lookups to avoid unnecessary lazy-fetch overhead
- Set OS_UID/OS_GID on new file entries for consistency with directories
- In updateDirectoryListingSyncedAt, use Store.UpdateEntry for existing
  directories instead of CreateEntry to avoid deleteChunksIfNotNew and
  NotifyUpdateEvent side effects

* fix(filer): distinguish not-found from store errors in lazy listing

Previously, any error from Store.FindEntry was treated as "not found,"
which could cause entry recreation/overwrite on transient DB failures.
Now check for filer_pb.ErrNotFound explicitly and skip entries or
bail out on real store errors.

* refactor(filer): use errors.Is for ErrNotFound comparisons
2026-03-13 09:36:54 -07:00
Chris Lu
5f85bf5e8a Batch volume balance: run multiple moves per job (#8561)
* proto: add BalanceMoveSpec and batch fields to BalanceTaskParams

Add BalanceMoveSpec message for encoding individual volume moves,
and max_concurrent_moves + repeated moves fields to BalanceTaskParams
to support batching multiple volume moves in a single job.

* balance handler: add batch execution with concurrent volume moves

Refactor Execute() into executeSingleMove() (backward compatible) and
executeBatchMoves() which runs multiple volume moves concurrently using
a semaphore-bounded goroutine pool. When BalanceTaskParams.Moves is
populated, the batch path is taken; otherwise the single-move path.

Includes aggregate progress reporting across concurrent moves,
per-move error collection, and partial failure support.

* balance handler: add batch config fields to Descriptor and worker config

Add max_concurrent_moves and batch_size fields to the worker config
form and deriveBalanceWorkerConfig(). These control how many volume
moves run concurrently within a batch job and the maximum batch size.

* balance handler: group detection proposals into batch jobs

When batch_size > 1, the Detect method groups detection results into
batch proposals where each proposal encodes multiple BalanceMoveSpec
entries in BalanceTaskParams.Moves. Single-result batches fall back
to the existing single-move proposal format for backward compatibility.

* admin UI: add volume balance execution plan and batch badge

Add renderBalanceExecutionPlan() for rich rendering of volume balance
jobs in the job detail modal. Single-move jobs show source/target/volume
info; batch jobs show a moves table with all volume moves.

Add batch badge (e.g., "5 moves") next to job type in the execution
jobs table when the job has batch=true label.

* Update plugin_templ.go

* fix: detection algorithm uses greedy target instead of divergent topology scores

The detection loop tracked effective volume counts via an adjustments map,
but createBalanceTask independently called planBalanceDestination which used
the topology's LoadCount — a separate, unadjusted source of truth. This
divergence caused multiple moves to pile onto the same server.

Changes:
- Add resolveBalanceDestination to resolve the detection loop's greedy
  target (minServer) rather than independently picking a destination
- Add oscillation guard: stop when max-min <= 1 since no single move
  can improve the balance beyond that point
- Track unseeded destinations: if a target server wasn't in the initial
  serverVolumeCounts, add it so subsequent iterations include it
- Add TestDetection_UnseededDestinationDoesNotOverload

* fix: handler force_move propagation, partial failure, deterministic dedupe

- Propagate ForceMove from outer BalanceTaskParams to individual move
  TaskParams so batch moves respect the force_move flag
- Fix partial failure: mark job successful if at least one move
  succeeded (succeeded > 0 || failed == 0) to avoid re-running
  already-completed moves on retry
- Use SHA-256 hash for deterministic dedupe key fallback instead of
  time.Now().UnixNano() which is non-deterministic
- Remove unused successDetails variable
- Extract maxProposalStringLength constant to replace magic number 200

* admin UI: use template literals in balance execution plan rendering

* fix: integration test handles batch proposals from batched detection

With batch_size=20, all moves are grouped into a single proposal
containing BalanceParams.Moves instead of top-level Sources/Targets.
Update assertions to handle both batch and single-move proposal formats.

* fix: verify volume size on target before deleting source during balance

Add a pre-delete safety check that reads the volume file status on both
source and target, then compares .dat file size and file count. If they
don't match, the move is aborted — leaving the source intact rather than
risking irreversible data loss.

Also removes the redundant mountVolume call since VolumeCopy already
mounts the volume on the target server.

* fix: clamp maxConcurrent, serialize progress sends, validate config as int64

- Clamp maxConcurrentMoves to defaultMaxConcurrentMoves before creating
  the semaphore so a stale or malicious job cannot request unbounded
  concurrent volume moves
- Extend progressMu to cover sender.SendProgress calls since the
  underlying gRPC stream is not safe for concurrent writes
- Perform bounds checks on max_concurrent_moves and batch_size in int64
  space before casting to int, avoiding potential overflow on 32-bit

* fix: check disk capacity in resolveBalanceDestination

Skip disks where VolumeCount >= MaxVolumeCount so the detection loop
does not propose moves to a full disk that would fail at execution time.

* test: rename unseeded destination test to match actual behavior

The test exercises a server with 0 volumes that IS seeded from topology
(matching disk type), not an unseeded destination. Rename to
TestDetection_ZeroVolumeServerIncludedInBalance and fix comments.

* test: tighten integration test to assert exactly one batch proposal

With default batch_size=20, all moves should be grouped into a single
batch proposal. Assert len(proposals)==1 and require BalanceParams with
Moves, removing the legacy single-move else branch.

* fix: propagate ctx to RPCs and restore source writability on abort

- All helper methods (markVolumeReadonly, copyVolume, tailVolume,
  readVolumeFileStatus, deleteVolume) now accept a context parameter
  instead of using context.Background(), so Execute's ctx propagates
  cancellation and timeouts into every volume server RPC
- Add deferred cleanup that restores the source volume to writable if
  any step after markVolumeReadonly fails, preventing the source from
  being left permanently readonly on abort
- Add markVolumeWritable helper using VolumeMarkWritableRequest

* fix: deep-copy protobuf messages in test recording sender

Use proto.Clone in recordingExecutionSender to store immutable snapshots
of JobProgressUpdate and JobCompleted, preventing assertions from
observing mutations if the handler reuses message pointers.

* fix: add VolumeMarkWritable and ReadVolumeFileStatus to fake volume server

The balance task now calls ReadVolumeFileStatus for pre-delete
verification and VolumeMarkWritable to restore writability on abort.
Add both RPCs to the test fake, and drop the mountCalls assertion since
BalanceTask no longer calls VolumeMount directly (VolumeCopy handles it).

* fix: use maxConcurrentMovesLimit (50) for clamp, not defaultMaxConcurrentMoves

defaultMaxConcurrentMoves (5) is the fallback when the field is unset,
not an upper bound. Clamping to it silently overrides valid config
values like 10/20/50. Introduce maxConcurrentMovesLimit (50) matching
the descriptor's MaxValue and clamp to that instead.

* fix: cancel batch moves on progress stream failure

Derive a cancellable batchCtx from the caller's ctx. If
sender.SendProgress returns an error (client disconnect, context
cancelled), capture it, skip further sends, and cancel batchCtx so
in-flight moves abort via their propagated context rather than running
blind to completion.

* fix: bound cleanup timeout and validate batch move fields

- Use a 30-second timeout for the deferred markVolumeWritable cleanup
  instead of context.Background() which can block indefinitely if the
  volume server is unreachable
- Validate required fields (VolumeID, SourceNode, TargetNode) before
  appending moves to a batch proposal, skipping invalid entries
- Fall back to a single-move proposal when filtering leaves only one
  valid move in a batch

* fix: cancel task execution on SendProgress stream failure

All handler progress callbacks previously ignored SendProgress errors,
allowing tasks to continue executing after the client disconnected.
Now each handler creates a derived cancellable context and cancels it
on the first SendProgress error, stopping the in-flight task promptly.

Handlers fixed: erasure_coding, vacuum, volume_balance (single-move),
and admin_script (breaks command loop on send failure).

* fix: validate batch moves before scheduling in executeBatchMoves

Reject empty batches, enforce a hard upper bound (100 moves), and
filter out nil or incomplete move specs (missing source/target/volume)
before allocating progress tracking and launching goroutines.

* test: add batch balance execution integration test

Tests the batch move path with 3 volumes, max concurrency 2, using
fake volume servers. Verifies all moves complete with correct readonly,
copy, tail, and delete RPC counts.

* test: add MarkWritableCount and ReadFileStatusCount accessors

Expose the markWritableCalls and readFileStatusCalls counters on the
fake volume server, following the existing MarkReadonlyCount pattern.

* fix: oscillation guard uses global effective counts for heterogeneous capacity

The oscillation guard (max-min <= 1) previously used maxServer/minServer
which are determined by utilization ratio. With heterogeneous capacity,
maxServer by utilization can have fewer raw volumes than minServer,
producing a negative diff and incorrectly triggering the guard.

Now scans all servers' effective counts to find the true global max/min
volume counts, so the guard works correctly regardless of whether
utilization-based or raw-count balancing is used.

* fix: admin script handler breaks outer loop on SendProgress failure

The break on SendProgress error inside the shell.Commands scan only
exited the inner loop, letting the outer command loop continue
executing commands on a broken stream. Use a sendBroken flag to
propagate the break to the outer execCommands loop.
2026-03-09 19:30:08 -07:00
Chris Lu
992db11d2b iam: add IAM group management (#8560)
* iam: add Group message to protobuf schema

Add Group message (name, members, policy_names, disabled) and
add groups field to S3ApiConfiguration for IAM group management
support (issue #7742).

* iam: add group CRUD to CredentialStore interface and all backends

Add group management methods (CreateGroup, GetGroup, DeleteGroup,
ListGroups, UpdateGroup) to the CredentialStore interface with
implementations for memory, filer_etc, postgres, and grpc stores.
Wire group loading/saving into filer_etc LoadConfiguration and
SaveConfiguration.

* iam: add group IAM response types

Add XML response types for group management IAM actions:
CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser.

* iam: add group management handlers to embedded IAM API

Add CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, and ListGroupsForUser handlers with
dispatch in ExecuteAction.

* iam: add group management handlers to standalone IAM API

Add group handlers (CreateGroup, DeleteGroup, GetGroup, ListGroups,
AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser) and wire into DoActions
dispatch. Also add helper functions for user/policy side effects.

* iam: integrate group policies into authorization

Add groups and userGroups reverse index to IdentityAccessManagement.
Populate both maps during ReplaceS3ApiConfiguration and
MergeS3ApiConfiguration. Modify evaluateIAMPolicies to evaluate
policies from user's enabled groups in addition to user policies.
Update VerifyActionPermission to consider group policies when
checking hasAttachedPolicies.

* iam: add group side effects on user deletion and rename

When a user is deleted, remove them from all groups they belong to.
When a user is renamed, update group membership references. Applied
to both embedded and standalone IAM handlers.

* iam: watch /etc/iam/groups directory for config changes

Add groups directory to the filer subscription watcher so group
file changes trigger IAM configuration reloads.

* admin: add group management page to admin UI

Add groups page with CRUD operations, member management, policy
attachment, and enable/disable toggle. Register routes in admin
handlers and add Groups entry to sidebar navigation.

* test: add IAM group management integration tests

Add comprehensive integration tests for group CRUD, membership,
policy attachment, policy enforcement, disabled group behavior,
user deletion side effects, and multi-group membership. Add
"group" test type to CI matrix in s3-iam-tests workflow.

* iam: address PR review comments for group management

- Fix XSS vulnerability in groups.templ: replace innerHTML string
  concatenation with DOM APIs (createElement/textContent) for rendering
  member and policy lists
- Use userGroups reverse index in embedded IAM ListGroupsForUser for
  O(1) lookup instead of iterating all groups
- Add buildUserGroupsIndex helper in standalone IAM handlers; use it
  in ListGroupsForUser and removeUserFromAllGroups for efficient lookup
- Add note about gRPC store load-modify-save race condition limitation

* iam: add defensive copies, validation, and XSS fixes for group management

- Memory store: clone groups on store/retrieve to prevent mutation
- Admin dash: deep copy groups before mutation, validate user/policy exists
- HTTP handlers: translate credential errors to proper HTTP status codes,
  use *bool for Enabled field to distinguish missing vs false
- Groups templ: use data attributes + event delegation instead of inline
  onclick for XSS safety, prevent stale async responses

* iam: add explicit group methods to PropagatingCredentialStore

Add CreateGroup, GetGroup, DeleteGroup, ListGroups, and UpdateGroup
methods instead of relying on embedded interface fallthrough. Group
changes propagate via filer subscription so no RPC propagation needed.

* iam: detect postgres unique constraint violation and add groups index

Return ErrGroupAlreadyExists when INSERT hits SQLState 23505 instead of
a generic error. Add index on groups(disabled) for filtered queries.

* iam: add Marker field to group list response types

Add Marker string field to GetGroupResult, ListGroupsResult,
ListAttachedGroupPoliciesResult, and ListGroupsForUserResult to
match AWS IAM pagination response format.

* iam: check group attachment before policy deletion

Reject DeletePolicy if the policy is attached to any group, matching
AWS IAM behavior. Add PolicyArn to ListAttachedGroupPolicies response.

* iam: include group policies in IAM authorization

Merge policy names from user's enabled groups into the IAMIdentity
used for authorization, so group-attached policies are evaluated
alongside user-attached policies.

* iam: check for name collision before renaming user in UpdateUser

Scan identities and inline policies for newUserName before mutating,
returning EntityAlreadyExists if a collision is found. Reuse the
already-loaded policies instead of loading them again inside the loop.

* test: use t.Cleanup for bucket cleanup in group policy test

* iam: wrap ErrUserNotInGroup sentinel in RemoveGroupMember error

Wrap credential.ErrUserNotInGroup so errors.Is works in
groupErrorToHTTPStatus, returning proper 400 instead of 500.

* admin: regenerate groups_templ.go with XSS-safe data attributes

Regenerated from groups.templ which uses data-group-name attributes
instead of inline onclick with string interpolation.

* iam: add input validation and persist groups during migration

- Validate nil/empty group name in CreateGroup and UpdateGroup
- Save groups in migrateToMultiFile so they survive legacy migration

* admin: use groupErrorToHTTPStatus in GetGroupMembers and GetGroupPolicies

* iam: short-circuit UpdateUser when newUserName equals current name

* iam: require empty PolicyNames before group deletion

Reject DeleteGroup when group has attached policies, matching the
existing members check. Also fix GetGroup error handling in
DeletePolicy to only skip ErrGroupNotFound, not all errors.

* ci: add weed/pb/** to S3 IAM test trigger paths

* test: replace time.Sleep with require.Eventually for propagation waits

Use polling with timeout instead of fixed sleeps to reduce flakiness
in integration tests waiting for IAM policy propagation.

* fix: use credentialManager.GetPolicy for AttachGroupPolicy validation

Policies created via CreatePolicy through credentialManager are stored
in the credential store, not in s3cfg.Policies (which only has static
config policies). Change AttachGroupPolicy to use credentialManager.GetPolicy()
for policy existence validation.

* feat: add UpdateGroup handler to embedded IAM API

Add UpdateGroup action to enable/disable groups and rename groups
via the IAM API. This is a SeaweedFS extension (not in AWS SDK) used
by tests to toggle group disabled status.

* fix: authenticate raw IAM API calls in group tests

The embedded IAM endpoint rejects anonymous requests. Replace
callIAMAPI with callIAMAPIAuthenticated that uses JWT bearer token
authentication via the test framework.

* feat: add UpdateGroup handler to standalone IAM API

Mirror the embedded IAM UpdateGroup handler in the standalone IAM API
for parity.

* fix: add omitempty to Marker XML tags in group responses

Non-truncated responses should not emit an empty <Marker/> element.

* fix: distinguish backend errors from missing policies in AttachGroupPolicy

Return ServiceFailure for credential manager errors instead of masking
them as NoSuchEntity. Also switch ListGroupsForUser to use s3cfg.Groups
instead of in-memory reverse index to avoid stale data. Add duplicate
name check to UpdateGroup rename.

* fix: standalone IAM AttachGroupPolicy uses persisted policy store

Check managed policies from GetPolicies() instead of s3cfg.Policies
so dynamically created policies are found. Also add duplicate name
check to UpdateGroup rename.

* fix: rollback inline policies on UpdateUser PutPolicies failure

If PutPolicies fails after moving inline policies to the new username,
restore both the identity name and the inline policies map to their
original state to avoid a partial-write window.

* fix: correct test cleanup ordering for group tests

Replace scattered defers with single ordered t.Cleanup in each test
to ensure resources are torn down in reverse-creation order:
remove membership, detach policies, delete access keys, delete users,
delete groups, delete policies. Move bucket cleanup to parent test
scope and delete objects before bucket.

* fix: move identity nil check before map lookup and refine hasAttachedPolicies

Move the nil check on identity before accessing identity.Name to
prevent panic. Also refine hasAttachedPolicies to only consider groups
that are enabled and have actual policies attached, so membership in
a no-policy group doesn't incorrectly trigger IAM authorization.

* fix: fail group reload on unreadable or corrupt group files

Return errors instead of logging and continuing when group files
cannot be read or unmarshaled. This prevents silently applying a
partial IAM config with missing group memberships or policies.

* fix: use errors.Is for sql.ErrNoRows comparison in postgres group store

* docs: explain why group methods skip propagateChange

Group changes propagate to S3 servers via filer subscription
(watching /etc/iam/groups/) rather than gRPC RPCs, since there
are no group-specific RPCs in the S3 cache protocol.

* fix: remove unused policyNameFromArn and strings import

* fix: update service account ParentUser on user rename

When renaming a user via UpdateUser, also update ParentUser references
in service accounts to prevent them from becoming orphaned after the
next configuration reload.

* fix: wrap DetachGroupPolicy error with ErrPolicyNotAttached sentinel

Use credential.ErrPolicyNotAttached so groupErrorToHTTPStatus maps
it to 400 instead of falling back to 500.

* fix: use admin S3 client for bucket cleanup in enforcement test

The user S3 client may lack permissions by cleanup time since the
user is removed from the group in an earlier subtest. Use the admin
S3 client to ensure bucket and object cleanup always succeeds.

* fix: add nil guard for group param in propagating store log calls

Prevent potential nil dereference when logging group.Name in
CreateGroup and UpdateGroup of PropagatingCredentialStore.

* fix: validate Disabled field in UpdateGroup handlers

Reject values other than "true" or "false" with InvalidInputException
instead of silently treating them as false.

* fix: seed mergedGroups from existing groups in MergeS3ApiConfiguration

Previously the merge started with empty group maps, dropping any
static-file groups. Now seeds from existing iam.groups before
overlaying dynamic config, and builds the reverse index after
merging to avoid stale entries from overridden groups.

* fix: use errors.Is for filer_pb.ErrNotFound comparison in group loading

Replace direct equality (==) with errors.Is() to correctly match
wrapped errors, consistent with the rest of the codebase.

* fix: add ErrUserNotFound and ErrPolicyNotFound to groupErrorToHTTPStatus

Map these sentinel errors to 404 so AddGroupMember and
AttachGroupPolicy return proper HTTP status codes.

* fix: log cleanup errors in group integration tests

Replace fire-and-forget cleanup calls with error-checked versions
that log failures via t.Logf for debugging visibility.

* fix: prevent duplicate group test runs in CI matrix

The basic lane's -run "TestIAM" regex also matched TestIAMGroup*
tests, causing them to run in both the basic and group lanes.
Replace with explicit test function names.

* fix: add GIN index on groups.members JSONB for membership lookups

Without this index, ListGroupsForUser and membership queries
require full table scans on the groups table.

* fix: handle cross-directory moves in IAM config subscription

When a file is moved out of an IAM directory (e.g., /etc/iam/groups),
the dir variable was overwritten with NewParentPath, causing the
source directory change to be missed. Now also notifies handlers
about the source directory for cross-directory moves.

* fix: validate members/policies before deleting group in admin handler

AdminServer.DeleteGroup now checks for attached members and policies
before delegating to credentialManager, matching the IAM handler guards.

* fix: merge groups by name instead of blind append during filer load

Match the identity loader's merge behavior: find existing group
by name and replace, only append when no match exists. Prevents
duplicates when legacy and multi-file configs overlap.

* fix: check DeleteEntry response error when cleaning obsolete group files

Capture and log resp.Error from filer DeleteEntry calls during
group file cleanup, matching the pattern used in deleteGroupFile.

* fix: verify source user exists before no-op check in UpdateUser

Reorder UpdateUser to find the source identity first and return
NoSuchEntityException if not found, before checking if the rename
is a no-op. Previously a non-existent user renamed to itself
would incorrectly return success.

* fix: update service account parent refs on user rename in embedded IAM

The embedded IAM UpdateUser handler updated group membership but
not service account ParentUser fields, unlike the standalone handler.

* fix: replay source-side events for all handlers on cross-dir moves

Pass nil newEntry to bucket, IAM, and circuit-breaker handlers for
the source directory during cross-directory moves, so all watchers
can clear caches for the moved-away resource.

* fix: don't seed mergedGroups from existing iam.groups in merge

Groups are always dynamic (from filer), never static (from s3.config).
Seeding from iam.groups caused stale deleted groups to persist.
Now only uses config.Groups from the dynamic filer config.

* fix: add deferred user cleanup in TestIAMGroupUserDeletionSideEffect

Register t.Cleanup for the created user so it gets cleaned up
even if the test fails before the inline DeleteUser call.

* fix: assert UpdateGroup HTTP status in disabled group tests

Add require.Equal checks for 200 status after UpdateGroup calls
so the test fails immediately on API errors rather than relying
on the subsequent Eventually timeout.

* fix: trim whitespace from group name in filer store operations

Trim leading/trailing whitespace from group.Name before validation
in CreateGroup and UpdateGroup to prevent whitespace-only filenames.
Also merge groups by name during multi-file load to prevent duplicates.

* fix: add nil/empty group validation in gRPC store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics and invalid persistence.

* fix: add nil/empty group validation in postgres store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil member access and empty-name row inserts.

* fix: add name collision check in embedded IAM UpdateUser

The embedded IAM handler renamed users without checking if the
target name already existed, unlike the standalone handler.

* fix: add ErrGroupNotEmpty sentinel and map to HTTP 409

AdminServer.DeleteGroup now wraps conflict errors with
ErrGroupNotEmpty, and groupErrorToHTTPStatus maps it to
409 Conflict instead of 500.

* fix: use appropriate error message in GetGroupDetails based on status

Return "Group not found" only for 404, use "Failed to retrieve group"
for other error statuses instead of always saying "Group not found".

* fix: use backend-normalized group.Name in CreateGroup response

After credentialManager.CreateGroup may normalize the name (e.g.,
trim whitespace), use group.Name instead of the raw input for
the returned GroupData to ensure consistency.

* fix: add nil/empty group validation in memory store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil pointer dereference on map access.

* fix: reorder embedded IAM UpdateUser to verify source first

Find the source identity before checking for collisions, matching
the standalone handler's logic. Previously a non-existent user
renamed to an existing name would get EntityAlreadyExists instead
of NoSuchEntity.

* fix: handle same-directory renames in metadata subscription

Replay a delete event for the old entry name during same-directory
renames so handlers like onBucketMetadataChange can clean up stale
state for the old name.

* fix: abort GetGroups on non-ErrGroupNotFound errors

Only skip groups that return ErrGroupNotFound. Other errors (e.g.,
transient backend failures) now abort the handler and return the
error to the caller instead of silently producing partial results.

* fix: add aria-label and title to icon-only group action buttons

Add accessible labels to View and Delete buttons so screen readers
and tooltips provide meaningful context.

* fix: validate group name in saveGroup to prevent invalid filenames

Trim whitespace and reject empty names before writing group JSON
files, preventing creation of files like ".json".

* fix: add /etc/iam/groups to filer subscription watched directories

The groups directory was missing from the watched directories list,
so S3 servers in a cluster would not detect group changes made by
other servers via filer. The onIamConfigChange handler already had
code to handle group directory changes but it was never triggered.

* add direct gRPC propagation for group changes to S3 servers

Groups now have the same dual propagation as identities and policies:
direct gRPC push via propagateChange + async filer subscription.

- Add PutGroup/RemoveGroup proto messages and RPCs
- Add PutGroup/RemoveGroup in-memory cache methods on IAM
- Add PutGroup/RemoveGroup gRPC server handlers
- Update PropagatingCredentialStore to call propagateChange on group mutations

* reduce log verbosity for config load summary

Change ReplaceS3ApiConfiguration log from Infof to V(1).Infof
to avoid noisy output on every config reload.

* admin: show user groups in view and edit user modals

- Add Groups field to UserDetails and populate from credential manager
- Show groups as badges in user details view modal
- Add group management to edit user modal: display current groups,
  add to group via dropdown, remove from group via badge x button

* fix: remove duplicate showAlert that broke modal-alerts.js

admin.js defined showAlert(type, message) which overwrote the
modal-alerts.js version showAlert(message, type), causing broken
unstyled alert boxes. Remove the duplicate and swap all callers
in admin.js to use the correct (message, type) argument order.

* fix: unwrap groups API response in edit user modal

The /api/groups endpoint returns {"groups": [...]}, not a bare array.

* Update object_store_users_templ.go

* test: assert AccessDenied error code in group denial tests

Replace plain assert.Error checks with awserr.Error type assertion
and AccessDenied code verification, matching the pattern used in
other IAM integration tests.

* fix: propagate GetGroups errors in ShowGroups handler

getGroupsPageData was swallowing errors and returning an empty page
with 200 status. Now returns the error so ShowGroups can respond
with a proper error status.

* fix: reject AttachGroupPolicy when credential manager is nil

Previously skipped policy existence validation when credentialManager
was nil, allowing attachment of nonexistent policies. Now returns
a ServiceFailureException error.

* fix: preserve groups during partial MergeS3ApiConfiguration updates

UpsertIdentity calls MergeS3ApiConfiguration with a partial config
containing only the updated identity (nil Groups). This was wiping
all in-memory group state. Now only replaces groups when
config.Groups is non-nil (full config reload).

* fix: propagate errors from group lookup in GetObjectStoreUserDetails

ListGroups and GetGroup errors were silently ignored, potentially
showing incomplete group data in the UI.

* fix: use DOM APIs for group badge remove button to prevent XSS

Replace innerHTML with onclick string interpolation with DOM
createElement + addEventListener pattern. Also add aria-label
and title to the add-to-group button.

* fix: snapshot group policies under RLock to prevent concurrent map access

evaluateIAMPolicies was copying the map reference via groupMap :=
iam.groups under RLock then iterating after RUnlock, while PutGroup
mutates the map in-place. Now copies the needed policy names into
a slice while holding the lock.

* fix: add nil IAM check to PutGroup and RemoveGroup gRPC handlers

Match the nil guard pattern used by PutPolicy/DeletePolicy to
prevent nil pointer dereference when IAM is not initialized.
2026-03-09 11:54:32 -07:00
Chris Lu
3f946fc0c0 mount: make metadata cache rebuilds snapshot-consistent (#8531)
* filer: expose metadata events and list snapshots

* mount: invalidate hot directory caches

* mount: read hot directories directly from filer

* mount: add sequenced metadata cache applier

* mount: apply metadata responses through cache applier

* mount: replay snapshot-consistent directory builds

* mount: dedupe self metadata events

* mount: factor directory build cleanup

* mount: replace proto marshal dedup with composite key and ring buffer

The dedup logic was doing a full deterministic proto.Marshal on every
metadata event just to produce a dedup key. Replace with a cheap
composite string key (TsNs|Directory|OldName|NewName).

Also replace the sliding-window slice (which leaked the backing array
unboundedly) with a fixed-size ring buffer that reuses the same array.

* filer: remove mutex and proto.Clone from request-scoped MetadataEventSink

MetadataEventSink is created per-request and only accessed by the
goroutine handling the gRPC call. The mutex and double proto.Clone
(once in Record, once in Last) were unnecessary overhead on every
filer write operation. Store the pointer directly instead.

* mount: skip proto.Clone for caller-owned metadata events

Add ApplyMetadataResponseOwned that takes ownership of the response
without cloning. Local metadata events (mkdir, create, flush, etc.)
are freshly constructed and never shared, so the clone is unnecessary.

* filer: only populate MetadataEvent on successful DeleteEntry

Avoid calling eventSink.Last() on error paths where the sink may
contain a partial event from an intermediate child deletion during
recursive deletes.

* mount: avoid map allocation in collectDirectoryNotifications

Replace the map with a fixed-size array and linear dedup. There are
at most 3 directories to notify (old parent, new parent, new child
if directory), so a 3-element array avoids the heap allocation on
every metadata event.

* mount: fix potential deadlock in enqueueApplyRequest

Release applyStateMu before the blocking channel send. Previously,
if the channel was full (cap 128), the send would block while holding
the mutex, preventing Shutdown from acquiring it to set applyClosed.

* mount: restore signature-based self-event filtering as fast path

Re-add the signature check that was removed when content-based dedup
was introduced. Checking signatures is O(1) on a small slice and
avoids enqueuing and processing events that originated from this
mount instance. The content-based dedup remains as a fallback.

* filer: send snapshotTsNs only in first ListEntries response

The snapshot timestamp is identical for every entry in a single
ListEntries stream. Sending it in every response message wastes
wire bandwidth for large directories. The client already reads
it only from the first response.

* mount: exit read-through mode after successful full directory listing

MarkDirectoryRefreshed was defined but never called, so directories
that entered read-through mode (hot invalidation threshold) stayed
there permanently, hitting the filer on every readdir even when cold.
Call it after a complete read-through listing finishes.

* mount: include event shape and full paths in dedup key

The previous dedup key only used Names, which could collapse distinct
rename targets. Include the event shape (C/D/U/R), source directory,
new parent path, and both entry names so structurally different events
are never treated as duplicates.

* mount: drain pending requests on shutdown in runApplyLoop

After receiving the shutdown sentinel, drain any remaining requests
from applyCh non-blockingly and signal each with errMetaCacheClosed
so callers waiting on req.done are released.

* mount: include IsDirectory in synthetic delete events

metadataDeleteEvent now accepts an isDirectory parameter so the
applier can distinguish directory deletes from file deletes. Rmdir
passes true, Unlink passes false.

* mount: fall back to synthetic event when MetadataEvent is nil

In mknod and mkdir, if the filer response omits MetadataEvent (e.g.
older filer without the field), synthesize an equivalent local
metadata event so the cache is always updated.

* mount: make Flush metadata apply best-effort after successful commit

After filer_pb.CreateEntryWithResponse succeeds, the entry is
persisted. Don't fail the Flush syscall if the local metadata cache
apply fails — log and invalidate the directory cache instead.
Also fall back to a synthetic event when MetadataEvent is nil.

* mount: make Rename metadata apply best-effort

The rename has already succeeded on the filer by the time we apply
the local metadata event. Log failures instead of returning errors
that would be dropped by the caller anyway.

* mount: make saveEntry metadata apply best-effort with fallback

After UpdateEntryWithResponse succeeds, treat local metadata apply
as non-fatal. Log and invalidate the directory cache on failure.
Also fall back to a synthetic event when MetadataEvent is nil.

* filer_pb: preserve snapshotTsNs on error in ReadDirAllEntriesWithSnapshot

Return the snapshot timestamp even when the first page fails, so
callers receive the snapshot boundary when partial data was received.

* filer: send snapshot token for empty directory listings

When no entries are streamed, send a final ListEntriesResponse with
only SnapshotTsNs so clients always receive the snapshot boundary.

* mount: distinguish not-found vs transient errors in lookupEntry

Return fuse.EIO for non-not-found filer errors instead of
unconditionally returning ENOENT, so transient failures don't
masquerade as missing entries.

* mount: make CacheRemoteObject metadata apply best-effort

The file content has already been cached successfully. Don't fail
the read if the local metadata cache update fails.

* mount: use consistent snapshot for readdir in direct mode

Capture the SnapshotTsNs from the first loadDirectoryEntriesDirect
call and store it on the DirectoryHandle. Subsequent batch loads
pass this stored timestamp so all batches use the same snapshot.

Also export DoSeaweedListWithSnapshot so mount can use it directly
with snapshot passthrough.

* filer_pb: fix test fake to send SnapshotTsNs only on first response

Match the server behavior: only the first ListEntriesResponse in a
page carries the snapshot timestamp, subsequent entries leave it zero.

* Fix nil pointer dereference in ListEntries stream consumers

Remove the empty-directory snapshot-only response from ListEntries
that sent a ListEntriesResponse with Entry==nil, which crashed every
raw stream consumer that assumed resp.Entry is always non-nil.

Also add defensive nil checks for resp.Entry in all raw ListEntries
stream consumers across: S3 listing, broker topic lookup, broker
topic config, admin dashboard, topic retention, hybrid message
scanner, Kafka integration, and consumer offset storage.

* Add nil guards for resp.Entry in remaining ListEntries stream consumers

Covers: S3 object lock check, MQ management dashboard (version/
partition/offset loops), and topic retention version loop.

* Make applyLocalMetadataEvent best-effort in Link and Symlink

The filer operations already succeeded; failing the syscall because
the local cache apply failed is wrong. Log a warning and invalidate
the parent directory cache instead.

* Make applyLocalMetadataEvent best-effort in Mkdir/Rmdir/Mknod/Unlink

The filer RPC already committed; don't fail the syscall when the
local metadata cache apply fails. Log a warning and invalidate the
parent directory cache to force a re-fetch on next access.

* flushFileMetadata: add nil-fallback for metadata event and best-effort apply

Synthesize a metadata event when resp.GetMetadataEvent() is nil
(matching doFlush), and make the apply best-effort with cache
invalidation on failure.

* Prevent double-invocation of cleanupBuild in doEnsureVisited

Add a cleanupDone guard so the deferred cleanup and inline error-path
cleanup don't both call DeleteFolderChildren/AbortDirectoryBuild.

* Fix comment: signature check is O(n) not O(1)

* Prevent deferred cleanup after successful CompleteDirectoryBuild

Set cleanupDone before returning from the success path so the
deferred context-cancellation check cannot undo a published build.

* Invalidate parent directory caches on rename metadata apply failure

When applyLocalMetadataEvent fails during rename, invalidate the
source and destination parent directory caches so subsequent accesses
trigger a re-fetch from the filer.

* Add event nil-fallback and cache invalidation to Link and Symlink

Synthesize metadata events when the server doesn't return one, and
invalidate parent directory caches on apply failure.

* Match requested partition when scanning partition directories

Parse the partition range format (NNNN-NNNN) and match against the
requested partition parameter instead of using the first directory.

* Preserve snapshot timestamp across empty directory listings

Initialize actualSnapshotTsNs from the caller-requested value so it
isn't lost when the server returns no entries. Re-add the server-side
snapshot-only response for empty directories (all raw stream consumers
now have nil guards for Entry).

* Fix CreateEntry error wrapping to support errors.Is/errors.As

Use errors.New + %w instead of %v for resp.Error so callers can
unwrap the underlying error.

* Fix object lock pagination: only advance on non-nil entries

Move entriesReceived inside the nil check so nil entries don't
cause repeated ListEntries calls with the same lastFileName.

* Guard Attributes nil check before accessing Mtime in MQ management

* Do not send nil-Entry response for empty directory listings

The snapshot-only ListEntriesResponse (with Entry == nil) for empty
directories breaks consumers that treat any received response as an
entry (Java FilerClient, S3 listing). The Go client-side
DoSeaweedListWithSnapshot already preserves the caller-requested
snapshot via actualSnapshotTsNs initialization, so the server-side
send is unnecessary.

* Fix review findings: subscriber dedup, invalidation normalization, nil guards, shutdown race

- Remove self-signature early-return in processEventFn so all events
  flow through the applier (directory-build buffering sees self-originated
  events that arrive after a snapshot)
- Normalize NewParentPath in collectEntryInvalidations to avoid duplicate
  invalidations when NewParentPath is empty (same-directory update)
- Guard resp.Entry.Attributes for nil in admin_server.go and
  topic_retention.go to prevent panics on entries without attributes
- Fix enqueueApplyRequest race with shutdown by using select on both
  applyCh and applyDone, preventing sends after the apply loop exits
- Add cleanupDone check to deferred cleanup in meta_cache_init.go for
  clarity alongside the existing guard in cleanupBuild
- Add empty directory test case for snapshot consistency

* Propagate authoritative metadata event from CacheRemoteObjectToLocalCluster and generate client-side snapshot for empty directories

- Add metadata_event field to CacheRemoteObjectToLocalClusterResponse
  proto so the filer-emitted event is available to callers
- Use WithMetadataEventSink in the server handler to capture the event
  from NotifyUpdateEvent and return it on the response
- Update filehandle_read.go to prefer the RPC's metadata event over
  a locally fabricated one, falling back to metadataUpdateEvent when
  the server doesn't provide one (e.g., older filers)
- Generate a client-side snapshot cutoff in DoSeaweedListWithSnapshot
  when the server sends no snapshot (empty directory), so callers like
  CompleteDirectoryBuild get a meaningful boundary for filtering
  buffered events

* Skip directory notifications for dirs being built to prevent mid-build cache wipe

When a metadata event is buffered during a directory build,
applyMetadataSideEffects was still firing noteDirectoryUpdate for the
building directory. If the directory accumulated enough updates to
become "hot", markDirectoryReadThrough would call DeleteFolderChildren,
wiping entries that EnsureVisited had already inserted. The build would
then complete and mark the directory cached with incomplete data.

Fix by using applyMetadataSideEffectsSkippingBuildingDirs for buffered
events, which suppresses directory notifications for dirs currently in
buildingDirs while still applying entry invalidations.

* Add test for directory notification suppression during active build

TestDirectoryNotificationsSuppressedDuringBuild verifies that metadata
events targeting a directory under active EnsureVisited build do NOT
fire onDirectoryUpdate for that directory. In production, this prevents
markDirectoryReadThrough from calling DeleteFolderChildren mid-build,
which would wipe entries already inserted by the listing.

The test inserts an entry during a build, sends multiple metadata events
for the building directory, asserts no notifications fired for it,
verifies the entry survives, and confirms buffered events are replayed
after CompleteDirectoryBuild.

* Fix create invalidations, build guard, event shape, context, and snapshot error path

- collectEntryInvalidations: invalidate FUSE kernel cache on pure
  create events (OldEntry==nil && NewEntry!=nil), not just updates
  and deletes
- completeDirectoryBuildNow: only call markCachedFn when an active
  build existed (state != nil), preventing an unpopulated directory
  from being marked as cached
- Add metadataCreateEvent helper that produces a create-shaped event
  (NewEntry only, no OldEntry) and use it in mkdir, mknod, symlink,
  and hardlink create fallback paths instead of metadataUpdateEvent
  which incorrectly set both OldEntry and NewEntry
- applyMetadataResponseEnqueue: use context.Background() for the
  queued mutation so a cancelled caller context cannot abort the
  apply loop mid-write
- DoSeaweedListWithSnapshot: move snapshot initialization before
  ListEntries call so the error path returns the preserved snapshot
  instead of 0

* Fix review findings: test loop, cache race, context safety, snapshot consistency

- Fix build test loop starting at i=1 instead of i=0, missing new-0.txt verification
- Re-check IsDirectoryCached after cache miss to avoid ENOENT race with markDirectoryReadThrough
- Use context.Background() in enqueueAndWait so caller cancellation can't abort build/complete mid-way
- Pass dh.snapshotTsNs in skip-batch loadDirectoryEntriesDirect for snapshot consistency
- Prefer resp.MetadataEvent over fallback in Unlink event derivation
- Add comment on MetadataEventSink.Record single-event assumption

* Fix empty-directory snapshot clock skew and build cancellation race

Empty-directory snapshot: Remove client-side time.Now() synthesis when
the server returns no entries. Instead return snapshotTsNs=0, and in
completeDirectoryBuildNow replay ALL buffered events when snapshot is 0.
This eliminates the clock-skew bug where a client ahead of the filer
would filter out legitimate post-list events.

Build cancellation: Use context.Background() for BeginDirectoryBuild
and CompleteDirectoryBuild calls in doEnsureVisited, so errgroup
cancellation doesn't cause enqueueAndWait to return early and trigger
cleanupBuild while the operation is still queued.

* Add tests for empty-directory build replay and cancellation resilience

TestEmptyDirectoryBuildReplaysAllBufferedEvents: verifies that when
CompleteDirectoryBuild receives snapshotTsNs=0 (empty directory, no
server snapshot), ALL buffered events are replayed regardless of their
TsNs values — no clock-skew-sensitive filtering occurs.

TestBuildCompletionSurvivesCallerCancellation: verifies that once
CompleteDirectoryBuild is enqueued, a cancelled caller context does not
prevent the build from completing. The apply loop runs with
context.Background(), so the directory becomes cached and buffered
events are replayed even when the caller gives up waiting.

* Fix directory subtree cleanup, Link rollback, test robustness

- applyMetadataResponseLocked: when a directory entry is deleted or
  moved, call DeleteFolderChildren on the old path so cached descendants
  don't leak as stale entries.

- Link: save original HardLinkId/Counter before mutation. If
  CreateEntryWithResponse fails after the source was already updated,
  rollback the source entry to its original state via UpdateEntry.

- TestBuildCompletionSurvivesCallerCancellation: replace fixed
  time.Sleep(50ms) with a deadline-based poll that checks
  IsDirectoryCached in a loop, failing only after 2s timeout.

- TestReadDirAllEntriesWithSnapshotEmptyDirectory: assert that
  ListEntries was actually invoked on the mock client so the test
  exercises the RPC path.

- newMetadataEvent: add early return when both oldEntry and newEntry are
  nil to avoid emitting events with empty Directory.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-07 09:19:40 -08:00
Chris Lu
b3620c7e14 admin: auto migrating master maintenance scripts to admin_script plugin config (#8509)
* admin: seed admin_script plugin config from master maintenance scripts

When the admin server starts, fetch the maintenance scripts configuration
from the master via GetMasterConfiguration. If the admin_script plugin
worker does not already have a saved config, use the master's scripts as
the default value. This enables seamless migration from master.toml
[master.maintenance] to the admin script plugin worker.

Changes:
- Add maintenance_scripts and maintenance_sleep_minutes fields to
  GetMasterConfigurationResponse in master.proto
- Populate the new fields from viper config in master_grpc_server.go
- On admin server startup, fetch the master config and seed the
  admin_script plugin config if no config exists yet
- Strip lock/unlock commands from the master scripts since the admin
  script worker handles locking automatically

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address review comments on admin_script seeding

- Replace TOCTOU race (separate Load+Save) with atomic
  SaveJobTypeConfigIfNotExists on ConfigStore and Plugin
- Replace ineffective polling loop with single GetMaster call using
  30s context timeout, since GetMaster respects context cancellation
- Add unit tests for SaveJobTypeConfigIfNotExists (in-memory + on-disk)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: apply maintenance script defaults in gRPC handler

The gRPC handler for GetMasterConfiguration read maintenance scripts
from viper without calling SetDefault, relying on startAdminScripts
having run first. If the admin server calls GetMasterConfiguration
before startAdminScripts sets the defaults, viper returns empty
strings and the seeding is silently skipped.

Apply SetDefault in the gRPC handler itself so it is self-contained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix: apply maintenance script defaults in gRPC handler"

This reverts commit 068a5063303f6bc34825a07bb681adfa67e6f9de.

* fix: use atomic save in ensureJobTypeConfigFromDescriptor

ensureJobTypeConfigFromDescriptor used a separate Load + Save, racing
with seedAdminScriptFromMaster. If the descriptor defaults (empty
script) were saved first, SaveJobTypeConfigIfNotExists in the seeding
goroutine would see an existing config and skip, losing the master's
maintenance scripts.

Switch to SaveJobTypeConfigIfNotExists so both paths are atomic. Whichever
wins, the other is a safe no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: fetch master scripts inline during config bootstrap, not in goroutine

Replace the seedAdminScriptFromMaster goroutine with a
ConfigDefaultsProvider callback. When the plugin bootstraps
admin_script defaults from the worker descriptor, it calls the
provider which fetches maintenance scripts from the master
synchronously. This eliminates the race between the seeding
goroutine and the descriptor-based config bootstrap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* skip commented lock unlock

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* reduce grpc calls

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-04 22:11:07 -08:00
Racci
9e26d6f5dd fix: port in SNI address when using domainName instead of IP for master (#8500) 2026-03-04 07:05:45 -08:00
Chris Lu
18ccc9b773 Plugin scheduler: sequential iterations with max runtime (#8496)
* pb: add job type max runtime setting

* plugin: default job type max runtime

* plugin: redesign scheduler loop

* admin ui: update scheduler settings

* plugin: fix scheduler loop state name

* plugin scheduler: restore backlog skip

* plugin scheduler: drop legacy detection helper

* admin api: require scheduler config body

* admin ui: preserve detection interval on save

* plugin scheduler: use job context and drain cancels

* plugin scheduler: respect detection intervals

* plugin scheduler: gate runs and drain queue

* ec test: reuse req/resp vars

* ec test: add scheduler debug logs

* Adjust scheduler idle sleep and initial run delay

* Clear pending job queue before scheduler runs

* Log next detection time in EC integration test

* Improve plugin scheduler debug logging in EC test

* Expose scheduler next detection time

* Log scheduler next detection time in EC test

* Wake scheduler on config or worker updates

* Expose scheduler sleep interval in UI

* Fix scheduler sleep save value selection

* Set scheduler idle sleep default to 613s

* Show scheduler next run time in plugin UI

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-03 23:09:49 -08:00
Chris Lu
f5c35240be Add volume dir tags and EC placement priority (#8472)
* Add volume dir tags to topology

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add preferred tag config for EC

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Prioritize EC destinations by tags

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add EC placement planner tag tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactor EC placement tests to reuse buildActiveTopology

Remove buildActiveTopologyWithDiskTags helper function and consolidate
tag setup inline in test cases. Tests now use UpdateTopology to apply
tags after topology creation, reusing the existing buildActiveTopology
function rather than duplicating its logic.

All tag scenario tests pass:
- TestECPlacementPlannerPrefersTaggedDisks
- TestECPlacementPlannerFallsBackWhenTagsInsufficient

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Consolidate normalizeTagList into shared util package

Extract normalizeTagList from three locations (volume.go,
detection.go, erasure_coding_handler.go) into new weed/util/tag.go
as exported NormalizeTagList function. Replace all duplicate
implementations with imports and calls to util.NormalizeTagList.

This improves code reuse and maintainability by centralizing
tag normalization logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add PreferredTags to EC config persistence

Add preferred_tags field to ErasureCodingTaskConfig protobuf with field
number 5. Update GetConfigSpec to include preferred_tags field in the
UI configuration schema. Add PreferredTags to ToTaskPolicy to serialize
config to protobuf. Add PreferredTags to FromTaskPolicy to deserialize
from protobuf with defensive copy to prevent external mutation.

This allows EC preferred tags to be persisted and restored across
worker restarts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add defensive copy for Tags slice in DiskLocation

Copy the incoming tags slice in NewDiskLocation instead of storing
by reference. This prevents external callers from mutating the
DiskLocation.Tags slice after construction, improving encapsulation
and preventing unexpected changes to disk metadata.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add doc comment to buildCandidateSets method

Document the tiered candidate selection and fallback behavior. Explain
that for a planner with preferredTags, it accumulates disks matching
each tag in order into progressively larger tiers, emits a candidate
set once a tier reaches shardsNeeded, and finally falls back to the
full candidates set if preferred-tag tiers are insufficient.

This clarifies the intended semantics for future maintainers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Apply final PR review fixes

1. Update parseVolumeTags to replicate single tag entry to all folders
   instead of leaving some folders with nil tags. This prevents nil
   pointer dereferences when processing folders without explicit tags.

2. Add defensive copy in ToTaskPolicy for PreferredTags slice to match
   the pattern used in FromTaskPolicy, preventing external mutation of
   the returned TaskPolicy.

3. Add clarifying comment in buildCandidateSets explaining that the
   shardsNeeded <= 0 branch is a defensive check for direct callers,
   since selectDestinations guarantees shardsNeeded > 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix nil pointer dereference in parseVolumeTags

Ensure all folder tags are initialized to either normalized tags or
empty slices, not nil. When multiple tag entries are provided and there
are more folders than entries, remaining folders now get empty slices
instead of nil, preventing nil pointer dereference in downstream code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix NormalizeTagList to return empty slice instead of nil

Change NormalizeTagList to always return a non-nil slice. When all tags
are empty or whitespace after normalization, return an empty slice
instead of nil. This prevents nil pointer dereferences in downstream
code that expects a valid (possibly empty) slice.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add nil safety check for v.tags pointer

Add a safety check to handle the case where v.tags might be nil,
preventing a nil pointer dereference. If v.tags is nil, use an empty
string instead. This is defensive programming to prevent panics in
edge cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add volume.tags flag to weed server and weed mini commands

Add the volume.tags CLI option to both the 'weed server' and 'weed mini'
commands. This allows users to specify disk tags when running the
combined server modes, just like they can with 'weed volume'.

The flag uses the same format and description as the volume command:
comma-separated tag groups per data dir with ':' separators
(e.g. fast:ssd,archive).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-01 10:22:00 -08:00
Chris Lu
c73e65ad5e Add customizable plugin display names and weights (#8459)
* feat: add customizable plugin display names and weights

- Add weight field to JobTypeCapability proto message
- Modify ListKnownJobTypes() to return JobTypeInfo with display names and weights
- Modify ListPluginJobTypes() to return JobTypeInfo instead of string
- Sort plugins by weight (descending) then alphabetically
- Update admin API to return enriched job type metadata
- Update plugin UI template to display names instead of IDs
- Consolidate API by reusing existing function names instead of suffixed variants

* perf: optimize plugin job type capability lookup and add null-safe parsing

- Pre-calculate job type capabilities in a map to reduce O(n*m) nested loops
  to O(n+m) lookup time in ListKnownJobTypes()
- Add parseJobTypeItem() helper function for null-safe job type item parsing
- Refactor plugin.templ to use parseJobTypeItem() in all job type access points
  (hasJobType, applyInitialNavigation, ensureActiveNavigation, renderTopTabs)
- Deterministic capability resolution by using first worker's capability

* templ

* refactor: use parseJobTypeItem helper consistently in plugin.templ

Replace duplicated job type extraction logic at line 1296-1298 with
parseJobTypeItem() helper function for consistency and maintainability.

* improve: prefer richer capability metadata and add null-safety checks

- Improve capability selection in ListKnownJobTypes() to prefer capabilities
  with non-empty DisplayName and higher Weight across all workers instead of
  first-wins approach. Handles mixed-version clusters better.
- Add defensive null checks in renderJobTypeSummary() to safely access
  parseJobTypeItem() result before property access
- Ensures malformed or missing entries won't break the rendering pipeline

* fix: preserve existing DisplayName when merging capabilities

Fix capability merge logic to respect existing DisplayName values:
- If existing has DisplayName but candidate doesn't, preserve existing
- If existing doesn't have DisplayName but candidate does, use candidate
- Only use Weight comparison if DisplayName status is equal
- Prevents higher-weight capabilities with empty DisplayName from
  overriding capabilities with non-empty DisplayName
2026-02-26 19:20:48 -08:00
Chris Lu
a2005cb2a6 fix: resolve gRPC DNS resolution issues in Kubernetes #8384 (#8387)
* fix: resolve gRPC DNS resolution issues in Kubernetes #8384

- Replace direct `grpc.NewClient` calls with `pb.GrpcDial` for consistent connection establishment
- Fix async DNS resolution behavior in K8s with `ndots:5`
- Ensure high-level components use established helper for reliable networking

* refactor: refine gRPC DNS fix and add documentation

- Use instance's grpcDialOption in BrokerClient.ConfigureTopic
- Add detailed comments to GrpcDial explaining Kubernetes DNS resolution rationale

* fix: ensure proper context propagation in broker_client gRPC calls

- Pass the provided `ctx` to `pb.GrpcDial` in `ConfigureTopic` and `GetUnflushedMessages`
- Ensures that timeouts and cancellations are correctly honored during connection establishment

* docs: refine gRPC resolver documentation and cleanup dead code

- Enhanced documentation for `GrpcDial` with explicit warnings about global state mutation when using `resolver.SetDefaultScheme("passthrough")`.
- Recommended `passthrough:///` prefix as the primary migration path for `grpc.NewClient`.
- Removed dead commented-out code for `grpc.WithBlock()` and `grpc.WithTimeout()`.
2026-02-19 15:46:02 -08:00
Chris Lu
8ec9ff4a12 Refactor plugin system and migrate worker runtime (#8369)
* admin: add plugin runtime UI page and route wiring

* pb: add plugin gRPC contract and generated bindings

* admin/plugin: implement worker registry, runtime, monitoring, and config store

* admin/dash: wire plugin runtime and expose plugin workflow APIs

* command: add flags to enable plugin runtime

* admin: rename remaining plugin v2 wording to plugin

* admin/plugin: add detectable job type registry helper

* admin/plugin: add scheduled detection and dispatch orchestration

* admin/plugin: prefetch job type descriptors when workers connect

* admin/plugin: add known job type discovery API and UI

* admin/plugin: refresh design doc to match current implementation

* admin/plugin: enforce per-worker scheduler concurrency limits

* admin/plugin: use descriptor runtime defaults for scheduler policy

* admin/ui: auto-load first known plugin job type on page open

* admin/plugin: bootstrap persisted config from descriptor defaults

* admin/plugin: dedupe scheduled proposals by dedupe key

* admin/ui: add job type and state filters for plugin monitoring

* admin/ui: add per-job-type plugin activity summary

* admin/plugin: split descriptor read API from schema refresh

* admin/ui: keep plugin summary metrics global while tables are filtered

* admin/plugin: retry executor reservation before timing out

* admin/plugin: expose scheduler states for monitoring

* admin/ui: show per-job-type scheduler states in plugin monitor

* pb/plugin: rename protobuf package to plugin

* admin/plugin: rename pluginRuntime wiring to plugin

* admin/plugin: remove runtime naming from plugin APIs and UI

* admin/plugin: rename runtime files to plugin naming

* admin/plugin: persist jobs and activities for monitor recovery

* admin/plugin: lease one detector worker per job type

* admin/ui: show worker load from plugin heartbeats

* admin/plugin: skip stale workers for detector and executor picks

* plugin/worker: add plugin worker command and stream runtime scaffold

* plugin/worker: implement vacuum detect and execute handlers

* admin/plugin: document external vacuum plugin worker starter

* command: update plugin.worker help to reflect implemented flow

* command/admin: drop legacy Plugin V2 label

* plugin/worker: validate vacuum job type and respect min interval

* plugin/worker: test no-op detect when min interval not elapsed

* command/admin: document plugin.worker external process

* plugin/worker: advertise configured concurrency in hello

* command/plugin.worker: add jobType handler selection

* command/plugin.worker: test handler selection by job type

* command/plugin.worker: persist worker id in workingDir

* admin/plugin: document plugin.worker jobType and workingDir flags

* plugin/worker: support cancel request for in-flight work

* plugin/worker: test cancel request acknowledgements

* command/plugin.worker: document workingDir and jobType behavior

* plugin/worker: emit executor activity events for monitor

* plugin/worker: test executor activity builder

* admin/plugin: send last successful run in detection request

* admin/plugin: send cancel request when detect or execute context ends

* admin/plugin: document worker cancel request responsibility

* admin/handlers: expose plugin scheduler states API in no-auth mode

* admin/handlers: test plugin scheduler states route registration

* admin/plugin: keep worker id on worker-generated activity records

* admin/plugin: test worker id propagation in monitor activities

* admin/dash: always initialize plugin service

* command/admin: remove plugin enable flags and default to enabled

* admin/dash: drop pluginEnabled constructor parameter

* admin/plugin UI: stop checking plugin enabled state

* admin/plugin: remove docs for plugin enable flags

* admin/dash: remove unused plugin enabled check method

* admin/dash: fallback to in-memory plugin init when dataDir fails

* admin/plugin API: expose worker gRPC port in status

* command/plugin.worker: resolve admin gRPC port via plugin status

* split plugin UI into overview/configuration/monitoring pages

* Update layout_templ.go

* add volume_balance plugin worker handler

* wire plugin.worker CLI for volume_balance job type

* add erasure_coding plugin worker handler

* wire plugin.worker CLI for erasure_coding job type

* support multi-job handlers in plugin worker runtime

* allow plugin.worker jobType as comma-separated list

* admin/plugin UI: rename to Workers and simplify config view

* plugin worker: queue detection requests instead of capacity reject

* Update plugin_worker.go

* plugin volume_balance: remove force_move/timeout from worker config UI

* plugin erasure_coding: enforce local working dir and cleanup

* admin/plugin UI: rename admin settings to job scheduling

* admin/plugin UI: persist and robustly render detection results

* admin/plugin: record and return detection trace metadata

* admin/plugin UI: show detection process and decision trace

* plugin: surface detector decision trace as activities

* mini: start a plugin worker by default

* admin/plugin UI: split monitoring into detection and execution tabs

* plugin worker: emit detection decision trace for EC and balance

* admin workers UI: split monitoring into detection and execution pages

* plugin scheduler: skip proposals for active assigned/running jobs

* admin workers UI: add job queue tab

* plugin worker: add dummy stress detector and executor job type

* admin workers UI: reorder tabs to detection queue execution

* admin workers UI: regenerate plugin template

* plugin defaults: include dummy stress and add stress tests

* plugin dummy stress: rotate detection selections across runs

* plugin scheduler: remove cross-run proposal dedupe

* plugin queue: track pending scheduled jobs

* plugin scheduler: wait for executor capacity before dispatch

* plugin scheduler: skip detection when waiting backlog is high

* plugin: add disk-backed job detail API and persistence

* admin ui: show plugin job detail modal from job id links

* plugin: generate unique job ids instead of reusing proposal ids

* plugin worker: emit heartbeats on work state changes

* plugin registry: round-robin tied executor and detector picks

* add temporary EC overnight stress runner

* plugin job details: persist and render EC execution plans

* ec volume details: color data and parity shard badges

* shard labels: keep parity ids numeric and color-only distinction

* admin: remove legacy maintenance UI routes and templates

* admin: remove dead maintenance endpoint helpers

* Update layout_templ.go

* remove dummy_stress worker and command support

* refactor plugin UI to job-type top tabs and sub-tabs

* migrate weed worker command to plugin runtime

* remove plugin.worker command and keep worker runtime with metrics

* update helm worker args for jobType and execution flags

* set plugin scheduling defaults to global 16 and per-worker 4

* stress: fix RPC context reuse and remove redundant variables in ec_stress_runner

* admin/plugin: fix lifecycle races, safe channel operations, and terminal state constants

* admin/dash: randomize job IDs and fix priority zero-value overwrite in plugin API

* admin/handlers: implement buffered rendering to prevent response corruption

* admin/plugin: implement debounced persistence flusher and optimize BuildJobDetail memory lookups

* admin/plugin: fix priority overwrite and implement bounded wait in scheduler reserve

* admin/plugin: implement atomic file writes and fix run record side effects

* admin/plugin: use P prefix for parity shard labels in execution plans

* admin/plugin: enable parallel execution for cancellation tests

* admin: refactor time.Time fields to pointers for better JSON omitempty support

* admin/plugin: implement pointer-safe time assignments and comparisons in plugin core

* admin/plugin: fix time assignment and sorting logic in plugin monitor after pointer refactor

* admin/plugin: update scheduler activity tracking to use time pointers

* admin/plugin: fix time-based run history trimming after pointer refactor

* admin/dash: fix JobSpec struct literal in plugin API after pointer refactor

* admin/view: add D/P prefixes to EC shard badges for UI consistency

* admin/plugin: use lifecycle-aware context for schema prefetching

* Update ec_volume_details_templ.go

* admin/stress: fix proposal sorting and log volume cleanup errors

* stress: refine ec stress runner with math/rand and collection name

- Added Collection field to VolumeEcShardsDeleteRequest for correct filename construction.
- Replaced crypto/rand with seeded math/rand PRNG for bulk payloads.
- Added documentation for EcMinAge zero-value behavior.
- Added logging for ignored errors in volume/shard deletion.

* admin: return internal server error for plugin store failures

Changed error status code from 400 Bad Request to 500 Internal Server Error for failures in GetPluginJobDetail to correctly reflect server-side errors.

* admin: implement safe channel sends and graceful shutdown sync

- Added sync.WaitGroup to Plugin struct to manage background goroutines.
- Implemented safeSendCh helper using recover() to prevent panics on closed channels.
- Ensured Shutdown() waits for all background operations to complete.

* admin: robustify plugin monitor with nil-safe time and record init

- Standardized nil-safe assignment for *time.Time pointers (CreatedAt, UpdatedAt, CompletedAt).
- Ensured persistJobDetailSnapshot initializes new records correctly if they don't exist on disk.
- Fixed debounced persistence to trigger immediate write on job completion.

* admin: improve scheduler shutdown behavior and logic guards

- Replaced brittle error string matching with explicit r.shutdownCh selection for shutdown detection.
- Removed redundant nil guard in buildScheduledJobSpec.
- Standardized WaitGroup usage for schedulerLoop.

* admin: implement deep copy for job parameters and atomic write fixes

- Implemented deepCopyGenericValue and used it in cloneTrackedJob to prevent shared state.
- Ensured atomicWriteFile creates parent directories before writing.

* admin: remove unreachable branch in shard classification

Removed an unreachable 'totalShards <= 0' check in classifyShardID as dataShards and parityShards are already guarded.

* admin: secure UI links and use canonical shard constants

- Added rel="noopener noreferrer" to external links for security.
- Replaced magic number 14 with erasure_coding.TotalShardsCount.
- Used renderEcShardBadge for missing shard list consistency.

* admin: stabilize plugin tests and fix regressions

- Composed a robust plugin_monitor_test.go to handle asynchronous persistence.
- Updated all time.Time literals to use timeToPtr helper.
- Added explicit Shutdown() calls in tests to synchronize with debounced writes.
- Fixed syntax errors and orphaned struct literals in tests.

* Potential fix for code scanning alert no. 278: Slice memory allocation with excessive size value

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* Potential fix for code scanning alert no. 283: Uncontrolled data used in path expression

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* admin: finalize refinements for error handling, scheduler, and race fixes

- Standardized HTTP 500 status codes for store failures in plugin_api.go.
- Tracked scheduled detection goroutines with sync.WaitGroup for safe shutdown.
- Fixed race condition in safeSendDetectionComplete by extracting channel under lock.
- Implemented deep copy for JobActivity details.
- Used defaultDirPerm constant in atomicWriteFile.

* test(ec): migrate admin dockertest to plugin APIs

* admin/plugin_api: fix RunPluginJobTypeAPI to return 500 for server-side detection/filter errors

* admin/plugin_api: fix ExecutePluginJobAPI to return 500 for job execution failures

* admin/plugin_api: limit parseProtoJSONBody request body to 1MB to prevent unbounded memory usage

* admin/plugin: consolidate regex to package-level validJobTypePattern; add char validation to sanitizeJobID

* admin/plugin: fix racy Shutdown channel close with sync.Once

* admin/plugin: track sendLoop and recv goroutines in WorkerStream with r.wg

* admin/plugin: document writeProtoFiles atomicity — .pb is source of truth, .json is human-readable only

* admin/plugin: extract activityLess helper to deduplicate nil-safe OccurredAt sort comparators

* test/ec: check http.NewRequest errors to prevent nil req panics

* test/ec: replace deprecated ioutil/math/rand, fix stale step comment 5.1→3.1

* plugin(ec): raise default detection and scheduling throughput limits

* topology: include empty disks in volume list and EC capacity fallback

* topology: remove hard 10-task cap for detection planning

* Update ec_volume_details_templ.go

* adjust default

* fix tests

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-02-18 13:42:41 -08:00
Chris Lu
f44e25b422 fix(iam): ensure access key status is persisted and defaulted to Active (#8341)
* Fix master leader election startup issue

Fixes #error-log-leader-not-selected-yet

* not useful test

* fix(iam): ensure access key status is persisted and defaulted to Active

* make pb

* update tests

* using constants
2026-02-13 20:28:41 -08:00
Chris Lu
b08bb8237c Fix master leader election startup issue (#8340)
* Fix master leader election startup issue

Fixes #error-log-leader-not-selected-yet

* Fix master leader election startup issue

This change improves server address comparison using the 'Equals' method and handles recursion in topology leader lookup, resolving the 'leader not selected yet' error during master startup.

* Merge user improvements: use MaybeLeader for non-blocking checks

* not useful test

* Address code review: optimize Equals, fix deadlock in IsLeader, safe access in Leader
2026-02-13 15:39:39 -08:00
Chris Lu
1e4f30c56f pb: fix IPv6 double brackets in ServerAddress formatting (#8329)
* pb: fix IPv6 double brackets in ServerAddress formatting

* pb: refactor IPv6 tests into table-driven test

* util: add JoinHostPortStr and use it in pb to avoid unsafe port parsing
2026-02-12 18:11:03 -08:00
Lisandro Pin
1a5679a5eb Implement a VolumeEcStatus() RPC for volume servers. (#8006)
Just like `VolumeStatus()`, this call allows inspecting details for
a given EC volume - including number of files and their total size.
2026-02-09 11:52:08 -08:00
Chris Lu
a3b83f8808 test: add Trino Iceberg catalog integration test (#8228)
* test: add Trino Iceberg catalog integration test

- Create test/s3/catalog_trino/trino_catalog_test.go with TestTrinoIcebergCatalog
- Tests integration between Trino SQL engine and SeaweedFS Iceberg REST catalog
- Starts weed mini with all services and Trino in Docker container
- Validates Iceberg catalog schema creation and listing operations
- Uses native S3 filesystem support in Trino with path-style access
- Add workflow job to s3-tables-tests.yml for CI execution

* fix: preserve AWS environment credentials when replacing S3 configuration

When S3 configuration is loaded from filer/db, it replaces the identities list
and inadvertently removes AWS_ACCESS_KEY_ID credentials that were added from
environment variables. This caused auth to remain disabled even though valid
credentials were present.

Fix by preserving environment-based identities when replacing the configuration
and re-adding them after the replacement. This ensures environment credentials
persist across configuration reloads and properly enable authentication.

* fix: use correct ServerAddress format with gRPC port encoding

The admin server couldn't connect to master because the master address
was missing the gRPC port information. Use pb.NewServerAddress() which
properly encodes both HTTP and gRPC ports in the address string.

Changes:
- weed/command/mini.go: Use pb.NewServerAddress for master address in admin
- test/s3/policy/policy_test.go: Store and use gRPC ports for master/filer addresses

This fix applies to:
1. Admin server connection to master (mini.go)
2. Test shell commands that need master/filer addresses (policy_test.go)

* move

* move

* fix: always include gRPC port in server address encoding

The NewServerAddress() function was omitting the gRPC port from the address
string when it matched the port+10000 convention. However, gRPC port allocation
doesn't always follow this convention - when the calculated port is busy, an
alternative port is allocated.

This caused a bug where:
1. Master's gRPC port was allocated as 50661 (sequential, not port+10000)
2. Address was encoded as '192.168.1.66:50660' (gRPC port omitted)
3. Admin client called ToGrpcAddress() which assumed port+10000 offset
4. Admin tried to connect to 60660 but master was on 50661 → connection failed

Fix: Always include explicit gRPC port in address format (host:httpPort.grpcPort)
unless gRPC port is 0. This makes addresses unambiguous and works regardless of
the port allocation strategy used.

Impacts: All server-to-server gRPC connections now use properly formatted addresses.

* test: fix Iceberg REST API readiness check

The Iceberg REST API endpoints require authentication. When checked without
credentials, the API returns 403 Forbidden (not 401 Unauthorized).  The
readiness check now accepts both auth error codes (401/403) as indicators
that the service is up and ready, it just needs credentials.

This fixes the 'Iceberg REST API did not become ready' test failure.

* Fix AWS SigV4 signature verification for base64-encoded payload hashes

   AWS SigV4 canonical requests must use hex-encoded SHA256 hashes,
   but the X-Amz-Content-Sha256 header may be transmitted as base64.

   Changes:
   - Added normalizePayloadHash() function to convert base64 to hex
   - Call normalizePayloadHash() in extractV4AuthInfoFromHeader()
   - Added encoding/base64 import

   Fixes 403 Forbidden errors on POST requests to Iceberg REST API
   when clients send base64-encoded content hashes in the header.

   Impacted services: Iceberg REST API, S3Tables

* Fix AWS SigV4 signature verification for base64-encoded payload hashes

   AWS SigV4 canonical requests must use hex-encoded SHA256 hashes,
   but the X-Amz-Content-Sha256 header may be transmitted as base64.

   Changes:
   - Added normalizePayloadHash() function to convert base64 to hex
   - Call normalizePayloadHash() in extractV4AuthInfoFromHeader()
   - Added encoding/base64 import
   - Removed unused fmt import

   Fixes 403 Forbidden errors on POST requests to Iceberg REST API
   when clients send base64-encoded content hashes in the header.

   Impacted services: Iceberg REST API, S3Tables

* pass sigv4

* s3api: fix identity preservation and logging levels

- Ensure environment-based identities are preserved during config replacement
- Update accessKeyIdent and nameToIdentity maps correctly
- Downgrade informational logs to V(2) to reduce noise

* test: fix trino integration test and s3 policy test

- Pin Trino image version to 479
- Fix port binding to 0.0.0.0 for Docker connectivity
- Fix S3 policy test hang by correctly assigning MiniClusterCtx
- Improve port finding robustness in policy tests

* ci: pre-pull trino image to avoid timeouts

- Pull trinodb/trino:479 after Docker setup
- Ensure image is ready before integration tests start

* iceberg: remove unused checkAuth and improve logging

- Remove unused checkAuth method
- Downgrade informational logs to V(2)
- Ensure loggingMiddleware uses a status writer for accurate reported codes
- Narrow catch-all route to avoid interfering with other subsystems

* iceberg: fix build failure by removing unused s3api import

* Update iceberg.go

* use warehouse

* Update trino_catalog_test.go
2026-02-06 13:12:25 -08:00
Lisandro Pin
2cda4289f4 Add a version token on RPCs to read/update volume server states. (#8191)
* Add a version token on `GetState()`/`SetState()` RPCs for volume server states.

* Make state version a property ov `VolumeServerState` instead of an in-memory counter.

Also extend state atomicity to reads, instead of just writes.
2026-02-06 10:58:43 -08:00
Lisandro Pin
9d751a7b61 Contrib/volume scrub local (#8226) 2026-02-05 14:44:12 -08:00
Lisandro Pin
ff5a8f0579 Implement RPC skeleton for regular/EC volumes scrubbing. (#8187)
* Implement RPC skeleton for regular/EC volumes scrubbing.

See https://github.com/seaweedfs/seaweedfs/issues/8018 for details.

* Minor proto improvements for `ScrubVolume()`, `ScrubEcVolume()`:

  - Add fields for scrubbing details in `ScrubVolumeResponse` and `ScrubEcVolumeResponse`,
    instead of reporting these through RPC errors.
  - Return a list of broken shards when scrubbing EC volumes, via `EcShardInfo'.
2026-02-02 17:55:04 -08:00
Lisandro Pin
345ac950b6 Add volume server RPCs to read and update state flags. (#8186)
* Boostrap persistent state for volume servers.

This PR implements logic load/save persistent state information for storages
associated with volume servers, and reporting state changes back to masters
via heartbeat messages.

More work ensues!

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.

* Add volume server RPCs to read and update state flags.
2026-02-02 16:22:17 -08:00
Lisandro Pin
9638d37fe2 Block RPC write operations on volume servers when maintenance mode is enabled (#8115)
* Boostrap persistent state for volume servers.

This PR implements logic load/save persistent state information for storages
associated with volume servers, and reporting state changes back to masters
via heartbeat messages.

More work ensues!

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.

* Block RPC operations writing to volume servers when maintenance mode is on.
2026-02-02 13:21:02 -08:00
Chris Lu
551a31e156 Implement IAM propagation to S3 servers (#8130)
* Implement IAM propagation to S3 servers

- Add PropagatingCredentialStore to propagate IAM changes to S3 servers via gRPC
- Add Policy management RPCs to S3 proto and S3ApiServer
- Update CredentialManager to use PropagatingCredentialStore when MasterClient is available
- Wire FilerServer to enable propagation

* Implement parallel IAM propagation and fix S3 cluster registration

- Parallelized IAM change propagation with 10s timeout.
- Refined context usage in PropagatingCredentialStore.
- Added S3Type support to cluster node management.
- Enabled S3 servers to register with gRPC address to the master.
- Ensured IAM configuration reload after policy updates via gRPC.

* Optimize IAM propagation with direct in-memory cache updates

* Secure IAM propagation: Use metadata to skip persistence only on propagation

* pb: refactor IAM and S3 services for unidirectional IAM propagation

- Move SeaweedS3IamCache service from iam.proto to s3.proto.
- Remove legacy IAM management RPCs and empty SeaweedS3 service from s3.proto.
- Enforce that S3 servers only use the synchronization interface.

* pb: regenerate Go code for IAM and S3 services

Updated generated code following the proto refactoring of IAM synchronization services.

* s3api: implement read-only mode for Embedded IAM API

- Add readOnly flag to EmbeddedIamApi to reject write operations via HTTP.
- Enable read-only mode by default in S3ApiServer.
- Handle AccessDenied error in writeIamErrorResponse.
- Embed SeaweedS3IamCacheServer in S3ApiServer.

* credential: refactor PropagatingCredentialStore for unidirectional IAM flow

- Update to use s3_pb.SeaweedS3IamCacheClient for propagation to S3 servers.
- Propagate full Identity object via PutIdentity for consistency.
- Remove redundant propagation of specific user/account/policy management RPCs.
- Add timeout context for propagation calls.

* s3api: implement SeaweedS3IamCacheServer for unidirectional sync

- Update S3ApiServer to implement the cache synchronization gRPC interface.
- Methods (PutIdentity, RemoveIdentity, etc.) now perform direct in-memory cache updates.
- Register SeaweedS3IamCacheServer in command/s3.go.
- Remove registration for the legacy and now empty SeaweedS3 service.

* s3api: update tests for read-only IAM and propagation

- Added TestEmbeddedIamReadOnly to verify rejection of write operations in read-only mode.
- Update test setup to pass readOnly=false to NewEmbeddedIamApi in routing tests.
- Updated EmbeddedIamApiForTest helper with read-only checks matching production behavior.

* s3api: add back temporary debug logs for IAM updates

Log IAM updates received via:
- gRPC propagation (PutIdentity, PutPolicy, etc.)
- Metadata configuration reloads (LoadS3ApiConfigurationFromCredentialManager)
- Core identity management (UpsertIdentity, RemoveIdentity)

* IAM: finalize propagation fix with reduced logging and clarified architecture

* Allow configuring IAM read-only mode for S3 server integration tests

* s3api: add defensive validation to UpsertIdentity

* s3api: fix log message to reference correct IAM read-only flag

* test/s3/iam: ensure WaitForS3Service checks for IAM write permissions

* test: enable writable IAM in Makefile for integration tests

* IAM: add GetPolicy/ListPolicies RPCs to s3.proto

* S3: add GetBucketPolicy and ListBucketPolicies helpers

* S3: support storing generic IAM policies in IdentityAccessManagement

* S3: implement IAM policy RPCs using IdentityAccessManagement

* IAM: fix stale user identity on rename propagation
2026-01-26 22:59:43 -08:00
Chris Lu
43229b05ce Explicit IAM gRPC APIs for S3 Server (#8126)
* Update IAM and S3 protobuf definitions for explicit IAM gRPC APIs

* Refactor s3api: Extract generic ExecuteAction method for IAM operations

* Implement explicit IAM gRPC APIs in S3 server

* iam: remove deprecated GetConfiguration and PutConfiguration RPCs

* iamapi: refactor handlers to use CredentialManager directly

* s3api: refactor embedded IAM to use CredentialManager directly

* server: remove deprecated configuration gRPC handlers

* credential/grpc: refactor configuration calls to return error

* shell: update s3.configure to list users instead of full config

* s3api: fix CreateServiceAccount gRPC handler to map required fields

* s3api: fix UpdateServiceAccount gRPC handler to map fields and safe status

* s3api: enforce UserName in embedded IAM ListAccessKeys

* test: fix test_config.json structure to match proto definition

* Revert "credential/grpc: refactor configuration calls to return error"

This reverts commit cde707dd8b88c7d1bd730271518542eceb5ed069.

* Revert "server: remove deprecated configuration gRPC handlers"

This reverts commit 7307e205a083c8315cf84ddc2614b3e50eda2e33.

* Revert "s3api: enforce UserName in embedded IAM ListAccessKeys"

This reverts commit adf727ba52b4f3ffb911f0d0df85db858412ff83.

* Revert "s3api: fix UpdateServiceAccount gRPC handler to map fields and safe status"

This reverts commit 6a4be3314d43b6c8fda8d5e0558e83e87a19df3f.

* Revert "s3api: fix CreateServiceAccount gRPC handler to map required fields"

This reverts commit 9bb4425f07fbad38fb68d33e5c0aa573d8912a37.

* Revert "shell: update s3.configure to list users instead of full config"

This reverts commit f3304ead537b3e6be03d46df4cb55983ab931726.

* Revert "s3api: refactor embedded IAM to use CredentialManager directly"

This reverts commit 9012f27af82d11f0e824877712a5ae2505a65f86.

* Revert "iamapi: refactor handlers to use CredentialManager directly"

This reverts commit 3a148212236576b0a3aa4d991c2abb014fb46091.

* Revert "iam: remove deprecated GetConfiguration and PutConfiguration RPCs"

This reverts commit e16e08aa0099699338d3155bc7428e1051ce0a6a.

* s3api: address IAM code review comments (error handling, logging, gRPC response mapping)

* s3api: add robustness to startup by retrying KEK and IAM config loading from Filer

* s3api: address IAM gRPC code review comments (safety, validation, status logic)

* fix return
2026-01-26 13:38:15 -08:00
Chris Lu
5a7c74feac migrate IAM policies to multi-file storage (#8114)
* Add IAM gRPC service definition

- Add GetConfiguration/PutConfiguration for config management
- Add CreateUser/GetUser/UpdateUser/DeleteUser/ListUsers for user management
- Add CreateAccessKey/DeleteAccessKey/GetUserByAccessKey for access key management
- Methods mirror existing IAM HTTP API functionality

* Add IAM gRPC handlers on filer server

- Implement IamGrpcServer with CredentialManager integration
- Handle configuration get/put operations
- Handle user CRUD operations
- Handle access key create/delete operations
- All methods delegate to CredentialManager for actual storage

* Wire IAM gRPC service to filer server

- Add CredentialManager field to FilerOption and FilerServer
- Import credential store implementations in filer command
- Initialize CredentialManager from credential.toml if available
- Register IAM gRPC service on filer gRPC server
- Enable credential management via gRPC alongside existing filer services

* Regenerate IAM protobuf with gRPC service methods

* fix: compilation error in DeleteUser

* fix: address code review comments for IAM migration

* feat: migrate policies to multi-file layout and fix identity duplicated content

* refactor: remove configuration.json and migrate Service Accounts to multi-file layout

* refactor: standardize Service Accounts as distinct store entities and fix Admin Server persistence

* config: set ServiceAccountsDirectory to /etc/iam/service_accounts

* Fix Chrome dialog auto-dismiss with Bootstrap modals

- Add modal-alerts.js library with Bootstrap modal replacements
- Replace all 15 confirm() calls with showConfirm/showDeleteConfirm
- Auto-override window.alert() for all alert() calls
- Fixes Chrome 132+ aggressively blocking native dialogs

* Upgrade Bootstrap from 5.3.2 to 5.3.8

* Fix syntax error in object_store_users.templ - remove duplicate closing braces

* create policy

* display errors

* migrate to multi-file policies

* address PR feedback: use showDeleteConfirm and showErrorMessage in policies.templ, refine migration check

* Update policies_templ.go

* add service account to iam grpc

* iam: fix potential path traversal in policy names by validating name pattern

* iam: add GetServiceAccountByAccessKey to CredentialStore interface

* iam: implement service account support for PostgresStore

Includes full CRUD operations and efficient lookup by access key.

* iam: implement GetServiceAccountByAccessKey for filer_etc, grpc, and memory stores

Provides efficient lookup of service accounts by access key where possible,
with linear scan fallbacks for file-based stores.

* iam: remove filer_multiple support

Deleted its implementation and references in imports, scaffold config,
and core interface constants. Redundant with filer_etc.

* clear comment

* dash: robustify service account construction

- Guard against nil sa.Credential when constructing responses
- Fix Expiration logic to only set if > 0, avoiding Unix epoch 1970
- Ensure consistency across Get, Create, and Update handlers

* credential/filer_etc: improve error propagation in configuration handlers

- Return error from loadServiceAccountsFromMultiFile to callers
- Ensure listEntries errors in SaveConfiguration (cleanup logic) are
  propagated unless they are "not found" failures.
- Fixes potential silent failures during IAM configuration sync.

* credential/filer_etc: add existence check to CreateServiceAccount

Ensures consistency with other stores by preventing accidental overwrite
of existing service accounts during creation.

* credential/memory: improve store robustness and Reset logic

- Enforce ID immutability in UpdateServiceAccount to prevent orphans
- Update Reset() to also clear the policies map, ensuring full state
  cleanup for tests.

* dash: improve service account robustness and policy docs

- Wrap parent user lookup errors to preserve context
- Strictly validate Status field in UpdateServiceAccount
- Add deprecation comments to legacy policy management methods

* credential/filer_etc: protect against path traversal in service accounts

Implemented ID validation (alphanumeric, underscores, hyphens) and applied
it to Get, Save, and Delete operations to ensure no directory traversal
via saId.json filenames.

* credential/postgres: improve robustness and cleanup comments

- Removed brainstorming comments in GetServiceAccountByAccessKey
- Added missing rows.Err() check during iteration
- Properly propagate Scan and Unmarshal errors instead of swallowing them

* admin: unify UI alerts and confirmations using Bootstrap modals

- Updated modal-alerts.js with improved automated alert type detection
- Replaced native alert() and confirm() with showAlert(), showConfirm(),
  and showDeleteConfirm() across various Templ components
- Improved UX for delete operations by providing better context and styling
- Ensured consistent error reporting across IAM and Maintenance views

* admin: additional UI consistency fixes for alerts and confirmations

- Replaced native alert() and confirm() with Bootstrap modals in:
  - EC volumes (repair flow)
  - Collection details (repair flow)
  - File browser (properties and delete)
  - Maintenance config schema (save and reset)
- Improved delete confirmation in file browser with item context
- Ensured consistent success/error/info styling for all feedbacks

* make

* iam: add GetServiceAccountByAccessKey RPC and update GetConfiguration

* iam: implement GetServiceAccountByAccessKey on server and client

* iam: centralize policy and service account validation

* iam: optimize MemoryStore service account lookups with indexing

* iam: fix postgres service_accounts table and optimize lookups

* admin: refactor modal alerts and clean up dashboard logic

* admin: fix EC shards table layout mismatch

* admin: URL-encode IAM path parameters for safety

* admin: implement pauseWorker logic in maintenance view

* iam: add rows.Err() check to postgres ListServiceAccounts

* iam: standardize ErrServiceAccountNotFound across credential stores

* iam: map ErrServiceAccountNotFound to codes.NotFound in DeleteServiceAccount

* iam: refine service account store logic, errors and schema

* iam: add validation to GetServiceAccountByAccessKey

* admin: refine modal titles and ensure URL safety

* admin: address bot review comments for alerts and async usage

* iam: fix syntax error by restoring missing function declaration

* [FilerEtcStore] improve error handling in CreateServiceAccount

Refine error handling to provide clearer messages when checking for
existing service accounts.

* [PostgresStore] add nil guards and validation to service account methods

Ensure input parameters are not nil and required IDs are present
to prevent runtime panics and ensure data integrity.

* [JS] add shared IAM utility script

Consolidate common IAM operations like deleteUser and deleteAccessKey
into a shared utility script for better maintainability.

* [View] include shared IAM utilities in layout

Include iam-utils.js in the main layout to make IAM functions
available across all administrative pages.

* [View] refactor IAM logic and restore async in EC Shards view

Remove redundant local IAM functions and ensure that delete
confirmation callbacks are properly marked as async.

* [View] consolidate IAM logic in Object Store Users view

Remove redundant local definitions of deleteUser and deleteAccessKey,
relying on the shared utilities instead.

* [View] update generated templ files for UI consistency

* credential/postgres: remove redundant name column from service_accounts table

The id is already used as the unique identifier and was being copied to the name column.
This removes the name column from the schema and updates the INSERT/UPDATE queries.

* credential/filer_etc: improve logging for policy migration failures

Added Errorf log if AtomicRenameEntry fails during migration to ensure visibility of common failure points.

* credential: allow uppercase characters in service account ID username

Updated ServiceAccountIdPattern to allow [A-Za-z0-9_-]+ for the username component,
matching the actual service account creation logic which uses the parent user name directly.

* Update object_store_users_templ.go

* admin: fix ec_shards pagination to handle numeric page arguments

Updated goToPage in cluster_ec_shards.templ to accept either an Event
or a numeric page argument. This prevents errors when goToPage(1)
is called directly. Corrected both the .templ source and generated Go code.

* credential/filer_etc: improve service account storage robustness

Added nil guard to saveServiceAccount, updated GetServiceAccount
to return ErrServiceAccountNotFound for empty data, and improved
deleteServiceAccount to handle response-level Filer errors.
2026-01-26 11:28:23 -08:00
Chris Lu
6bf088cec9 IAM Policy Management via gRPC (#8109)
* Add IAM gRPC service definition

- Add GetConfiguration/PutConfiguration for config management
- Add CreateUser/GetUser/UpdateUser/DeleteUser/ListUsers for user management
- Add CreateAccessKey/DeleteAccessKey/GetUserByAccessKey for access key management
- Methods mirror existing IAM HTTP API functionality

* Add IAM gRPC handlers on filer server

- Implement IamGrpcServer with CredentialManager integration
- Handle configuration get/put operations
- Handle user CRUD operations
- Handle access key create/delete operations
- All methods delegate to CredentialManager for actual storage

* Wire IAM gRPC service to filer server

- Add CredentialManager field to FilerOption and FilerServer
- Import credential store implementations in filer command
- Initialize CredentialManager from credential.toml if available
- Register IAM gRPC service on filer gRPC server
- Enable credential management via gRPC alongside existing filer services

* Regenerate IAM protobuf with gRPC service methods

* iam_pb: add Policy Management to protobuf definitions

* credential: implement PolicyManager in credential stores

* filer: implement IAM Policy Management RPCs

* shell: add s3.policy command

* test: add integration test for s3.policy

* test: fix compilation errors in policy_test

* pb

* fmt

* test

* weed shell: add -policies flag to s3.configure

This allows linking/unlinking IAM policies to/from identities
directly from the s3.configure command.

* test: verify s3.configure policy linking and fix port allocation

- Added test case for linking policies to users via s3.configure
- Implemented findAvailablePortPair to ensure HTTP and gRPC ports
  are both available, avoiding conflicts with randomized port assignments.
- Updated assertion to match jsonpb output (policyNames)

* credential: add StoreTypeGrpc constant

* credential: add IAM gRPC store boilerplate

* credential: implement identity methods in gRPC store

* credential: implement policy methods in gRPC store

* admin: use gRPC credential store for AdminServer

This ensures that all IAM and policy changes made through the Admin UI
are persisted via the Filer's IAM gRPC service instead of direct file manipulation.

* shell: s3.configure use granular IAM gRPC APIs instead of full config patching

* shell: s3.configure use granular IAM gRPC APIs

* shell: replace deprecated ioutil with os in s3.policy

* filer: use gRPC FailedPrecondition for unconfigured credential manager

* test: improve s3.policy integration tests and fix error checks

* ci: add s3 policy shell integration tests to github workflow

* filer: fix LoadCredentialConfiguration error handling

* credential/grpc: propagate unmarshal errors in GetPolicies

* filer/grpc: improve error handling and validation

* shell: use gRPC status codes in s3.configure

* credential: document PutPolicy as create-or-replace

* credential/postgres: reuse CreatePolicy in PutPolicy to deduplicate logic

* shell: add timeout context and strictly enforce flags in s3.policy

* iam: standardize policy content field naming in gRPC and proto

* shell: extract slice helper functions in s3.configure

* filer: map credential store errors to gRPC status codes

* filer: add input validation for UpdateUser and CreateAccessKey

* iam: improve validation in policy and config handlers

* filer: ensure IAM service registration by defaulting credential manager

* credential: add GetStoreName method to manager

* test: verify policy deletion in integration test
2026-01-25 13:39:30 -08:00
Lisandro Pin
59d40f7186 Return volume server state flags via VolumeServerStatus() RPCs. (#8016) 2026-01-24 21:45:23 -08:00
Chris Lu
753e1db096 Prevent split-brain: Persistent ClusterID and Join Validation (#8022)
* Prevent split-brain: Persistent ClusterID and Join Validation

- Persist ClusterId in Raft store to survive restarts.
- Validate ClusterId on Raft command application (piggybacked on MaxVolumeId).
- Prevent masters with conflicting ClusterIds from joining/operating together.
- Update Telemetry to report the persistent ClusterId.

* Refine ClusterID validation based on feedback

- Improved error message in cluster_commands.go.
- Added ClusterId mismatch check in RaftServer.Recovery.

* Handle Raft errors and support Hashicorp Raft for ClusterId

- Check for errors when persisting ClusterId in legacy Raft.
- Implement ClusterId generation and persistence for Hashicorp Raft leader changes.
- Ensure consistent error logging.

* Refactor ClusterId validation

- Centralize ClusterId mismatch check in Topology.SetClusterId.
- Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId.

* Fix goroutine leak and add timeout

- Handle channel closure in Hashicorp Raft leader listener.
- Add timeout to Raft Apply call to prevent blocking.

* Fix deadlock in legacy Raft listener

- Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock).

* Rename ClusterId to SystemId

- Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry).
- Regenerated telemetry.pb.go with new field.

* Rename SystemId to TopologyId

- Rename to SystemId was intermediate step.
- Final name is TopologyId for the persistent cluster identifier.
- Updated protobuf, topology, raft server, master server, and telemetry.

* Optimize Hashicorp Raft listener

- Integrated TopologyId generation into existing monitorLeaderLoop.
- Removed extra goroutine in master_server.go.

* Fix optimistic TopologyId update

- Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go.
- State is now solely updated via the Raft state machine Apply/Restore methods after consensus.

* Add explicit log for recovered TopologyId

- Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup.

* Add Raft barrier to prevent TopologyId race condition

- Implement ensureTopologyId helper method
- Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId
- Ensures persisted TopologyId is recovered before generating new one
- Prevents race where generation happens during log replay

* Serialize TopologyId generation with mutex

- Add topologyIdGenLock mutex to MasterServer struct
- Wrap ensureTopologyId method with lock to prevent concurrent generation
- Fixes race where event listener and manual leadership check both generate IDs
- Second caller waits for first to complete and sees the generated ID

* Add TopologyId recovery logging to Apply method

- Change log level from V(1) to V(0) for visibility
- Log 'Recovered TopologyId' when applying from Raft log
- Ensures recovery is visible whether from snapshot or log replay
- Matches Recovery() method logging for consistency

* Fix Raft barrier timing issue

- Add 100ms delay after barrier command to ensure log application completes
- Add debug logging to track barrier execution and TopologyId state
- Return early if barrier command fails
- Prevents TopologyId generation before old logs are fully applied

* ensure leader

* address comments

* address comments

* redundant

* clean up

* double check

* refactoring

* comment
2026-01-18 14:02:34 -08:00
Jaehoon Kim
f2e7af257d Fix volume.fsck -forcePurging -reallyDeleteFromVolume to fail fast on filer traversal errors (#8015)
* Add TraverseBfsWithContext and fix race conditions in error handling

- Add TraverseBfsWithContext function to support context cancellation
- Fix race condition in doTraverseBfsAndSaving using atomic.Bool and sync.Once
- Improve error handling with fail-fast behavior and proper error propagation
- Update command_volume_fsck to use error-returning saveFn callback
- Enhance error messages in readFilerFileIdFile with detailed context

* refactoring

* fix error format

* atomic

* filer_pb: make enqueue return void

* shell: simplify fs.meta.save error handling

* filer_pb: handle enqueue return value

* Revert "atomic"

This reverts commit 712648bc354b186d6654fdb8a46fd4848fdc4e00.

* shell: refine fs.meta.save logic

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-01-14 21:37:50 -08:00
Lisandro Pin
2af293ce60 Boostrap persistent state for volume servers. (#7984)
This PR implements logic load/save persistent state information for storages
associated with volume servers, and reporting state changes back to masters
via heartbeat messages.

More work ensues!

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.
2026-01-12 10:49:59 -08:00
promalert
9012069bd7 chore: execute goimports to format the code (#7983)
* chore: execute goimports to format the code

Signed-off-by: promalert <promalert@outlook.com>

* goimports -w .

---------

Signed-off-by: promalert <promalert@outlook.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-01-07 13:06:08 -08:00
Chris Lu
e67973dc53 Support Policy Attachment for Object Store Users (#7981)
* Implement Policy Attachment support for Object Store Users

- Added policy_names field to iam.proto and regenerated protos.
- Updated S3 API and IAM integration to support direct policy evaluation for users.
- Enhanced Admin UI to allow attaching policies to users via modals.
- Renamed 'policies' to 'policy_names' to clarify that it stores identifiers.
- Fixed syntax error in user_management.go.

* Fix policy dropdown not populating

The API returns {policies: [...]} but JavaScript was treating response as direct array.
Updated loadPolicies() to correctly access data.policies property.

* Add null safety checks for policy dropdowns

Added checks to prevent "undefined" errors when:
- Policy select elements don't exist
- Policy dropdowns haven't loaded yet
- User is being edited before policies are loaded

* Fix policy dropdown by using correct JSON field name

JSON response has lowercase 'name' field but JavaScript was accessing 'Name'.
Changed policy.Name to policy.name to match the IAMPolicy JSON structure.

* Fix policy names not being saved on user update

Changed condition from len(req.PolicyNames) > 0 to req.PolicyNames != nil
to ensure policy names are always updated when present in the request,
even if it's an empty array (to allow clearing policies).

* Add debug logging for policy names update flow

Added console.log in frontend and glog in backend to trace
policy_names data through the update process.

* Temporarily disable auto-reload for debugging

Commented out window.location.reload() so console logs are visible
when updating a user.

* Add detailed debug logging and alert for policy selection

Added console.log for each step and an alert to show policy_names value
to help diagnose why it's not being included in the request.

* Regenerate templ files for object_store_users

Ran templ generate to ensure _templ.go files are up to date with
the latest .templ changes including debug logging.

* Remove debug logging and restore normal functionality

Cleaned up temporary debug code (console.log and alert statements)
and re-enabled automatic page reload after user update.

* Add step-by-step alert debugging for policy update

Added 5 alert checkpoints to trace policy data through the update flow:
1. Check if policiesSelect element exists
2. Show selected policy values
3. Show userData.policy_names
4. Show full request body
5. Confirm server response

Temporarily disabled auto-reload to see alerts.

* Add version check alert on page load

Added alert on DOMContentLoaded to verify new JavaScript is being executed
and not cached by the browser.

* Compile templates using make

Ran make to compile all template files and install the weed binary.

* Add button click detection and make handleUpdateUser global

- Added inline alert on button click to verify click is detected
- Made handleUpdateUser a window-level function to ensure it's accessible
- Added alert at start of handleUpdateUser function

* Fix handleUpdateUser scope issue - remove duplicate definition

Removed duplicate function definition that was inside DOMContentLoaded.
Now handleUpdateUser is defined only once in global scope (line 383)
making it accessible when button onclick fires.

* Remove all duplicate handleUpdateUser definitions

Now handleUpdateUser is defined only once at the very top of the script
block (line 352), before DOMContentLoaded, ensuring it's available when
the button onclick fires.

* Add function existence check and error catching

Added alerts to check if handleUpdateUser is defined and wrapped
the function call in try-catch to capture any JavaScript errors.
Also added console.log statements to verify function definition.

* Simplify handleUpdateUser to non-async for testing

Removed async/await and added early return to test if function
can be called at all. This will help identify if async is causing
the issue.

* Add cache-control headers to prevent browser caching

Added no-cache headers to ShowObjectStoreUsers handler to prevent
aggressive browser caching of inline JavaScript in the HTML page.

* Fix syntax error - make handleUpdateUser async

Changed function back to async to fix 'await is only valid in async functions' error.
The cache-control headers are working - browser is now loading new code.

* Update version check to v3 to verify cache busting

Changed version alert to 'v3 - WITH EARLY RETURN' to confirm
the new code with early return statement is being loaded.

* Remove all debug code - clean implementation

Removed all alerts, console.logs, and test code.
Implemented clean policy update functionality with proper error handling.

* Add ETag header for cache-busting and update walkthrough

* Fix policy pre-selection in Edit User modal

- Updated admin.js editUser function to pre-select policies
- Root cause: duplicate editUser in admin.js overwrote inline version
- Added policy pre-selection logic to match inline template
- Verified working in browser: policies now pre-select correctly

* Fix policy persistence in handleUpdateUser

- Added policy_names field to userData payload in handleUpdateUser
- Policies were being lost because handleUpdateUser only sent email and actions
- Now collects selected policies from editPolicies dropdown
- Verified working: policies persist correctly across updates

* Fix XSS vulnerability in access keys display

- Escape HTML in access key display using escapeHtml utility
- Replace inline onclick handlers with data attributes
- Add event delegation for delete access key buttons
- Prevents script injection via malicious access key values

* Fix additional XSS vulnerabilities in user details display

- Escape HTML in actions badges (line 626)
- Escape HTML in policy_names badges (line 636)
- Prevents script injection via malicious action or policy names

* Fix XSS vulnerability in loadPolicies function

- Replace innerHTML string concatenation with DOM API
- Use createElement and textContent for safe policy name insertion
- Prevents script injection via malicious policy names
- Apply same pattern to both create and edit select elements

* Remove debug logging from UpdateObjectStoreUser

- Removed glog.V(0) debug statements
- Clean up temporary debugging code before production

* Remove duplicate handleUpdateUser function

- Removed inline handleUpdateUser that duplicated admin.js logic
- Removed debug console.log statement
- admin.js version is now the single source of truth
- Eliminates maintenance burden of keeping two versions in sync

* Refine user management and address code review feedback

- Preserve PolicyNames in UpdateUserPolicies
- Allow clearing actions in UpdateObjectStoreUser by checking for nil
- Remove version comment from object_store_users.templ
- Refactor loadPolicies for DRYness using cloneNode while keeping DOM API security

* IAM Authorization for Static Access Keys

* verified XSS Fixes in Templates

* fix div
2026-01-06 21:53:28 -08:00
Chris Lu
31a4f57cd9 Fix: Add -admin.grpc flag to worker for explicit gRPC port (#7926) (#7927)
* Fix: Add -admin.grpc flag to worker for explicit gRPC port configuration

* Fix(helm): Add adminGrpcServer to worker configuration

* Refactor: Support host:port.grpcPort address format, revert -admin.grpc flag

* Helm: Conditionally append grpcPort to worker admin address

* weed/admin: fix "send on closed channel" panic in worker gRPC server

Make unregisterWorker connection-aware to prevent closing channels
belonging to newer connections.

* weed/worker: improve gRPC client stability and logging

- Fix goroutine leak in reconnection logic
- Refactor reconnection loop to exit on success and prevent busy-waiting
- Add session identification and enhanced logging to client handlers
- Use constant for internal reset action and remove unused variables

* weed/worker: fix worker state initialization and add lifecycle logs

- Revert workerState to use running boolean correctly
- Prevent handleStart failing by checking running state instead of startTime
- Add more detailed logs for worker startup events
2025-12-31 11:55:09 -08:00
Chris Lu
ae9a943ef6 IAM: Add Service Account Support (#7744) (#7901)
* iam: add ServiceAccount protobuf schema

Add ServiceAccount message type to iam.proto with support for:
- Unique ID and parent user linkage
- Optional expiration timestamp
- Separate credentials (access key/secret)
- Action restrictions (subset of parent)
- Enable/disable status

This is the first step toward implementing issue #7744
(IAM Service Account Support).

* iam: add service account response types

Add IAM API response types for service account operations:
- ServiceAccountInfo struct for marshaling account details
- CreateServiceAccountResponse
- DeleteServiceAccountResponse
- ListServiceAccountsResponse
- GetServiceAccountResponse
- UpdateServiceAccountResponse

Also add type aliases in iamapi package for backwards compatibility.

Part of issue #7744 (IAM Service Account Support).

* iam: implement service account API handlers

Add CRUD operations for service accounts:
- CreateServiceAccount: Creates service account with ABIA key prefix
- DeleteServiceAccount: Removes service account and parent linkage
- ListServiceAccounts: Lists all or filtered by parent user
- GetServiceAccount: Retrieves service account details
- UpdateServiceAccount: Modifies status, description, expiration

Service accounts inherit parent user's actions by default and
support optional expiration timestamps.

Part of issue #7744 (IAM Service Account Support).

* sts: add AssumeRoleWithWebIdentity HTTP endpoint

Add STS API HTTP endpoint for AWS SDK compatibility:
- Create s3api_sts.go with HTTP handlers matching AWS STS spec
- Support AssumeRoleWithWebIdentity action with JWT token
- Return XML response with temporary credentials (AccessKeyId,
  SecretAccessKey, SessionToken) matching AWS format
- Register STS route at POST /?Action=AssumeRoleWithWebIdentity

This enables AWS SDKs (boto3, AWS CLI, etc.) to obtain temporary
S3 credentials using OIDC/JWT tokens.

Part of issue #7744 (IAM Service Account Support).

* test: add service account and STS integration tests

Add integration tests for new IAM features:

s3_service_account_test.go:
- TestServiceAccountLifecycle: Create, Get, List, Update, Delete
- TestServiceAccountValidation: Error handling for missing params

s3_sts_test.go:
- TestAssumeRoleWithWebIdentityValidation: Parameter validation
- TestAssumeRoleWithWebIdentityWithMockJWT: JWT token handling

Tests skip gracefully when SeaweedFS is not running or when IAM
features are not configured.

Part of issue #7744 (IAM Service Account Support).

* iam: address code review comments

- Add constants for service account ID and key lengths
- Use strconv.ParseInt instead of fmt.Sscanf for better error handling
- Allow clearing descriptions by checking key existence in url.Values
- Replace magic numbers (12, 20, 40) with named constants

Addresses review comments from gemini-code-assist[bot]

* test: add proper error handling in service account tests

Use require.NoError(t, err) for io.ReadAll and xml.Unmarshal
to prevent silent failures and ensure test reliability.

Addresses review comment from gemini-code-assist[bot]

* test: add proper error handling in STS tests

Use require.NoError(t, err) for io.ReadAll and xml.Unmarshal
to prevent silent failures and ensure test reliability.
Repeated this fix throughout the file.

Addresses review comment from gemini-code-assist[bot] in PR #7901.

* iam: address additional code review comments

- Specific error code mapping for STS service errors
- Distinguish between Sender and Receiver error types in STS responses
- Add nil checks for credentials in List/GetServiceAccount
- Validate expiration date is in the future
- Improve integration test error messages (include response body)
- Add credential verification step in service account tests

Addresses remaining review comments from gemini-code-assist[bot] across multiple files.

* iam: fix shared slice reference in service account creation

Copy parent's actions to create an independent slice for the service
account instead of sharing the underlying array. This prevents
unexpected mutations when the parent's actions are modified later.

Addresses review comment from coderabbitai[bot] in PR #7901.

* iam: remove duplicate unused constant

Removed redundant iamServiceAccountKeyPrefix as ServiceAccountKeyPrefix
is already defined and used.

Addresses remaining cleanup task.

* sts: document limitation of string-based error mapping

Added TODO comment explaining that the current string-based error
mapping approach is fragile and should be replaced with typed errors
from the STS service in a future refactoring.

This addresses the architectural concern raised in code review while
deferring the actual implementation to a separate PR to avoid scope
creep in the current service account feature addition.

* iam: fix remaining review issues

- Add future-date validation for expiration in UpdateServiceAccount
- Reorder tests so credential verification happens before deletion
- Fix compilation error by using correct JWT generation methods

Addresses final review comments from coderabbitai[bot].

* iam: fix service account access key length

The access key IDs were incorrectly generated with 24 characters
instead of the AWS-standard 20 characters. This was caused by
generating 20 random characters and then prepending the 4-character
ABIA prefix.

Fixed by subtracting the prefix length from AccessKeyLength, so the
final key is: ABIA (4 chars) + random (16 chars) = 20 chars total.

This ensures compatibility with S3 clients that validate key length.

* test: add comprehensive service account security tests

Added comprehensive integration tests for service account functionality:

- TestServiceAccountS3Access: Verify SA credentials work for S3 operations
- TestServiceAccountExpiration: Test expiration date validation and enforcement
- TestServiceAccountInheritedPermissions: Verify parent-child relationship
- TestServiceAccountAccessKeyFormat: Validate AWS-compatible key format (ABIA prefix, 20 char length)

These tests ensure SeaweedFS service accounts are compatible with AWS
conventions and provide robust security coverage.

* iam: remove unused UserAccessKeyPrefix constant

Code cleanup to remove unused constants.

* iam: remove unused iamCommonResponse type alias

Code cleanup to remove unused type aliases.

* iam: restore and use UserAccessKeyPrefix constant

Restored UserAccessKeyPrefix constant and updated s3api tests to use it
instead of hardcoded strings for better maintainability and consistency.

* test: improve error handling in service account security tests

Added explicit error checking for io.ReadAll and xml.Unmarshal in
TestServiceAccountExpiration to ensure failures are reported correctly and
cleanup is performed only when appropriate. Also added logging for failed
responses.

* test: use t.Cleanup for reliable resource cleanup

Replaced defer with t.Cleanup to ensure service account cleanup runs even
when require.NoError fails. Also switched from manual error checking to
require.NoError for more idiomatic testify usage.

* iam: add CreatedBy field and optimize identity lookups

- Added createdBy parameter to CreateServiceAccount to track who created each service account
- Extract creator identity from request context using GetIdentityNameFromContext
- Populate created_by field in ServiceAccount protobuf
- Added findIdentityByName helper function to optimize identity lookups
- Replaced nested loops with O(n) helper function calls in CreateServiceAccount and DeleteServiceAccount

This addresses code review feedback for better auditing and performance.

* iam: prevent user deletion when service accounts exist

Following AWS IAM behavior, prevent deletion of users that have active
service accounts. This ensures explicit cleanup and prevents orphaned
service account resources with invalid ParentUser references.

Users must delete all associated service accounts before deleting the
parent user, providing safer resource management.

* sts: enhance TODO with typed error implementation guidance

Updated TODO comment with detailed implementation approach for replacing
string-based error matching with typed errors using errors.Is(). This
provides a clear roadmap for a follow-up PR to improve error handling
robustness and maintainability.

* iam: add operational limits for service account creation

Added AWS IAM-compatible safeguards to prevent resource exhaustion:
- Maximum 100 service accounts per user (LimitExceededException)
- Maximum 1000 character description length (InvalidInputException)

These limits prevent accidental or malicious resource exhaustion while
not impacting legitimate use cases.

* iam: add missing operational limit constants

Added MaxServiceAccountsPerUser and MaxDescriptionLength constants that
were referenced in the previous commit but not defined.

* iam: enforce service account expiration during authentication

CRITICAL SECURITY FIX: Expired service account credentials were not being
rejected during authentication, allowing continued access after expiration.

Changes:
- Added Expiration field to Credential struct
- Populate expiration when loading service accounts from configuration
- Check expiration in all authentication paths (V2 and V4 signatures)
- Return ErrExpiredToken for expired credentials

This ensures expired service accounts are properly rejected at authentication
time, matching AWS IAM behavior and preventing unauthorized access.

* iam: fix error code for expired service account credentials

Use ErrAccessDenied instead of non-existent ErrExpiredToken for expired
service account credentials. This provides appropriate access denial for
expired credentials while maintaining AWS-compatible error responses.

* iam: fix remaining ErrExpiredToken references

Replace all remaining instances of non-existent ErrExpiredToken with
ErrAccessDenied for expired service account credentials.

* iam: apply AWS-standard key format to user access keys

Updated CreateAccessKey to generate AWS-standard 20-character access keys
with AKIA prefix for regular users, matching the format used for service
accounts. This ensures consistency across all access key types and full
AWS compatibility.

- Access keys: AKIA + 16 random chars = 20 total (was 21 chars, no prefix)
- Secret keys: 40 random chars (was 42, now matches AWS standard)
- Uses AccessKeyLength and UserAccessKeyPrefix constants

* sts: replace fragile string-based error matching with typed errors

Implemented robust error handling using typed errors and errors.Is() instead
of fragile strings.Contains() matching. This decouples the HTTP layer from
service implementation details and prevents errors from being miscategorized
if error messages change.

Changes:
- Added typed error variables to weed/iam/sts/constants.go:
  * ErrTypedTokenExpired
  * ErrTypedInvalidToken
  * ErrTypedInvalidIssuer
  * ErrTypedInvalidAudience
  * ErrTypedMissingClaims

- Updated STS service to wrap provider authentication errors with typed errors
- Replaced strings.Contains() with errors.Is() in HTTP layer for error checking
- Removed TODO comment as the improvement is now implemented

This makes error handling more maintainable and reliable.

* sts: eliminate all string-based error matching with provider-level typed errors

Completed the typed error implementation by adding provider-level typed errors
and updating provider implementations to return them. This eliminates ALL
fragile string matching throughout the entire error handling stack.

Changes:
- Added typed error definitions to weed/iam/providers/errors.go:
  * ErrProviderTokenExpired
  * ErrProviderInvalidToken
  * ErrProviderInvalidIssuer
  * ErrProviderInvalidAudience
  * ErrProviderMissingClaims

- Updated OIDC provider to wrap JWT validation errors with typed provider errors
- Replaced strings.Contains() with errors.Is() in STS service for error mapping
- Complete error chain: Provider -> STS -> HTTP layer, all using errors.Is()

This provides:
- Reliable error classification independent of error message content
- Type-safe error checking throughout the stack
- No order-dependent string matching
- Maintainable error handling that won't break with message changes

* oidc: use jwt.ErrTokenExpired instead of string matching

Replaced the last remaining string-based error check with the JWT library's
exported typed error. This makes the error detection independent of error
message content and more robust against library updates.

Changed from:
  strings.Contains(errMsg, "expired")
To:
  errors.Is(err, jwt.ErrTokenExpired)

This completes the elimination of ALL string-based error matching throughout
the entire authentication stack.

* iam: add description length validation to UpdateServiceAccount

Fixed inconsistency where UpdateServiceAccount didn't validate description
length against MaxDescriptionLength, allowing operational limits to be
bypassed during updates.

Now validates that updated descriptions don't exceed 1000 characters,
matching the validation in CreateServiceAccount.

* iam: refactor expiration check into helper method

Extracted duplicated credential expiration check logic into a helper method
to reduce code duplication and improve maintainability.

Added Credential.isCredentialExpired() method and replaced 5 instances of
inline expiration checks across auth_signature_v2.go and auth_signature_v4.go.

* iam: address critical Copilot security and consistency feedback

Fixed three critical issues identified by Copilot code review:

1. SECURITY: Prevent loading disabled service account credentials
   - Added check to skip disabled service accounts during credential loading
   - Disabled accounts can no longer authenticate

2. Add DurationSeconds validation for STS AssumeRoleWithWebIdentity
   - Enforce AWS-compatible range: 900-43200 seconds (15 min - 12 hours)
   - Returns proper error for out-of-range values

3. Fix expiration update consistency in UpdateServiceAccount
   - Added key existence check like Description field
   - Allows explicit clearing of expiration by setting to empty string
   - Distinguishes between "not updating" and "clearing expiration"

* sts: remove unused durationSecondsStr variable

Fixed build error from unused variable after refactoring duration parsing.

* iam: address remaining Copilot feedback and remove dead code

Completed remaining Copilot code review items:

1. Remove unused getPermission() method (dead code)
   - Method was defined but never called anywhere

2. Improve slice modification safety in DeleteServiceAccount
   - Replaced append-with-slice-operations with filter pattern
   - Avoids potential issues from mutating slice during iteration

3. Fix route registration order
   - Moved STS route registration BEFORE IAM route
   - Prevents IAM route from intercepting STS requests
   - More specific route (with query parameter) now registered first

* iam: improve expiration validation and test cleanup robustness

Addressed additional Copilot feedback:

1. Make expiration validation more explicit
   - Added explicit check for negative values
   - Added comment clarifying that 0 is allowed to clear expiration
   - Improves code readability and intent

2. Fix test cleanup order in s3_service_account_test.go
   - Track created service accounts in a slice
   - Delete all service accounts before deleting parent user
   - Prevents DeleteConflictException during cleanup
   - More robust cleanup even if test fails mid-execution

Note: s3_service_account_security_test.go already had correct cleanup
order due to LIFO defer execution.

* test: remove redundant variable assignments

Removed duplicate assignments of createdSAId, createdAccessKeyId, and
createdSecretAccessKey on lines 148-150 that were already assigned on
lines 132-134.
2025-12-29 20:17:23 -08:00
Chris Lu
56ebd9236e Fix: relax gRPC server keepalive enforcement to 20s (#7898)
* Fix: relax gRPC server keepalive enforcement to 20s to prevent 'too_many_pings' errors

* Refactor: introduce GrpcKeepAliveMinimumTime constant for clarity
2025-12-28 12:03:21 -08:00
Chris Lu
95716a2f87 less verbose 2025-12-26 12:55:19 -08:00
Chris Lu
5aa111708d grpc: reduce client idle pings to avoid ENHANCE_YOUR_CALM (#7885)
* grpc: reduce client idle pings to avoid ENHANCE_YOUR_CALM (too_many_pings)

* test: use context.WithTimeout and pb constants for keepalive

* test(kafka): use separate dial and client contexts in NewDirectBrokerClient

* test(kafka): fix client context usage in NewDirectBrokerClient
2025-12-26 10:58:18 -08:00
Chris Lu
2567be8040 refactor: remove unused gRPC connection age parameters (#7852)
The GrpcMaxConnectionAge and GrpcMaxConnectionAgeGrace constants have a troubled
history - they were removed in 2022 due to gRPC issues, reverted later, and
recently re-added. However, they are not essential to the core worker reconnection
fix which was solved through proper goroutine ordering.

The Docker Swarm DNS handling mentioned in the comments is not critical, and
these parameters have caused problems in the past. Removing them simplifies
the configuration without losing functionality.
2025-12-22 18:25:21 -08:00