seaweedFS

Author	SHA1	Message	Date
Chris Lu	75a6a34528	dlm: resilient distributed locks via consistent hashing + backup replication (#8860 ) * dlm: replace modulo hashing with consistent hash ring Introduce HashRing with virtual nodes (CRC32-based consistent hashing) to replace the modulo-based hashKeyToServer. When a filer node is removed, only keys that hashed to that node are remapped to the next server on the ring, leaving all other mappings stable. This is the foundation for backup replication — the successor on the ring is always the natural takeover node. * dlm: add Generation and IsBackup fields to Lock Lock now carries IsBackup (whether this node holds the lock as a backup replica) and Generation (a monotonic fencing token that increments on each fresh acquisition, stays the same on renewal). Add helper methods: AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock. * dlm: add ReplicateLock RPC and generation/is_backup proto fields Add generation field to LockResponse for fencing tokens. Add generation and is_backup fields to Lock message. Add ReplicateLock RPC for primary-to-backup lock replication. Add ReplicateLockRequest/ReplicateLockResponse messages. * dlm: add async backup replication to DistributedLockManager Route lock/unlock via consistent hash ring's GetPrimaryAndBackup(). After a successful lock or unlock on the primary, asynchronously replicate the operation to the backup server via ReplicateFunc callback. Single-server deployments skip replication. * dlm: add ReplicateLock handler and backup-aware topology changes Add ReplicateLock gRPC handler for primary-to-backup replication. Revise OnDlmChangeSnapshot to handle three cases on topology change: - Promote backup locks when this node becomes primary - Demote primary locks when this node becomes backup - Transfer locks when this node is neither primary nor backup Wire up SetupDlmReplication during filer server initialization. * dlm: expose generation fencing token in lock client LiveLock now captures the generation from LockResponse and exposes it via Generation() method. Consumers can use this as a fencing token to detect stale lock holders. * dlm: update empty folder cleaner to use consistent hash ring Replace local modulo-based hashKeyToServer with LockRing.GetPrimary() which uses the shared consistent hash ring for folder ownership. * dlm: add unit tests for consistent hash ring Test basic operations, consistency on server removal (only keys from removed server move), backup-is-successor property (backup becomes new primary when primary is removed), and key distribution balance. * dlm: add integration tests for lock replication failure scenarios Test cases: - Primary crash with backup promotion (backup has valid token) - Backup crash with primary continuing - Both primary and backup crash (lock lost, re-acquirable) - Rolling restart across all nodes - Generation fencing token increments on new acquisition - Replication failure (primary still works independently) - Unlock replicates deletion to backup - Lock survives server addition (topology change) - Consistent hashing minimal disruption (only removed server's keys move) * dlm: address PR review findings 1. Causal replication ordering: Add per-lock sequence number (Seq) that increments on every mutation. Backup rejects incoming mutations with seq <= current seq, preventing stale async replications from overwriting newer state. Unlock replication also carries seq and is rejected if stale. 2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to the new primary first and only demotes to backup after a successful TransferLocks RPC. If the transfer fails, the lock stays as primary on this node. 3. SetSnapshot candidateServers leak: Replace the candidateServers map entirely instead of appending, so removed servers don't linger. 4. TransferLocks preserves Generation and Seq: InsertLock now accepts generation and seq parameters. After accepting a transferred lock, the receiving node re-replicates to its backup. 5. Rolling restart test: Add re-replication step after promotion and assert survivedCount > 0. Add TestDLM_StaleReplicationRejected. 6. Mixed-version upgrade note: Add comment on HashRing documenting that all filer nodes must be upgraded together. * dlm: serve renewals locally during transfer window on node join When a new node joins and steals hash ranges from surviving nodes, there's a window between ring update and lock transfer where the client gets redirected to a node that doesn't have the lock yet. Fix: if the ring says primary != self but we still hold the lock locally (non-backup, matching token), serve the renewal/unlock here rather than redirecting. The lock will be transferred by OnDlmChangeSnapshot, and subsequent requests will go to the new primary once the transfer completes. Add tests: - TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption when a node drops and a new one joins (14/100 surviving-node locks disrupted, all handled by transfer logic) - TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on old primary during the transfer window * dlm: master-managed lock ring with stabilization batching The master now owns the lock ring membership. Instead of filers independently reacting to individual ClusterNodeUpdate add/remove events, the master: 1. Tracks filer membership in LockRingManager 2. Batches rapid changes with a 1-second stabilization timer (e.g., a node drop + join within 1 second → single ring update) 3. Broadcasts the complete ring snapshot atomically via the new LockRingUpdate message in KeepConnectedResponse Filers receive the ring as a complete snapshot and apply it via SetSnapshot, ensuring all filers converge to the same ring state without intermediate churn. This eliminates the double-churn problem where a rapid drop+join would fire two separate ring mutations, each triggering lock transfers and disrupting ownership on surviving nodes. * dlm: track ring version, reject stale updates, remove dead code SetSnapshot now takes a version parameter from the master. Stale updates (version < current) are rejected, preventing reordered messages from overwriting a newer ring state. Version 0 is always accepted for bootstrap. Remove AddServer/RemoveServer from LockRing — the ring is now exclusively managed by the master via SetSnapshot. Remove the candidateServers map that was only used by those methods. * dlm: fix SelectLocks data race, advance generation on backup insert - SelectLocks: change RLock to Lock since the function deletes map entries, which is a write operation and causes a data race under RLock. - InsertBackupLock: advance nextGeneration to at least the incoming generation so that after failover promotion, new lock acquisitions get a generation strictly greater than any replicated lock. - Bump replication failure log from V(1) to Warningf for production visibility. * dlm: fix SetSnapshot race, test reliability, timer edge cases - SetSnapshot: hold LockRing lock through both version update and Ring.SetServers() so they're atomic. Prevents a concurrent caller from seeing the new version but applying stale servers. - Transfer window test: search for a key that actually moves primary when filer4 joins, instead of relying on a fixed key that may not. - renewLock redirect: pass the existing token to the new primary instead of empty string, so redirected renewals work correctly. - scheduleBroadcast: check timer.Stop() return value. If the timer already fired, the callback picks up latest state. - FlushPending: only broadcast if timer.Stop() returns true (timer was still pending). If false, the callback is already running. - Fix test comment: "idempotent" → "accepted, state-changing". * dlm: use wall-clock nanoseconds for lock ring version The lock ring version was an in-memory counter that reset to 0 on master restart. A filer that had seen version 5 would reject version 1 from the restarted master. Fix: use time.Now().UnixNano() as the version. This survives master restarts without persistence — the restarted master produces a version greater than any pre-restart value. * dlm: treat expired lock owners as missing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: reject stale lock transfers Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: order replication by generation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: bootstrap lock ring on reconnect Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-30 23:29:56 -07:00
Chris Lu	ced2236cc6	Adjust rename events metadata format (#8854 ) * rename metadata events * fix subscription filter to use NewEntry.Name for rename path matching The server-side subscription filter constructed the new path using OldEntry.Name instead of NewEntry.Name when checking if a rename event's destination matches the subscriber's path prefix. This could cause events to be incorrectly filtered when a rename changes the file name. * fix bucket events to handle rename of bucket directories onBucketEvents only checked IsCreate and IsDelete. A bucket directory rename via AtomicRenameEntry now emits a single rename event (both OldEntry and NewEntry non-nil), which matched neither check. Handle IsRename by deleting the old bucket and creating the new one. * fix replicator to handle rename events across directory boundaries Two issues fixed: 1. The replicator filtered events by checking if the key (old path) was under the source directory. Rename events now use the old path as key, so renames from outside into the watched directory were silently dropped. Now both old and new paths are checked, and cross-boundary renames are converted to create or delete. 2. NewParentPath was passed to the sink without remapping to the sink's target directory structure, causing the sink to write entries at the wrong location. Now NewParentPath is remapped alongside the key. * fix filer sync to handle rename events crossing directory boundaries The early directory-prefix filter only checked resp.Directory (old parent). Rename events now carry the old parent as Directory, so renames from outside the source path into it were dropped before reaching the existing cross-boundary handling logic. Check both old and new directories against sourcePath and excludePaths so the downstream old-key/new-key logic can properly convert these to create or delete operations. * fix metadata event path matching * fix metadata event consumers for rename targets * Fix replication rename target keys Logical rename events now reach replication sinks with distinct source and target paths.\n\nHandle non-filer sinks as delete-plus-create on the translated target key, and make the rename fallback path create at the translated target key too.\n\nAdd focused tests covering non-filer renames, filer rename updates, and the fallback path.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix filer sync rename path scoping Use directory-boundary matching instead of raw prefix checks when classifying source and target paths during filer sync.\n\nAlso apply excludePaths per side so renames across excluded boundaries downgrade cleanly to create/delete instead of being misclassified as in-scope updates.\n\nAdd focused tests for boundary matching and rename classification.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix replicator directory boundary checks Use directory-boundary matching instead of raw prefix checks when deciding whether a source or target path is inside the watched tree or an excluded subtree.\n\nThis prevents sibling paths such as /foo and /foobar from being misclassified during rename handling, and preserves the earlier rename-target-key fix.\n\nAdd focused tests for boundary matching and rename classification across sibling/excluded directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix etc-remote rename-out handling Use boundary-safe source/target directory membership when classifying metadata events under DirectoryEtcRemote.\n\nThis prevents rename-out events from being processed as config updates, while still treating them as removals where appropriate for the remote sync and remote gateway command paths.\n\nAdd focused tests for update/removal classification and sibling-prefix handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Defer rename events until commit Queue logical rename metadata events during atomic and streaming renames and publish them only after the transaction commits successfully.\n\nThis prevents subscribers from seeing delete or logical rename events for operations that later fail during delete or commit.\n\nAlso serialize notification.Queue swaps in rename tests and add failure-path coverage.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Skip descendant rename target lookups Avoid redundant target lookups during recursive directory renames once the destination subtree is known absent.\n\nThe recursive move path now inserts known-absent descendants directly, and the test harness exercises prefixed directory listing so the optimization is covered by a directory rename regression test.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tighten rename review tests Return filer_pb.ErrNotFound from the bucket tracking store test stub so it follows the FilerStore contract, and add a webhook filter case for same-name renames across parent directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix HardLinkId format verb in InsertEntryKnownAbsent error HardLinkId is a byte slice. %d prints each byte as a decimal number which is not useful for an identifier. Use %x to match the log line two lines above. * only skip descendant target lookup when source and dest use same store moveFolderSubEntries unconditionally passed skipTargetLookup=true for every descendant. This is safe when all paths resolve to the same underlying store, but with path-specific store configuration a child's destination may map to a different backend that already holds an entry at that path. Use FilerStoreWrapper.SameActualStore to check per-child and fall back to the full CreateEntry path when stores differ. * add nil and create edge-case tests for metadata event scope helpers * extract pathIsEqualOrUnder into util.IsEqualOrUnder Identical implementations existed in both replication/replicator.go and command/filer_sync.go. Move to util.IsEqualOrUnder (alongside the existing FullPath.IsUnder) and remove the duplicates. * use MetadataEventTargetDirectory for new-side directory in filer sync The new-side directory checks and sourceNewKey computation used message.NewParentPath directly. If NewParentPath were empty (legacy events, older filer versions during rolling upgrades), sourceNewKey would be wrong (/filename instead of /dir/filename) and the UpdateEntry parent path rewrite would panic on slice bounds. Derive targetDir once from MetadataEventTargetDirectory, which falls back to resp.Directory when NewParentPath is empty, and use it consistently for all new-side checks and the sink parent path.	2026-03-30 18:25:11 -07:00
Chris Lu	2eaf98a7a2	Use Unix sockets for gRPC in mini mode (#8856 ) * Use Unix sockets for gRPC between co-located services in mini mode In `weed mini`, all services run in one process. Previously, inter-service gRPC traffic (volume↔master, filer↔master, S3↔filer, worker↔admin, etc.) went through TCP loopback. This adds a gRPC Unix socket registry in the pb package: mini mode registers a socket path per gRPC port at startup, each gRPC server additionally listens on its socket, and GrpcDial transparently routes to the socket via WithContextDialer when a match is found. Standalone commands (weed master, weed filer, etc.) are unaffected since no sockets are registered. TCP listeners are kept for external clients. * Handle Serve error and clean up socket file in ServeGrpcOnLocalSocket Log non-expected errors from grpcServer.Serve (ignoring grpc.ErrServerStopped) and always remove the Unix socket file when Serve returns, ensuring cleanup on Stop/GracefulStop.	2026-03-30 18:18:52 -07:00
Chris Lu	c2c58419b8	filer.sync: send log file chunk fids to clients for direct volume server reads (#8792 ) * filer.sync: send log file chunk fids to clients for direct volume server reads Instead of the server reading persisted log files from volume servers, parsing entries, and streaming them over gRPC (serial bottleneck), clients that opt in via client_supports_metadata_chunks receive log file chunk references (fids) and read directly from volume servers in parallel. New proto messages: - LogFileChunkRef: chunk fids + timestamp + filer ID for one log file - SubscribeMetadataRequest.client_supports_metadata_chunks: client opt-in - SubscribeMetadataResponse.log_file_refs: server sends refs during backlog Server changes: - CollectLogFileRefs: lists log files and returns chunk refs without any volume server I/O (metadata-only operation) - SubscribeMetadata/SubscribeLocalMetadata: when client opts in, sends refs during persisted log phase, then falls back to normal streaming for in-memory events Client changes: - ReadLogFileRefs: reads log files from volume servers, parses entries, filters by path prefix, invokes processEventFn - MetadataFollowOption.LogFileReaderFn: factory for chunk readers, enables metadata chunks when non-nil - Both filer_pb_tail.go and meta_aggregator.go recv loops accumulate refs then process them at the disk→memory transition Backward compatible: old clients don't set the flag, get existing behavior. Ref: #8771 * filer.sync: merge entries across filers in timestamp order on client side ReadLogFileRefs now groups refs by filer ID and merges entries from multiple filers using a min-heap priority queue — the same algorithm the server uses in OrderedLogVisitor + LogEntryItemPriorityQueue. This ensures events are processed in correct timestamp order even when log files from different filers have interleaved timestamps. Single-filer case takes the fast path (no heap allocation). * filer.sync: integration tests for direct-read metadata chunks Three test categories: 1. Merge correctness (TestReadLogFileRefsMergeOrder): Verifies entries from 3 filers are delivered in strict timestamp order, matching the server-side OrderedLogVisitor guarantee. 2. Path filtering (TestReadLogFileRefsPathFilter): Verifies client-side path prefix filtering works correctly. 3. Throughput comparison (TestDirectReadVsServerSideThroughput): 3 filers × 7 files × 300 events = 6300 events, 2ms per file read: server-side: 6300 events 218ms 28,873 events/sec direct-read: 6300 events 51ms 123,566 events/sec (4.3x) parallel: 6300 events 17ms 378,628 events/sec (13.1x) Direct-read eliminates gRPC send overhead per event (4.3x). Parallel per-filer reading eliminates serial file I/O (13.1x). * filer.sync: parallel per-filer reads with prefetching in ReadLogFileRefs ReadLogFileRefs now has two levels of I/O overlap: 1. Cross-filer parallelism: one goroutine per filer reads its files concurrently. Entries feed into per-filer channels, merged by the main goroutine via min-heap (same ordering guarantee as the server's OrderedLogVisitor). 2. Within-filer prefetching: while the current file's entries are being consumed by the merge heap, the next file is already being read from the volume server in a background goroutine. Single-filer fast path avoids the heap and channels. Test results (3 filers × 7 files × 300 events, 2ms per file read): server-side sequential: 6300 events 212ms 29,760 events/sec parallel + prefetch: 6300 events 36ms 177,443 events/sec Speedup: 6.0x * filer.sync: address all review comments on metadata chunks PR Critical fixes: - sendLogFileRefs: bypass pipelinedSender, send directly on gRPC stream. Ref messages have TsNs=0 and were being incorrectly batched into the Events field by the adaptive batching logic, corrupting ref delivery. - readLogFileEntries: use io.ReadFull instead of reader.Read to prevent partial reads from corrupting size values or protobuf data. - Error handling: only skip chunk-not-found errors (matching server-side isChunkNotFoundError). Other I/O or decode failures are propagated so the follower can retry. High-priority fixes: - CollectLogFileRefs: remove incorrect +24h padding from stopTime. The extra day caused unnecessary log file refs to be collected. - Path filtering: ReadLogFileRefs now accepts PathFilter struct with PathPrefix, AdditionalPathPrefixes, and DirectoriesToWatch. Uses util.Join for path construction (avoids "//foo" on root). Excludes /.system/log/ internal entries. Matches server-side eachEventNotificationFn filtering logic. Medium-priority fixes: - CollectLogFileRefs: accept context.Context, propagate to ListDirectoryEntries calls for cancellation support. - NewChunkStreamReaderFromLookup: accept context.Context, propagate to doNewChunkStreamReader. Test fixes: - Check error returns from ReadLogFileRefs in all test call sites. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-27 11:01:29 -07:00
Chris Lu	d97660d0cd	filer.sync: pipelined subscription with adaptive batching for faster catch-up (#8791 ) * filer.sync: pipelined subscription with adaptive batching for faster catch-up The SubscribeMetadata pipeline was fully serial: reading a log entry from a volume server, unmarshaling, filtering, and calling stream.Send() all happened one-at-a-time. stream.Send() blocked the entire pipeline until the client acknowledged each event, limiting throughput to ~80 events/sec regardless of the -concurrency setting. Three server-side optimizations that stack: 1. Pipelined sender: decouple stream.Send() from the read loop via a buffered channel (1024 messages). A dedicated goroutine handles gRPC delivery while the reader continues processing the next events. 2. Adaptive batching: when event timestamps are >2min behind wall clock (backlog catch-up), drain multiple events from the channel and pack them into a single stream.Send() using a new `repeated events` field on SubscribeMetadataResponse. When events are recent (real-time), send one-by-one for low latency. Old clients ignore the new field (backward compatible). 3. Persisted log readahead: run the OrderedLogVisitor in a background goroutine so volume server I/O for the next log file overlaps with event processing and gRPC delivery. 4. Event-driven aggregated subscription: replace time.Sleep(1127ms) polling in SubscribeMetadata with notification-driven wake-up using the MetaLogBuffer subscriber mechanism, reducing real-time latency from ~1127ms to sub-millisecond. Combined, these create a 3-stage pipeline: [Volume I/O → readahead buffer] → [Filter → send buffer] → [gRPC Send] Test results (simulated backlog with 50µs gRPC latency per Send): direct (old): 2100 events 2100 sends 168ms 12,512 events/sec pipelined+batched: 2100 events 14 sends 40ms 52,856 events/sec Speedup: 4.2x single-stream throughput Ref: #8771 * filer.sync: require client opt-in for batch event delivery Add ClientSupportsBatching field to SubscribeMetadataRequest. The server only packs events into the Events batch field when the client explicitly sets this flag to true. Old clients (Java SDK, third-party) that don't set the flag get one-event-per-Send, preserving backward compatibility. All Go callers (FollowMetadata, MetaAggregator) set the flag to true since their recv loops already unpack batched events. * filer.sync: clear batch Events field after Send to release references Prevents the envelope message from holding references to the rest of the batch after gRPC serialization, allowing the GC to collect them sooner. * filer.sync: fix Send deadlock, add error propagation test, event-driven local subscribe - pipelinedSender.Send: add case <-s.done to unblock when sender goroutine exits (fixes deadlock when errCh was already consumed by a prior Send). - pipelinedSender.reportErr: remove for-range drain on sendCh that could block indefinitely. Send() now detects exit via s.done instead. - SubscribeLocalMetadata: replace remaining time.Sleep(1127ms) in the gap-detected-no-memory-data path with event-driven listenersCond.Wait(), consistent with the rest of the subscription paths. - Add TestPipelinedSenderErrorPropagation: verifies error surfaces via Send and Close when the underlying stream fails. - Replace goto with labeled break in test simulatePipeline. * filer.sync: check error returns in test code - direct_send: check slowStream.Send error return - pipelined_batched_send: check sender.Close error return - simulatePipeline: return error from sender.Close, propagate to callers --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-26 23:55:42 -07:00
Chris Lu	ba624f1f34	Rust volume server implementation with CI (#8539 ) * Match Go gRPC client transport defaults * Honor Go HTTP idle timeout * Honor maintenanceMBps during volume copy * Honor images.fix.orientation on uploads * Honor cpuprofile when pprof is disabled * Match Go memory status payloads * Propagate request IDs across gRPC calls * Format pending Rust source updates * Match Go stats endpoint payloads * Serve Go volume server UI assets * Enforce Go HTTP whitelist guards * Align Rust metrics admin-port test with Go behavior * Format pending Rust server updates * Honor access.ui without per-request JWT checks * Honor keepLocalDatFile in tier upload shortcut * Honor Go remote volume write mode * Load tier backends from master config * Check master config before loading volumes * Remove vif files on volume destroy * Delete remote tier data on volume destroy * Honor vif version defaults and overrides * Reject mismatched vif bytes offsets * Load remote-only tiered volumes * Report Go tail offsets in sync status * Stream remote dat in incremental copy * Honor collection vif for EC shard config * Persist EC expireAtSec in vif metadata * Stream remote volume reads through HTTP * Serve HTTP ranges from backend source * Match Go ReadAllNeedles scan order * Match Go CopyFile zero-stop metadata * Delete EC volumes with collection cleanup * Drop deleted collection metrics * Match Go tombstone ReadNeedleMeta * Match Go TTL parsing: all-digit default to minutes, two-pass fit algorithm * Match Go needle ID/cookie formatting and name size computation * Match Go image ext checks: webp resize only, no crop; empty healthz body * Match Go Prometheus metric names and add missing handler counter constants * Match Go ReplicaPlacement short string parsing with zero-padding * Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS * Add walk_ecx_stats for accurate EC volume file counts and size * Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision * Match Go needle map: unconditional delete counter, fix redb idx walk offset * Add CompactMapSegment overflow panic guard matching Go * Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback * Match Go 304 Not Modified: return bare status with no headers * Match Go JWT error message: use "wrong jwt" instead of detailed error * Match Go read handler bare 400, delete error prefix, download throttle timeout * Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix * Match Go heartbeat: keep is_heartbeating on error, add EC shard identification * Match Go needle ReadBytes V2: tolerate EOF on truncated body * Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard * Match Go DeleteCollection: propagate destroy errors * Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code * Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding * Match Go Guard isWriteActive: combine whitelist and signing key check * Match Go DeleteCollectionMetrics: use partial label matching * Match Go heartbeat: send state-only delta on volume state changes * Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking * Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs * Add read_ec_shard_needle for full needle reconstruction from local EC shards * Make heartbeat master config helpers pub for VolumeCopy preallocation * Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording * Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error * Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling * Add rebuild_ecx_file for EC index reconstruction from data shards * Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation * Add EC volume read and delete support in HTTP handlers * Add per-shard EC mount/unmount, location predicate search, idx directory for EC * Add CheckVolumeDataIntegrity on volume load matching Go * Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type * Match Go heartbeat metric accounting * Match Go duplicate UUID heartbeat retries * Delete expired EC volumes during heartbeat * Match Go volume heartbeat pruning * Honor master preallocate in volume max * Report remote storage info in heartbeats * Emit EC heartbeat deltas on shard changes * Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space * Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns * Match Go VolumeUnmount: idempotent success when volume not found * Match Go TTL Display: return empty string when unit is Empty Go checks `t.Unit == Empty` separately and returns "" for TTLs with nonzero count but Empty unit. Rust only checked is_empty() (count==0 && unit==0), so count>0 with unit=0 would format as "5 " instead of "". * Match Go error behavior for truncated needle data in read_body_v2 Go's readNeedleDataVersion2 returns "index out of range %d" errors (indices 1-7) when needle body or metadata fields are truncated. Rust was silently tolerating truncation and returning Ok. Now returns NeedleError::IndexOutOfRange with the matching index for each field. * Match Go download throttle: return JSON error instead of plain text * Match Go crop params: default x1/y1 to 0 when not provided * Match Go ScrubEcVolume: accumulate total_files from EC shards * Match Go ScrubVolume: count total_files even on scrub error * Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif * Match Go VolumeTailSender: send needle_header on every chunk * Match Go read_super_block: apply replication override from .vif * Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption * Match Go WriteNeedleBlob: dedup check before writing during replication * handlers: use meta-only reads for HEAD * handlers: align range parsing and responses with Go * handlers: align upload parsing with Go * deps: enable webp support * Make 5bytes the default feature for idx entry compatibility * Match Go TTL: preserve original unit when count fits in byte * Fix EC locate_needle: use get_actual_size for full needle size * Fix raw body POST: only parse multipart when Content-Type contains form-data * Match Go ReceiveFile: return protocol errors in response body, not gRPC status * add docs * Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating * Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing * Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection * Match Go privateStoreHandler: use proper JSON error for unsupported methods * Match Go Destroy: add only_empty parameter to reject non-empty volume deletion * Fix compilation: set_read_only_persist and set_writable return () These methods fire-and-forget save_vif internally, so gRPC callers should not try to chain .map_err() on the unit return type. * Match Go SaveVolumeInfo: check writability and propagate errors in save_vif * Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth The gRPC VolumeDelete handler had a pre-check for only_empty but then passed false to store.delete_volume(), bypassing the store-level check. Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same for defense in depth against TOCTOU races (though the store write lock makes this unlikely). * Match Go ProcessRangeRequest: return full content for empty/oversized ranges Go returns nil from ProcessRangeRequest when ranges are empty or total range size exceeds content length, causing the caller to serve the full content as a normal 200 response. Rust was returning an empty 200 body. * Match Go Query: quote JSON keys in output records Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}. Rust was producing invalid JSON with unquoted keys like {name:"Alice"}. * Match Go VolumeCopy: reject when no suitable disk location exists Go returns ErrVolumeNoSpaceLeft when no location matches the disk type and has sufficient space. Rust had an unsafe fallback that silently picked the first location regardless of type or available space. * Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete Go checks v.noWriteOrDelete before proceeding with needle deletion, returning "volume is read only" if true. Rust was skipping this check. * Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors Two fixes: (1) Go prefers HardDriveType disk location for EC volumes, falling back to first location. Returns "no storage location available" when no locations exist. (2) Write failures are now response-level errors (in response body) instead of gRPC status errors, matching Go. * Match Go CopyFile: sync EC volume journal to disk before copying Go calls ecVolume.Sync() before copying EC volume files to ensure the .ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and call it in the CopyFile EC branch. * Match Go readSuperBlock: propagate replication parse errors Go returns an error when parsing the replication string from the .vif file fails. Rust was silently ignoring the parse failure and using the super block's replication as-is. * Match Go TTL expiry: remove append_at_ns > 0 guard Go computes TTL expiry from AppendAtNs without guarding against zero. When append_at_ns is 0, the expiry is epoch + TTL which is in the past, correctly returning NotFound. Rust's extra guard skipped the check, incorrectly returning success for such needles. * Match Go delete_collection: skip volumes with compaction in progress Go checks !v.isCompactionInProgress.Load() before destroying a volume during collection deletion, skipping compacting volumes. Also changed destroy errors to log instead of aborting the entire collection delete. * Match Go MarkReadonly/MarkWritable: always notify master even on local error Go always notifies the master regardless of whether the local set_read_only_persist or set_writable step fails. The Rust code was using `?` which short-circuited on error, skipping the final master notification. Save the result and defer the `?` until after the notify call. * Match Go PostHandler: return 500 for all write errors Go returns 500 (InternalServerError) for all write failures. Rust was returning 404 for volume-not-found and 403 for read-only volumes. * Match Go makeupDiff: validate .cpd compaction revision is old + 1 Go reads the new .cpd file's super block and verifies the compaction revision is exactly old + 1. Rust only validated the old revision. * Match Go VolumeStatus: check data backend before returning status Go checks v.DataBackend != nil before building the status response, returning an error if missing. Rust was silently returning size 0. * Match Go PostHandler: always include mime field in upload response JSON Go always serializes the mime field even when empty ("mime":""). Rust was omitting it when empty due to Option<String> with skip_serializing_if. * Match Go FindFreeLocation: account for EC shards in free slot calculation Go subtracts EC shard equivalents when computing available volume slots. Rust was only comparing volume count, potentially over-counting free slots on locations with many EC shards. * Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods Go records the method as INVALID in metrics for unsupported HTTP methods. Rust was using the actual method name. * Match Go volume: add commit_compact guard and scrub data size validation Two fixes: (1) commit_compact now checks/sets is_compacting flag to prevent concurrent commits, matching Go's CompareAndSwap guard. (2) scrub now validates total needle sizes against .dat file size. * Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check Three fixes: (1) VolumeTailSender now propagates binary search errors instead of silently falling back to start. (2) VolumeEcShardsInfo returns entries for all shard slots including unmounted. (3) VolumeEcShardRead checks .ecx index for deletions instead of .ecj. * Match Go metrics: add BuildInfo gauge and connection tracking functions Go exposes a BuildInfo Prometheus metric with version labels, and tracks open connections via stats.ConnectionOpen/Close. Added both to Rust. * Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid() Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles with size==0 (live but anomalous) can still be deleted. The Rust code was using is_valid() which returns false for size==0, preventing deletion of such needles. * Match Go fitTtlCount: always normalize TTL to coarsest unit Go's fitTtlCount always converts to seconds first, then finds the coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an early return for count<=255 that skipped normalization, producing different binary encodings for the same duration. * Match Go BuildInfo metric: correct name and add missing labels Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build, Name=info) with labels [version, commit, sizelimit, goos, goarch]. Rust had SeaweedFS_volumeServer_buildInfo with only [version]. * Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag - UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty) - UploadResult.contentMd5: only include when request provided Content-MD5 header - Content-MD5 response header: only set when request provided it - DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType) to match Go's protobuf JSON marshaling - Chunk manifest: preserve needle ETag in expanded response headers * Match Go volume: fix version(), integrity check, scrub, and commit_compact - version(): use self.version() instead of self.super_block.version in read_all_needles, check_volume_data_integrity, scan_raw_needles_from to respect volumeInfo.version override - check_volume_data_integrity: initialize healthy_index_size to idx_size (matching Go) and continue on EOF instead of returning error - scrub(): count deleted needles in total_read since they still occupy space in the .dat file (matches Go's totalRead += actualSize for deleted) - commit_compact: clean up .cpd/.cpx files on makeup_diff failure (matches Go's error path cleanup) * Match Go write queue: add 4MB batch byte limit Go's startWorker breaks the batch at either 128 requests or 4MB of accumulated write data. Rust only had the 128-request limit, allowing large writes to accumulate unbounded latency. * Add TTL normalization tests for Go parity verification Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even when count fits in a byte, matching Go's fitTtlCount behavior. * Match Go FindFreeLocation: account for EC shards in free slot calculation Go's free volume count subtracts both regular volumes and EC volumes from max_volume_count. Rust was only counting regular volumes, which could over-report available slots when EC shards are mounted. * Match Go EC volume: mark deletions in .ecx and replay .ecj at startup Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj entries into .ecx on startup, then removes the .ecj file. Rust was only appending to .ecj without marking .ecx, which meant deleted EC needles remained readable via .ecx binary search. This fix: - Opens .ecx in read/write mode (was read-only) - Adds mark_needle_deleted_in_ecx: binary search + in-place write - Calls it from journal_delete before appending to .ecj - Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup * Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14 Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by default, but custom EC configs via .vif can have more shards (up to MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files are checked regardless of EC configuration. * Match Go EC locate: subtract 1 from shard size and use datFileSize override Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to LocateData (shards are padded, -1 avoids overcounting large block rows). When datFileSize is known, Go uses datFileSize/DataShards instead. Rust was passing the raw shard file size without adjustment. * Fix TTL parsing and DiskStatus field names to match Go exactly TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d, not 1w) and errors on count > 255. The previous normalization change was incorrect — Go only normalizes internally via fitTtlCount, not during string parsing. DiskStatus: Go uses encoding/json on protobuf structs, which reads the json struct tags (snake_case: percent_free, percent_used, disk_type), not the protobuf JSON names (camelCase). Revert to snake_case to match Go's actual output. * Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first Match Go's volume_grpc_client_to_master.go behavior: 1. Only trigger leader redirect when the leader address differs from the current master (prevents unnecessary reconnect loops when master confirms its own address). 2. Process duplicated_uuids before leader redirect check, matching Go's ordering where duplicate UUID detection takes priority. * Remove SetState version check to match Go behavior Go's SetState unconditionally applies the state without any version mismatch check. The Rust version had an extra optimistic concurrency check that would reject valid requests from Go clients that don't track versions. * Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w, 24h->1d). The Rust version was preserving the original unit, producing different binary encodings on disk and in heartbeat messages. * Always return Content-MD5 header and JSON field on successful writes Go always sets Content-MD5 in the response regardless of whether the request included it. The Rust version was conditionally including it only when the request provided Content-MD5. * Include name and size in UploadResult JSON even when empty/zero Go's encoding/json always includes empty strings and zero values in the upload response. The Rust version was using skip_serializing_if to omit them, causing JSON structure differences. * Include deleted needles in scan_raw_needles_from to match Go Go's ScanVolumeFileFrom visits ALL needles including deleted ones. Skipping deleted entries during incremental copy would cause tombstones to not be propagated, making deleted files reappear on the receiving side. * Match Go NeedleMap.Delete: always write tombstone to idx file Go's NeedleMap.Delete unconditionally writes a tombstone entry to the idx file and updates metrics, even if the needle doesn't exist or is already deleted. This is important for replication where every delete operation must produce an idx write. The Rust version was skipping the tombstone write for non-existent or already-deleted needles. * Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest * Title-case Seaweed-* pair keys to match Go HTTP header canonicalization * Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType * Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw() * Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount * Include disk-space-low in Volume.is_read_only() matching Go * Log error on CIDR parse failure in whitelist matching Go's glog.Errorf * Log cookie mismatch in gRPC Query matching Go's V(0).Infof * Fix is_expired volume_size comparison to use < matching Go Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than), but Rust used `<=`. This meant Rust would fail to expire a volume that is exactly SUPER_BLOCK_SIZE bytes. * Apply Go's JWT expiry defaults: 10s write, 60s read Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust defaulted to 0 for both, which meant tokens would never expire when security.toml has a signing key but omits expires_after_seconds. * Stop [grpc.volume].ca from overriding [grpc].ca matching Go Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e. the [grpc] section. The [grpc.volume] section only provides cert and key. Rust was also reading ca from [grpc.volume] which would silently override the [grpc].ca value when both were present. * Fix free_volume_count to use EC shard count matching Go Was counting EC volumes instead of EC shards, which underestimates EC space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1. Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount. * Include preallocate in compaction space check matching Go Go uses max(preallocate, estimatedCompactSize) for the free space check. Rust was only using the estimated volume size, which could start a compaction that fails mid-way if preallocate exceeds the volume size. * Check gzip magic bytes before setting Content-Encoding matching Go Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent (data starts with 0x1f 0x8b) before setting Content-Encoding: gzip. Rust only checked Accept-Encoding, which could incorrectly declare gzip encoding for non-gzip compressed data. * Only set upload response name when needle HasName matching Go Go checks reqNeedle.HasName() before setting ret.Name. Rust always set the name from the filename variable, which could return the fid portion of the path as the name for raw PUT requests without a filename. * Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount is 0, treating it as unlimited. Rust was computing effective_free as <= 0 for max==0, rejecting the location. This could fail volume creation during early startup before the first heartbeat adjusts max. * Read lastAppendAtNs from deleted V3 entries in integrity check Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries (verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity). Rust was skipping deleted entries, which could result in a stale last_append_at_ns if the last index entry is a deletion. * Return empty body for empty/oversized range requests matching Go Go's ProcessRangeRequest returns nil (empty body, 200 OK) when parsed ranges are empty or combined range size exceeds total content size. The Rust buffered path incorrectly returned the full file data for both cases. The streaming path already handled this correctly. * Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex() (ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls vs.store.ScrubEcVolume(). Rust was ignoring the mode and always running verify_ec_shards. Now INDEX mode checks ecx index integrity (sorted overlap detection + file size validation) without shard I/O, while LOCAL/FULL modes run the existing shard verification. * Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week} which displays as "1w". Both Go and Rust normalize identically. * Add version mismatch check to SetState matching Go's State.Update Go's State.Update compares the incoming version with the stored version and returns "version mismatch" error if they differ. This provides optimistic concurrency control. The Rust implementation was accepting any version unconditionally. * Use unquoted keys in Query JSON output matching Go's json.ToJson Go's json.ToJson produces records with unquoted keys like {score:12} not {"score":12}. This is a custom format used internally by SeaweedFS for query results. * Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount which converts 7 days to 1 week. * Include ETag header in 304 Not Modified responses matching Go behavior Go sets ETag on the response writer (via SetEtag) before the If-Modified-Since and If-None-Match conditional checks, so both 304 response paths include the ETag header. The Rust implementation was only adding ETag to 200 responses. * Remove needle-name fallback in chunk manifest filename resolution Go's tryHandleChunkedFile only falls back from URL filename to manifest name. Rust had an extra fallback to needle.name that Go does not perform, which could produce different Content-Disposition filenames for chunk manifests. * Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5 Go's jwt.ParseWithClaims validates the nbf claim when present, rejecting tokens whose nbf is in the future. The Rust jsonwebtoken crate defaults validate_nbf to false, so tokens with future nbf were incorrectly accepted. * Set isHeartbeating to true at startup matching Go's VolumeServer init Go unconditionally sets isHeartbeating: true in the VolumeServer struct literal. Rust was starting with false when masters are configured, causing /healthz to return 503 until the first heartbeat succeeds. * Call store.close() on shutdown matching Go's Shutdown() Go's Shutdown() calls vs.store.Close() which closes all volumes and flushes file handles. The Rust server was relying on process exit for cleanup, which could leave data unflushed. * Include server ID in maintenance mode error matching Go's format Go returns "volume server %s is in maintenance mode" with the store ID. Rust was returning a generic "maintenance mode" message. * Fix DiskType test: use HardDrive variant matching Go's HddType="" Go maps both "" and "hdd" to HardDriveType (empty string). The Rust enum variant is HardDrive, not Hdd. The test referenced a nonexistent Hdd variant causing compilation failure. * Do not include ETag in 304 responses matching Go's GetOrHeadHandler Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match 304 return paths, so Go's 304 responses do not include the ETag header. The Rust code was incorrectly including ETag in both 304 response paths. * Return 400 on malformed query strings in PostHandler matching Go's ParseForm Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when the query string is malformed. Rust was silently falling back to empty query params via unwrap_or_default(). * Load EC volume version from .vif matching Go's NewEcVolume Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif file. Rust was always using Version::current() (V3), which would produce wrong needle actual size calculations for volumes created with V1 or V2. * Sync .ecx file before close matching Go's EcVolume.Close Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion marks are flushed to disk. Without this, deletion marks written via MarkNeedleDeleted could be lost on crash. * Validate SuperBlock extra data size matching Go's Bytes() guard Go checks extraSize > 256256-2 and calls glog.Fatalf to prevent corrupt super block headers. Rust was silently truncating via u16 cast, which would write an incorrect extra_size field. Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98 Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable to unauthenticated remote DoS via panic in QUIC transport parameter parsing. * Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server Go's r.FormValue() cannot read multipart text fields after r.MultipartReader() consumes the body, so ts/ttl sent as multipart form fields only work with the Rust volume server. Skip this test when VOLUME_SERVER_IMPL != "rust" to fix CI failure. * Flush .ecx in EC volume sync_to_disk matching Go's Sync() Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index to disk. The Rust version only flushed .ecj, leaving in-place deletion marks in .ecx unpersisted until close(). This could cause data inconsistency if the server crashes after marking a needle deleted in .ecx but before close(). * Remove .vif file in EC volume destroy matching Go's Destroy() Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust version only removed .ecx and .ecj, leaving orphaned .vif files on disk after EC volume destruction (e.g., after TTL expiry). * Fix is_expired to use <= for SuperBlockSize check matching Go Go checks contentSize <= SuperBlockSize to detect empty volumes (no needles). Rust used < which would incorrectly allow a volume with exactly SuperBlockSize bytes (header only, no data) to proceed to the TTL expiry check and potentially be marked as expired. * Fix read_append_at_ns to read timestamps from tombstone entries Go reads the full needle body for all entries including tombstones (deleted needles with size=0) to extract the actual AppendAtNs timestamp. The Rust version returned 0 early for size <= 0 entries, which would cause the binary search in incremental copy to produce incorrect results for positions containing deleted needles. Now uses get_actual_size to compute the on-disk size (which handles tombstones correctly) and only returns 0 when the actual size is 0. * Add X-Request-Id response header matching Go's requestIDMiddleware Go sets both X-Request-Id and x-amz-request-id response headers. The Rust server only set x-amz-request-id, missing X-Request-Id. * Add skip_serializing_if for UploadResult name and size fields Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty", omitting these fields from JSON when they are zero values (empty string / 0). The Rust struct always serialized them, producing "name":"" and "size":0 where Go would omit them. * Support JSONP/pretty-print for write success responses Go's writeJsonQuiet checks for callback (JSONP) and pretty query parameters on all JSON responses including write success. The Rust write success path used axum::Json directly, bypassing JSONP and pretty-print support. Now uses json_result_with_query to match Go. * Include actual limit in file size limit error message Go returns "file over the limited %d bytes" with the actual limit value included. Rust returned a generic "file size limit exceeded" without the limit value, making it harder to debug. * Extract extension from 2-segment URL paths for image operations Go's parseURLPath extracts the file extension from all URL formats including 2-segment paths like /vid,fid.jpg. The Rust version only handled 3-segment paths (/vid/fid/filename.ext), so extensions in 2-segment paths were lost. This caused image resize/crop operations requested via query params to be silently skipped for those paths. * Add size_hint to TrackedBody so throttled downloads get Content-Length TrackedBody (used for download throttling) did not implement size_hint(), causing HTTP/1.1 to fall back to chunked transfer encoding instead of setting Content-Length. Go always sets Content-Length explicitly for non-range responses. * Add Last-Modified, pairs, and S3 headers to chunk manifest responses Go sets Last-Modified, needle pairs, and S3 pass-through headers on the response writer BEFORE calling tryHandleChunkedFile. Since the Rust chunk manifest handler created fresh response headers and returned early, these headers were missing from chunk manifest responses. Now passes last_modified_str into the chunk manifest handler and applies pairs and S3 pass-through query params (response-cache-control, response-content-encoding, etc.) to the chunk manifest response headers. * Fix multipart fallback to use first part data when no filename Go reads the first part's data unconditionally, then looks for a part with a filename. If none found, Go uses the first part's data (with empty filename). Rust only captured parts with filenames, so when no part had a filename it fell back to the raw multipart body bytes (including boundary delimiters), producing corrupt needle data. * Set HasName and HasMime flags for empty values matching Go Go's CreateNeedleFromRequest sets HasName and HasMime flags even when the filename or MIME type is empty (len < 256 is true for len 0). Rust skipped empty values, causing the on-disk needle format to differ: Go-written needles include extra bytes for the empty name/mime size fields, changing the serialized needle size in the idx entry. This ensures binary format compatibility between Go and Rust servers. * Add is_stopping guard to vacuum_volume_commit matching Go Go's CommitCompactVolume (store_vacuum.go L53-54) checks s.isStopping before committing compaction to prevent file swaps during shutdown. The Rust handler was missing this check, which could allow compaction commits while the server is stopping. * Remove disk_type from required status fields since Go omits it Go's default DiskType is "" (HardDriveType), and protobuf's omitempty tag causes empty strings to be dropped from JSON output. * test: honor rust env in dual volume harness * grpc: notify master after volume lifecycle changes * http: proxy to replicas before download-limit timeout * test: pass readMode to rust volume harnesses * fix store free-location predicate selection * fix volume copy disk placement and heartbeat notification * fix chunk manifest delete replication * fix write replication to survive client disconnects * fix download limit proxy and wait flow * fix crop gating for streamed reads * fix upload limit wait counter behavior * fix chunk manifest image transforms * fix has_resize_ops to check width/height > 0 instead of is_some() Go's shouldResizeImages condition is `width > 0 \|\| height > 0`, so `?width=0` correctly evaluates to false. Rust was using `is_some()` which made `?width=0` evaluate to true, unnecessarily disabling streaming reads for those requests. * fix Content-MD5 to only compute and return when provided by client Go only computes the MD5 of uncompressed data when a Content-MD5 header or multipart field is provided. Rust was always computing and returning it. Also fix the mismatch error message to include size, matching Go's format. * fix save_vif to compute ExpireAtSec from TTL Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds when the volume has a TTL. The save_vif path (used by set_read_only and set_writable) was missing this computation, causing .vif files to be written without the correct expiration timestamp for TTL volumes. * fix set_writable to not modify no_write_can_delete Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists. Rust was additionally setting no_write_can_delete=has_remote_file, which could incorrectly change the write mode for remote-file volumes when the master explicitly asks to make the volume writable. * fix write_needle_blob_and_index to error on too-small V3 blob Go returns an error when the needle blob is too small for timestamp patching. Rust was silently skipping the patch and writing the blob with a stale/zero timestamp, which could cause data integrity issues during incremental replication that relies on AppendAtNs ordering. * fix VolumeEcShardsToVolume to validate dataShards range Go validates that dataShards is > 0 and <= MaxShardCount before proceeding with EC-to-volume reconstruction. Without this check, a zero or excessively large data_shards value could cause confusing downstream failures. * fix destroy to use VolumeError::NotEmpty instead of generic Io error The dedicated NotEmpty variant exists in the enum but was not being used. This makes error matching consistent with Go's ErrVolumeNotEmpty. * fix SetState to persist state to disk with rollback on failure Go's State.Update saves VolumeServerState to a state.pb file after each SetState call, and rolls back the in-memory state if persistence fails. Rust was only updating in-memory atomics, so maintenance mode would be lost on server restart. Now saves protobuf-encoded state.pb and loads it on startup. * fix VolumeTierMoveDatToRemote to close local dat backend after upload Go calls v.LoadRemoteFile() after saving volume info, which closes the local DataBackend before transitioning to remote storage. Without this, the volume holds a stale file handle to the deleted local .dat file, causing reads to fail until server restart. * fix VolumeTierMoveDatFromRemote to close remote dat backend after download Go calls v.DataBackend.Close() and sets DataBackend=nil after removing the remote file reference. Without this, the stale remote backend state lingers and reads may not discover the newly downloaded local .dat file until server restart. * fix redirect to use internal url instead of public_url Go's proxyReqToTargetServer builds the redirect Location header from loc.Url (the internal URL), not publicUrl. Using public_url could cause redirect failures when internal and external URLs differ. * fix redirect test and add state_file_path to integration test Update redirect unit test to expect internal url (matching the previous fix). Add missing state_file_path field to the integration test VolumeServerState constructor. * fix FetchAndWriteNeedle to await all writes before checking errors Go uses a WaitGroup to await all writes (local + replicas) before checking errors. Rust was short-circuiting on local write failure, which could leave replica writes in-flight without waiting for completion. * fix shutdown to send deregister heartbeat before pre_stop delay Go's StopHeartbeat() closes stopChan immediately on interrupt, causing the heartbeat goroutine to send the deregister heartbeat right away, before the preStopSeconds delay. Rust was only setting is_stopping=true without waking the heartbeat loop, so the deregister was delayed until after the pre_stop sleep. Now we call volume_state_notify.notify_one() to wake the heartbeat immediately. * fix heartbeat response ordering to check duplicate UUIDs first Go processes heartbeat responses in this order: DuplicatedUuids first, then volume options (prealloc/size limit), then leader redirect. Rust was applying volume options before checking for duplicate UUIDs, which meant volume option changes would take effect even when the response contained a duplicate UUID error that should cause an immediate return. * the test thread was blocked * fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check logic error). * fix: match Go Content-MD5 mismatch error message format Go uses "Content-MD5 did not match md5 of file data expected [X] received [Y] size Z" while Rust had a shorter format. Match the exact Go error string so clients see identical messages. * fix: match Go Bearer token length check (> 7, not >= 7) Go requires len(bearer) > 7 ensuring at least one char after "Bearer ". Rust used >= 7 which would accept an empty token. * fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4 aws-sdk-s3's default "rustls" feature enables tls-rustls in aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21 → rustls-webpki 0.101.7, moderate CRL advisory). Replace with explicit default-https-client which uses only rustls 0.23 / rustls-webpki 0.103.9. * fix: use uploaded filename for auto-compression extension detection Go extracts the file extension from pu.FileName (the uploaded filename) for auto-compression decisions. Rust was using the URL path, which typically has no extension for SeaweedFS file IDs. * fix: add CRC legacy Value() backward-compat check on needle read Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) != crc.Value(). The Value() path is a deprecated transform for compat with seaweed versions prior to commit `056c480eb`. Rust had the legacy_value() method but wasn't using it in validation. * fix: remove /stats/* endpoints to match Go (commented out since L130) Go's volume_server.go has the /stats/counter, /stats/memory, and /stats/disk endpoints commented out (lines 130-134). Remove them from the Rust router along with the now-unused whitelist_guard middleware. * fix: filter application/octet-stream MIME for chunk manifests Go's tryHandleChunkedFile (L334) filters out application/octet-stream from chunk manifest MIME types, falling back to extension-based detection. Rust was returning the stored MIME as-is for manifests. * fix: VolumeMarkWritable returns error before notifying master Go returns early at L200 if MarkVolumeWritable fails, before reaching the master notification at L206. Rust was notifying master even on failure, creating inconsistent state where master thinks the volume is writable but local marking failed. * fix: check volume existence before maintenance in MarkReadonly/Writable Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255) look up the volume first, then call makeVolumeReadonly/Writable which checks maintenance. Rust was checking maintenance first, returning "maintenance mode" instead of "not found" for missing volumes. * feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360) Add the mark_broken_volumes_readonly flag from PR #8360: - Sync proto field (tag 3) to local volume_server.proto - After scrubbing, if flag is set, call makeVolumeReadonly on each broken volume (notify master, mark local readonly, notify again) - Collect errors via joined error semantics matching Go's errors.Join - Factor out make_volume_readonly helper reused by both VolumeMarkReadonly and ScrubVolume Also refactors VolumeMarkReadonly to use the shared helper. * fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4) CRL Distribution Point matching logic fix for moderate severity advisory about CRLs not considered authoritative. * test: update integration tests for removed /stats/* endpoints Replace tests that expected /stats/* routes to return 200/401 with tests confirming they now fall through to the store handler (400), matching Go's commented-out stats endpoints. * docs: fix misleading comment about default offset feature The comment said "4-byte offsets unless explicitly built with 5-byte support" but the default feature enables 5bytes. This is intentional for production parity with Go -tags 5BytesOffset builds. Fix the comment to match reality.	2026-03-26 17:24:35 -07:00
Lisandro Pin	e5cf2d2a19	Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. (#8360 ) * Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. Also exposes this option in the shell `volume.scrub` command. * Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`. 417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`, so calling them for invalid volume IDs will actually yield that error, instead of checking maintnenance mode first.	2026-03-26 10:20:57 -07:00
Chris Lu	94bfa2b340	mount: stream all filer mutations over single ordered gRPC stream (#8770 ) * filer: add StreamMutateEntry bidi streaming RPC Add a bidirectional streaming RPC that carries all filer mutation types (create, update, delete, rename) over a single ordered stream. This eliminates per-request connection overhead for pipelined operations and guarantees mutation ordering within a stream. The server handler delegates each request to the existing unary handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy stream adapter for rename operations to reuse StreamRenameEntry logic. The is_last field signals completion for multi-response operations (rename sends multiple events per request; create/update/delete always send exactly one response with is_last=true). * mount: add streaming mutation multiplexer (streamMutateMux) Implement a client-side multiplexer that routes all filer mutation RPCs (create, update, delete, rename) over a single bidirectional gRPC stream. Multiple goroutines submit requests through a send channel; a dedicated sendLoop serializes them on the stream; a recvLoop dispatches responses to waiting callers via per-request channels. Key features: - Lazy stream opening on first use - Automatic reconnection on stream failure - Permanent fallback to unary RPCs if filer returns Unimplemented - Monotonic request_id for response correlation - Multi-response support for rename operations (is_last signaling) The mux is initialized on WFS and closed during unmount cleanup. No call sites use it yet — wiring comes in subsequent commits. * mount: route CreateEntry and UpdateEntry through streaming mux Wire all CreateEntry call sites to use wfs.streamCreateEntry() which routes through the StreamMutateEntry stream when available, falling back to unary RPCs otherwise. Also wire Link's UpdateEntry calls through wfs.streamUpdateEntry(). Updated call sites: - flushMetadataToFiler (file flush after write) - Mkdir (directory creation) - Symlink (symbolic link creation) - createRegularFile non-deferred path (Mknod) - flushFileMetadata (periodic metadata flush) - Link (hard link: update source + create link + rollback) * mount: route UpdateEntry and DeleteEntry through streaming mux Wire remaining mutation call sites through the streaming mux: - saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry - Unlink → streamDeleteEntry (replaces RemoveWithResponse) - Rmdir → streamDeleteEntry (replaces RemoveWithResponse) All filer mutations except Rename now go through StreamMutateEntry when the filer supports it, with automatic unary RPC fallback. * mount: route Rename through streaming mux Wire Rename to use streamMutate.Rename() when available, with fallback to the existing StreamRenameEntry unary stream. The streaming mux sends rename as a StreamRenameEntryRequest oneof variant. The server processes it through the existing rename logic and sends multiple StreamRenameEntryResponse events (one per moved entry), with is_last=true on the final response. All filer mutations now go through a single ordered stream. * mount: fix stream mux connection ownership WithGrpcClient(streamingMode=true) closes the gRPC connection when the callback returns, destroying the stream. Own the connection directly via pb.GrpcDial so it stays alive for the stream's lifetime. Close it explicitly in recvLoop on stream failure and in Close on shutdown. * mount: fix rename failure for deferred-create files Three fixes for rename operations over the streaming mux: 1. lookupEntry: fall back to local metadata store when filer returns "not found" for entries in uncached directories. Files created with deferFilerCreate=true exist only in the local leveldb store until flushed; lookupEntry skipped the local store when the parent directory had never been readdir'd, causing rename to fail with ENOENT. 2. Rename: wait for pending async flushes and force synchronous flush of dirty metadata before sending rename to the filer. Covers the writebackCache case where close() defers the flush to a background worker that may not complete before rename fires. 3. StreamMutateEntry: propagate rename errors from server to client. Add error/errno fields to StreamMutateEntryResponse so the mount can map filer errors to correct FUSE status codes instead of silently returning OK. Also fix the existing Rename error handler which could return fuse.OK on unrecognized errors. * mount: fix streaming mux error handling, sendLoop lifecycle, and fallback Address PR review comments: 1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for create/update/delete errors, not just rename. Previously update errors were silently dropped and create/delete errors were only in nested response fields that the client didn't check. 2. Client: check nested error fields in CreateEntry (ErrorCode, Error) and DeleteEntry (Error) responses, matching CreateEntryWithResponse behavior. 3. Fix sendLoop lifecycle: give each stream generation a stopSend channel. recvLoop closes it on error to stop the paired sendLoop. Previously a reconnect left the old sendLoop draining sendCh, breaking ordering. 4. Transparent fallback: stream helpers and doRename fall back to unary RPCs on transport errors (ErrStreamTransport), including the first Unimplemented from ensureStream. Previously the first call failed instead of degrading. 5. Filer rotation in openStream: try all filer addresses on dial failure, matching WithFilerClient behavior. Stop early on Unimplemented. 6. Pass metadata-bearing context to StreamMutateEntry RPC call so sw-client-id header is actually sent. 7. Gate lookupEntry local-cache fallback on open dirty handle or pending async flush to avoid resurrecting deleted/renamed entries. 8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil). 9. Use string matching for rename error-to-errno mapping in the mount to stay portable across Linux/macOS (numeric errno values differ). * mount: make failAllPending idempotent with delete-before-close Change failAllPending to collect pending entries into a local slice (deleting from the sync.Map first) before closing channels. This prevents double-close panics if called concurrently. Also remove the unused err parameter. * mount: add stream generation tracking and teardownStream Introduce a generation counter on streamMutateMux that increments each time a new stream is created. Requests carry the generation they were enqueued for so sendLoop can reject stale requests after reconnect. Add teardownStream(gen) which is idempotent (only acts when gen matches current generation and stream is non-nil). Both sendLoop and recvLoop call it on error, replacing the inline cleanup in recvLoop. sendLoop now actively triggers teardown on send errors instead of silently exiting. ensureStream waits for the prior generation's recvDone before creating a new stream, ensuring all old pending waiters are failed before reconnect. recvLoop now takes the stream, generation, and recvDone channel as parameters to avoid accessing shared fields without the lock. * mount: harden Close to prevent races with teardownStream Nil out stream, cancel, and grpcConn under the lock so that any concurrent teardownStream call from recvLoop/sendLoop becomes a no-op. Call failAllPending before closing sendCh to unblock waiters promptly. Guard recvDone with a nil check for the case where Close is called before any stream was ever opened. * mount: make errCh receive ctx-aware in doUnary and Rename Replace the blocking <-sendReq.errCh with a select that also observes ctx.Done(). If sendLoop exits via stopSend without consuming a buffered request, the caller now returns ctx.Err() instead of blocking forever. The buffered errCh (capacity 1) ensures late acknowledgements from sendLoop don't block the sender. * mount: fix sendLoop/Close race and recvLoop/teardown pending channel race Three related fixes: 1. Stop closing sendCh in Close(). Closing the shared producer channel races with callers who passed ensureStream() but haven't sent yet, causing send-on-closed-channel panics. sendCh is now left open; ensureStream checks m.closed to reject new callers. 2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh() on exit so buffered requests get an ErrStreamTransport on their errCh instead of blocking forever. Close() drains again for any stragglers enqueued between sendLoop's drain and the final shutdown. 3. Move failAllPending from teardownStream into recvLoop's defer. teardownStream (called from sendLoop on send error) was closing pending response channels while recvLoop could be between pending.Load and the channel send — a send-on-closed-channel panic. recvLoop is now the sole closer of pending channels, eliminating the race. Close() waits on recvDone (with cancel() to guarantee Recv unblocks) so pending cleanup always completes. * filer/mount: add debug logging for hardlink lifecycle Add V(0) logging at every point where a HardLinkId is created, stored, read, or deleted to trace orphaned hardlink references. Logging covers: - gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId - FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId - handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count - setHardLink: KvPut with blob size - maybeReadHardLink: V(1) on read attempt and successful decode - DeleteHardLink: counter decrement/deletion events - Mount Link(): when NewHardLinkId is generated and link is created This helps diagnose how a git pack .rev file ended up with a HardLinkId during a clone (no hard links should be involved). * test: add git clone/pull integration test for FUSE mount Shell script that exercises git operations on a SeaweedFS mount: 1. Creates a bare repo on the mount 2. Clones locally, makes 3 commits, pushes to mount 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (files, content, commit hashes) 5. Pushes 2 more commits with renames and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real changes 8. Verifies file content, renames, deletes after pull 9. Checks git log integrity and clean status 27 assertions covering file existence, content, commit hashes, file counts, renames, deletes, and git status. Run against any existing mount: bash test-git-on-mount.sh /path/to/mount * test: add git clone/pull FUSE integration test to CI suite Add TestGitOperations to the existing fuse_integration test framework. The test exercises git's full file operation surface on the mount: 1. Creates a bare repo on the mount (acts as remote) 2. Clones locally, makes 3 commits (files, bulk data, renames), pushes 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (content, commit hash, file count) 5. Pushes 2 more commits with new files, renames, and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real fast-forward changes 8. Verifies post-pull state: content, renames, deletes, file counts 9. Checks git log integrity (5 commits) and clean status Runs automatically in the existing fuse-integration.yml CI workflow. * mount: fix permission check with uid/gid mapping The permission checks in createRegularFile() and Access() compared the caller's local uid/gid against the entry's filer-side uid/gid without applying the uid/gid mapper. With -map.uid 501:0, a directory created as uid 0 on the filer would not match the local caller uid 501, causing hasAccess() to fall through to "other" permission bits and reject write access (0755 → other has r-x, no w). Fix: map entry uid/gid from filer-space to local-space before the hasAccess() call so both sides are in the same namespace. This fixes rsync -a failing with "Permission denied" on mkstempat when using uid/gid mapping. * mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to restore local uid/gid, but `outputPbEntry` writes the kernel response before the function returns — so the kernel received filer-side uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent child operations (mkdir, create) because the caller uid (501) doesn't match the directory owner (0), and "other" bits (0755 → r-x) lack write permission. Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal before outputPbEntry, so the kernel gets local uid/gid. Also add nil guards for UidGidMapper in Access and createRegularFile to prevent panics in tests that don't configure a mapper. This fixes rsync -a "Permission denied" on mkpathat for nested directories when using uid/gid mapping. * mount: fix Link outputting filer-side uid/gid to kernel, add nil guards Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink: the kernel received filer-side uid/gid because the defer hadn't run yet when outputPbEntry wrote the response. Also add nil guards for UidGidMapper in Access and createRegularFile so tests without a mapper don't panic. Audit of all outputPbEntry/outputFilerEntry call sites: - Mkdir: fixed in prior commit (explicit map before output) - Symlink: fixed in prior commit (explicit map before output) - Link: fixed here (explicit map before output) - Create (existing file): entry from maybeLoadEntry (already mapped) - Create (deferred): entry has local uid/gid (never mapped to filer) - Create (non-deferred): createRegularFile defer runs before return - Mknod: createRegularFile defer runs before return - Lookup: entry from lookupEntry (already mapped) - GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped) - readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped) - saveEntry: no kernel output - flushMetadataToFiler: no kernel output - flushFileMetadata: no kernel output * test: fix git test for same-filesystem FUSE clone When both the bare repo and working clone live on the same FUSE mount, git's local transport uses hardlinks and cross-repo stat calls that fail on FUSE. Fix: - Use --no-local on clone to disable local transport optimizations - Use reset --hard instead of checkout to stay on branch - Use fetch + reset --hard origin/<branch> instead of git pull to avoid local transport stat failures during fetch * adjust logging * test: use plain git clone/pull to exercise real FUSE behavior Remove --no-local and fetch+reset workarounds. The test should use the same git commands users run (clone, reset --hard, pull) so it reveals real FUSE issues rather than hiding them. * test: enable V(1) logging for filer/mount and collect logs on failure - Run filer and mount with -v=1 so hardlink lifecycle logs (V(0): create/delete/insert, V(1): read attempts) are captured - On test failure, automatically dump last 16KB of all process logs (master, volume, filer, mount) to test output - Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload - Update CI workflow to upload SeaweedFS process logs alongside test output * mount: clone entry for filer flush to prevent uid/gid race flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which returns the file handle's live proto entry pointer, then mutated it in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable) could observe filer-side uid/gid (e.g., 0:0) on the file handle entry and return it to the kernel. The kernel caches these attributes, so subsequent opens by the local user (uid 501) fail with EACCES. Fix: proto.Clone the entry before mapping uid/gid for the filer request. The file handle's live entry is never mutated, so concurrent Lookup always sees local uid/gid. This fixes the intermittent "Permission denied" on .git/FETCH_HEAD after the first git pull on a mount with uid/gid mapping. * mount: add debug logging for stale lock file investigation Add V(0) logging to trace the HEAD.lock recreation issue: - Create: log when O_EXCL fails (file already exists) with uid/gid/mode - completeAsyncFlush: log resolved path, saved path, dirtyMetadata, isDeleted at entry to trace whether async flush fires after rename - flushMetadataToFiler: log the dir/name/fullpath being flushed This will show whether the async flush is recreating the lock file after git renames HEAD.lock → HEAD. * mount: prevent async flush from recreating renamed .lock files When git renames HEAD.lock → HEAD, the async flush from the prior close() can run AFTER the rename and re-insert HEAD.lock into the meta cache via its CreateEntryRequest response event. The next git pull then sees HEAD.lock and fails with "File exists". Fix: add isRenamed flag on FileHandle, set by Rename before waiting for the pending async flush. The async flush checks this flag and skips the metadata flush for renamed files (same pattern as isDeleted for unlinked files). The data pages still flush normally. The Rename handler flushes deferred metadata synchronously (Case 1) before setting isRenamed, ensuring the entry exists on the filer for the rename to proceed. For already-released handles (Case 2), the entry was created by a prior flush. * mount: also mark renamed inodes via entry.Attributes.Inode fallback When GetInode fails (Forget already removed the inode mapping), the Rename handler couldn't find the pending async flush to set isRenamed. The async flush then recreated the .lock file on the filer. Fix: fall back to oldEntry.Attributes.Inode to find the pending async flush when the inode-to-path mapping is gone. Also extract MarkInodeRenamed into a method on FileHandleToInode for clarity. * mount: skip async metadata flush when saved path no longer maps to inode The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock because neither GetInode nor oldEntry.Attributes.Inode could find the inode (Forget already evicted the mapping, and the entry's stored inode was 0). Add a direct check in completeAsyncFlush: before flushing metadata, verify that the saved path still maps to this inode in the inode-to-path table. If the path was renamed or removed (inode mismatch or not found), skip the metadata flush to avoid recreating a stale entry. This catches all rename cases regardless of whether the Rename handler could set the isRenamed flag. * mount: wait for pending async flush in Unlink before filer delete Unlink was deleting the filer entry first, then marking the draining async-flush handle as deleted. The async flush worker could race between these two operations and recreate the just-unlinked entry on the filer. This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to persist after git pull, breaking subsequent git operations. Move the isDeleted marking and add waitForPendingAsyncFlush() before the filer delete so any in-flight flush completes first. Even if the worker raced past the isDeleted check, the wait ensures it finishes before the filer delete cleans up any recreated entry. * mount: reduce async flush and metadata flush log verbosity Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for every file close with writebackCache and are too noisy for normal use. * filer: reduce hardlink debug log verbosity from V(0) to V(4) HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and filer_grpc_server fire on every hardlinked file operation (git pack files use hardlinks extensively) and produce excessive noise. * mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check - weedfs_link.go: hardlink creation logs V(0) → V(4) - weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1) - empty_folder_cleaner.go: "not empty" check log V(0) → V(4) * filer: handle missing hardlink KV as expected, not error A "kv: not found" on hardlink read is normal when the link blob was already cleaned up but a stale entry still references it. Log at V(1) for not-found; keep Error level for actual KV failures. * test: add waitForDir before git pull in FUSE git operations test After git reset --hard, the FUSE mount's metadata cache may need a moment to settle on slow CI. The git pull subprocess (unpack-objects) could fail to stat the working directory. Poll for up to 5s. * Update git_operations_test.go * wait * test: simplify FUSE test framework to use weed mini Replace the 4-process setup (master + volume + filer + mount) with 2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies startup, reduces port allocation, and is faster on CI. * test: fix mini flag -admin → -admin.ui	2026-03-25 20:06:34 -07:00
Chris Lu	0b3867dca3	filer: add structured error codes to CreateEntryResponse (#8767 ) * filer: add FilerError enum and error_code field to CreateEntryResponse Add a machine-readable error code alongside the existing string error field. This follows the precedent set by PublishMessageResponse in the MQ broker proto. The string field is kept for human readability and backward compatibility. Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE, EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS. * filer: add sentinel errors and error code mapping in filer_pb Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in the filer_pb package so both the filer and consumers can reference them without circular imports. Add FilerErrorToSentinel() to map proto error codes to sentinels, and update CreateEntryWithResponse() to check error_code first, falling back to the string-based path for backward compatibility with old servers. * filer: return wrapped sentinel errors and set proto error codes Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using %w). This preserves errors.Is() traversal on the server side. In the gRPC CreateEntry handler, map sentinel errors to the corresponding FilerError proto codes using errors.Is(), setting both resp.Error (string, for backward compat) and resp.ErrorCode (enum). * S3: use errors.Is() with filer sentinels instead of string matching Replace fragile string-based error matching in filerErrorToS3Error and other S3 API consumers with errors.Is() checks against filer_pb sentinel errors. This works because the updated CreateEntryWithResponse helper reconstructs sentinel errors from the proto FilerError code. Update iceberg stage_create and metadata_files to check resp.ErrorCode instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is() for the already-exists check. String matching is retained only for non-filer errors (gRPC transport errors, checksum validation) that don't go through CreateEntryResponse. * filer: remove backward-compat string fallbacks for error codes Clients and servers are always deployed together, so there is no need for backward-compatibility fallback paths that parse resp.Error strings when resp.ErrorCode is unset. Simplify all consumers to rely solely on the structured error code. * iceberg: ensure unknown non-OK error codes are not silently ignored When FilerErrorToSentinel returns nil for an unrecognized error code, return an error including the code and message rather than falling through to return nil. * filer: fix redundant error message and restore error wrapping in helper Use request path instead of resp.Error in the sentinel error format string to avoid duplicating the sentinel message (e.g. "entry already exists: entry already exists"). Restore %w wrapping with errors.New() in the fallback paths so callers can use errors.Is()/errors.As(). * filer: promote file to directory on path conflict instead of erroring S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object) to coexist because S3 has a flat key space. When ensureParentDirectoryEntry finds a parent path that is a file instead of a directory, promote it to a directory by setting ModeDir while preserving the original content and chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry type-change guard. This fixes the S3 compatibility test failures where creating overlapping keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.	2026-03-24 17:08:22 -07:00
Chris Lu	81369b8a83	improve: large file sync throughput for remote.cache and filer.sync (#8676 ) * improve large file sync throughput for remote.cache and filer.sync Three main throughput improvements: 1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file instead of always starting at 5MB. A 500MB file now uses ~16MB chunks (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk overhead (volume assign, gRPC call, needle write) by 3x. 2. Configurable concurrency at every layer: - remote.cache chunk concurrency: -chunkConcurrency flag (default 8) - remote.cache S3 download concurrency: -downloadConcurrency flag (default raised from 1 to 5 per chunk) - filer.sync chunk concurrency: -chunkConcurrency flag (default 32) 3. S3 multipart download concurrency raised from 1 to 5: the S3 manager downloader was using Concurrency=1, serializing all part downloads within each chunk. This alone can 5x per-chunk download speed. The concurrency values flow through the gRPC request chain: shell command → CacheRemoteObjectToLocalClusterRequest → FetchAndWriteNeedleRequest → S3 downloader Zero values in the request mean "use server defaults", maintaining full backward compatibility with existing callers. Ref #8481 * fix: use full maxMB for chunk size cap and remove loop guard Address review feedback: - Use full maxMB instead of maxMB/2 for maxChunkSize to avoid unnecessarily limiting chunk size for very large files. - Remove chunkSize < maxChunkSize guard from the safety loop so it can always grow past maxChunkSize when needed to stay under 1000 chunks (e.g., extremely large files with small maxMB). * address review feedback: help text, validation, naming, docs - Fix help text for -chunkConcurrency and -downloadConcurrency flags to say "0 = server default" instead of advertising specific numeric defaults that could drift from the server implementation. - Validate chunkConcurrency and downloadConcurrency are within int32 range before narrowing, returning a user-facing error if out of range. - Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions. - Add doc comment to SetChunkConcurrency noting it must be called during initialization before replication goroutines start. - Replace doubling loop in chunk size safety check with direct ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap. * address Copilot review: clamp concurrency, fix chunk count, clarify proto docs - Use ceiling division for chunk count check to avoid overcounting when file size is an exact multiple of chunk size. - Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024 at filer, max 64 at volume server) to prevent excessive goroutines. - Always use ReadFileWithConcurrency when the client supports it, falling back to the implementation's default when value is 0. - Clarify proto comments that download_concurrency only applies when the remote storage client supports it (currently S3). - Include specific server defaults in help text (e.g., "0 = server default 8") so users see the actual values in -h output. * fix data race on executionErr and use %w for error wrapping - Protect concurrent writes to executionErr in remote.cache worker goroutines with a sync.Mutex to eliminate the data race. - Use %w instead of %v in volume_grpc_remote.go error formatting to preserve the error chain for errors.Is/errors.As callers.	2026-03-17 16:49:56 -07:00
Chris Lu	acea36a181	filer: add conditional update preconditions (#8647 ) * filer: add conditional update preconditions * iceberg: tighten metadata CAS preconditions	2026-03-16 12:33:32 -07:00
Chris Lu	a838661b83	feat(plugin): EC shard balance handler for plugin worker (#8629 ) * feat(ec_balance): add TaskTypeECBalance constant and protobuf definitions Add the ec_balance task type constant to both topology and worker type systems. Define EcBalanceTaskParams, EcShardMoveSpec, and EcBalanceTaskConfig protobuf messages for EC shard balance operations. * feat(ec_balance): add configuration for EC shard balance task Config includes imbalance threshold, min server count, collection filter, disk type, and preferred tags for tag-aware placement. * feat(ec_balance): add multi-phase EC shard balance detection algorithm Implements four detection phases adapted from the ec.balance shell command: 1. Duplicate shard detection and removal proposals 2. Cross-rack shard distribution balancing 3. Within-rack node-level shard balancing 4. Global shard count equalization across nodes Detection is side-effect-free: it builds an EC topology view from ActiveTopology and generates move proposals without executing them. * feat(ec_balance): add EC shard move task execution Implements the shard move sequence using the same VolumeEcShardsCopy, VolumeEcShardsMount, VolumeEcShardsUnmount, and VolumeEcShardsDelete RPCs as the shell ec.balance command. Supports both regular shard moves and dedup-phase deletions (unmount+delete without copy). * feat(ec_balance): add task registration and scheduling Register EC balance task definition with auto-config update support. Scheduling respects max concurrent limits and worker capabilities. * feat(ec_balance): add plugin handler for EC shard balance Implements the full plugin handler with detection, execution, admin and worker config forms, proposal building, and decision trace reporting. Supports collection/DC/disk type filtering, preferred tag placement, and configurable detection intervals. Auto-registered via init() with the handler registry. * test(ec_balance): add tests for detection algorithm and plugin handler Detection tests cover: duplicate shard detection, cross-rack imbalance, within-rack imbalance, global rebalancing, topology building, collection filtering, and edge cases. Handler tests cover: config derivation with clamping, proposal building, protobuf encode/decode round-trip, fallback parameter decoding, capability, and config policy round-trip. * fix(ec_balance): address PR review feedback and fix CI test failure - Update TestWorkerDefaultJobTypes to expect 6 handlers (was 5) - Extract threshold constants (ecBalanceMinImbalanceThreshold, etc.) to eliminate magic numbers in Descriptor and config derivation - Remove duplicate ShardIdsToUint32 helper (use erasure_coding package) - Add bounds checks for int64→int/uint32 conversions to fix CodeQL integer conversion warnings * fix(ec_balance): address code review findings storage_impact.go: - Add TaskTypeECBalance case returning shard-level reservation (ShardSlots: -1/+1) instead of falling through to default which incorrectly reserves a full volume slot on target. detection.go: - Use dc:rack composite key to avoid cross-DC rack name collisions. Only create rack entries after confirming node has matching disks. - Add exceedsImbalanceThreshold check to cross-rack, within-rack, and global phases so trivial skews below the configured threshold are ignored. Dedup phase always runs since duplicates are errors. - Reserve destination capacity after each planned move (decrement destNode.freeSlots, update rackShardCount/nodeShardCount) to prevent overbooking the same destination. - Skip nodes with freeSlots <= 0 when selecting minNode in global balance to avoid proposing moves to full nodes. - Include loop index and source/target node IDs in TaskID to guarantee uniqueness across moves with the same volumeID/shardID. ec_balance_handler.go: - Fail fast with error when shard_id is absent in fallback parameter decoding instead of silently defaulting to shard 0. ec_balance_task.go: - Delegate GetProgress() to BaseTask.GetProgress() so progress updates from ReportProgressWithStage are visible to callers. - Add fail-fast guard rejecting multiple sources/targets until batch execution is implemented. Findings verified but not changed (matches existing codebase pattern in vacuum/balance/erasure_coding handlers): - register.go globalTaskDef.Config race: same unsynchronized pattern in all 4 task packages. - CreateTask using generated ID: same fmt.Sprintf pattern in all 4 task packages. * fix(ec_balance): harden parameter decoding, progress tracking, and validation ec_balance_handler.go (decodeECBalanceTaskParams): - Validate execution-critical fields (Sources[0].Node, ShardIds, Targets[0].Node, ShardIds) after protobuf deserialization. - Require source_disk_id and target_disk_id in legacy fallback path so Targets[0].DiskId is populated for VolumeEcShardsCopyRequest. - All error messages reference decodeECBalanceTaskParams and the specific missing field (TaskParams, shard_id, Targets[0].DiskId, EcBalanceTaskParams) for debuggability. ec_balance_task.go: - Track progress in ECBalanceTask.progress field, updated via reportProgress() helper called before ReportProgressWithStage(), so GetProgress() returns real stage progress instead of stale 0. - Validate: require exactly 1 source and 1 target (mirrors Execute guard), require ShardIds on both, with error messages referencing ECBalanceTask.Validate and the specific field. * fix(ec_balance): fix dedup execution path, stale topology, collection filter, timeout, and dedupeKey detection.go: - Dedup moves now set target=source so isDedupPhase() triggers the unmount+delete-only execution path instead of attempting a copy. - Apply moves to in-memory topology between phases via applyMovesToTopology() so subsequent phases see updated shard placement and don't conflict with already-planned moves. - detectGlobalImbalance now accepts allowedVids and filters both shard counting and shard selection to respect CollectionFilter. ec_balance_task.go: - Apply EcBalanceTaskParams.TimeoutSeconds to the context via context.WithTimeout so all RPC operations respect the configured timeout instead of hanging indefinitely. ec_balance_handler.go: - Include source node ID in dedupeKey so dedup deletions from different source nodes for the same shard aren't collapsed. - Clamp minServerCountRaw and minIntervalRaw lower bounds on int64 before narrowing to int, preventing undefined overflow on 32-bit. * fix(ec_balance): log warning before cancelling on progress send failure Log the error, job ID, job type, progress percentage, and stage before calling execCancel() in the progress callback so failed progress sends are diagnosable instead of silently cancelling.	2026-03-14 21:34:53 -07:00
Chris Lu	baae672b6f	feat: auto-disable master vacuum when plugin worker is active (#8624 ) * feat: auto-disable master vacuum when plugin vacuum worker is active When a vacuum-capable plugin worker connects to the admin server, the admin server calls DisableVacuum on the master to prevent the automatic scheduled vacuum from conflicting with the plugin worker's vacuum. When the worker disconnects, EnableVacuum is called to restore the default behavior. A safety net in the topology refresh loop re-enables vacuum if the admin server disconnects without cleanup. * rename isAdminServerConnected to isAdminServerConnectedFunc * add 5s timeout to DisableVacuum/EnableVacuum gRPC calls Prevents the monitor goroutine from blocking indefinitely if the master is unresponsive. * track plugin ownership of vacuum disable to avoid overriding operator - Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum is called while admin server is connected (i.e., by plugin monitor) - Safety net only re-enables vacuum when it was disabled by plugin, not when an operator intentionally disabled it via shell command - EnableVacuum clears the plugin flag * extract syncVacuumState for testability, add fake toggler tests Extract the single sync step into syncVacuumState() with a vacuumToggler interface. Add TestSyncVacuumState with a fake toggler that verifies disable/enable calls on state transitions. * use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin Both fields are written by gRPC handlers and read by the vacuum goroutine, causing a data race. Use atomic.Bool with Store/Load for thread-safe access. * use explicit by_plugin field instead of connection heuristic Add by_plugin bool to DisableVacuumRequest proto so the caller declares intent explicitly. The admin server monitor sets it to true; shell commands leave it false. This prevents an operator's intentional disable from being auto-reversed by the safety net. * use setter for admin server callback instead of function parameter Move isAdminServerConnected from StartRefreshWritableVolumes parameter to Topology.SetAdminServerConnectedFunc() setter. Keeps the function signature stable and decouples the topology layer from the admin server concept. * suppress repeated log messages on persistent sync failures Add retrying parameter to syncVacuumState so the initial state transition is logged at V(0) but subsequent retries of the same transition are silent until the call succeeds. * clear plugin ownership flag on manual DisableVacuum Prevents stale plugin flag from causing incorrect auto-enable when an operator manually disables vacuum after a plugin had previously disabled it. * add by_plugin to EnableVacuumRequest for symmetric ownership tracking Plugin-driven EnableVacuum now only re-enables if the plugin was the one that disabled it. If an operator manually disabled vacuum after the plugin, the plugin's EnableVacuum is a no-op. This prevents the plugin monitor from overriding operator intent on worker disconnect. * use cancellable context for monitorVacuumWorker goroutine Replace context.Background() with a cancellable context stored as bgCancel on AdminServer. Shutdown() calls bgCancel() so monitorVacuumWorker exits cleanly via ctx.Done(). * track operator and plugin vacuum disables independently Replace single isDisableVacuum flag with two independent flags: vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller only flips its own flag. The effective disabled state is the OR of both. This prevents a plugin connect/disconnect cycle from overriding an operator's manual disable, and vice versa. * fix safety net to clear plugin flag, not operator flag The safety net should call EnableVacuumByPlugin() to clear only the plugin disable flag when the admin server disconnects. The previous call to EnableVacuum() incorrectly cleared the operator flag instead.	2026-03-13 22:49:12 -07:00
Chris Lu	f3c5ba3cd6	feat(filer): add lazy directory listing for remote mounts (#8615 ) * feat(filer): add lazy directory listing for remote mounts Directory listings on remote mounts previously only queried the local filer store. With lazy mounts the listing was empty; with eager mounts it went stale over time. Add on-demand directory listing that fetches from remote and caches results with a 5-minute TTL: - Add `ListDirectory` to `RemoteStorageClient` interface (delimiter-based, single-level listing, separate from recursive `Traverse`) - Implement in S3, GCS, and Azure backends using each platform's hierarchical listing API - Add `maybeLazyListFromRemote` to filer: before each directory listing, check if the directory is under a remote mount with an expired cache, fetch from remote, persist entries to the local store, then let existing listing logic run on the populated store - Use singleflight to deduplicate concurrent requests for the same directory - Skip local-only entries (no RemoteEntry) to avoid overwriting unsynced uploads - Errors are logged and swallowed (availability over consistency) * refactor: extract xattr key to constant xattrRemoteListingSyncedAt * feat: make listing cache TTL configurable per mount via listing_cache_ttl_seconds Add listing_cache_ttl_seconds field to RemoteStorageLocation protobuf. When 0 (default), lazy directory listing is disabled for that mount. When >0, enables on-demand directory listing with the specified TTL. Expose as -listingCacheTTL flag on remote.mount command. * refactor: address review feedback for lazy directory listing - Add context.Context to ListDirectory interface and all implementations - Capture startTime before remote call for accurate TTL tracking - Simplify S3 ListDirectory using ListObjectsV2PagesWithContext - Make maybeLazyListFromRemote return void (errors always swallowed) - Remove redundant trailing-slash path manipulation in caller - Update tests to match new signatures * When an existing entry has Remote != nil, we should merge remote metadata into it rather than replacing it. * fix(gcs): wrap ListDirectory iterator error with context The raw iterator error was returned without bucket/path context, making it harder to debug. Wrap it consistently with the S3 pattern. * fix(s3): guard against nil pointer dereference in Traverse and ListDirectory Some S3-compatible backends may return nil for LastModified, Size, or ETag fields. Check for nil before dereferencing to prevent panics. * fix(filer): remove blanket 2-minute timeout from lazy listing context Individual SDK operations (S3, GCS, Azure) already have per-request timeouts and retry policies. The blanket timeout could cut off large directory listings mid-operation even though individual pages were succeeding. * fix(filer): preserve trace context in lazy listing with WithoutCancel Use context.WithoutCancel(ctx) instead of context.Background() so trace/span values from the incoming request are retained for distributed tracing, while still decoupling cancellation. * fix(filer): use Store.FindEntry for internal lookups, add Uid/Gid to files, fix updateDirectoryListingSyncedAt - Use f.Store.FindEntry instead of f.FindEntry for staleness check and child lookups to avoid unnecessary lazy-fetch overhead - Set OS_UID/OS_GID on new file entries for consistency with directories - In updateDirectoryListingSyncedAt, use Store.UpdateEntry for existing directories instead of CreateEntry to avoid deleteChunksIfNotNew and NotifyUpdateEvent side effects * fix(filer): distinguish not-found from store errors in lazy listing Previously, any error from Store.FindEntry was treated as "not found," which could cause entry recreation/overwrite on transient DB failures. Now check for filer_pb.ErrNotFound explicitly and skip entries or bail out on real store errors. * refactor(filer): use errors.Is for ErrNotFound comparisons	2026-03-13 09:36:54 -07:00
Chris Lu	5f85bf5e8a	Batch volume balance: run multiple moves per job (#8561 ) * proto: add BalanceMoveSpec and batch fields to BalanceTaskParams Add BalanceMoveSpec message for encoding individual volume moves, and max_concurrent_moves + repeated moves fields to BalanceTaskParams to support batching multiple volume moves in a single job. * balance handler: add batch execution with concurrent volume moves Refactor Execute() into executeSingleMove() (backward compatible) and executeBatchMoves() which runs multiple volume moves concurrently using a semaphore-bounded goroutine pool. When BalanceTaskParams.Moves is populated, the batch path is taken; otherwise the single-move path. Includes aggregate progress reporting across concurrent moves, per-move error collection, and partial failure support. * balance handler: add batch config fields to Descriptor and worker config Add max_concurrent_moves and batch_size fields to the worker config form and deriveBalanceWorkerConfig(). These control how many volume moves run concurrently within a batch job and the maximum batch size. * balance handler: group detection proposals into batch jobs When batch_size > 1, the Detect method groups detection results into batch proposals where each proposal encodes multiple BalanceMoveSpec entries in BalanceTaskParams.Moves. Single-result batches fall back to the existing single-move proposal format for backward compatibility. * admin UI: add volume balance execution plan and batch badge Add renderBalanceExecutionPlan() for rich rendering of volume balance jobs in the job detail modal. Single-move jobs show source/target/volume info; batch jobs show a moves table with all volume moves. Add batch badge (e.g., "5 moves") next to job type in the execution jobs table when the job has batch=true label. * Update plugin_templ.go * fix: detection algorithm uses greedy target instead of divergent topology scores The detection loop tracked effective volume counts via an adjustments map, but createBalanceTask independently called planBalanceDestination which used the topology's LoadCount — a separate, unadjusted source of truth. This divergence caused multiple moves to pile onto the same server. Changes: - Add resolveBalanceDestination to resolve the detection loop's greedy target (minServer) rather than independently picking a destination - Add oscillation guard: stop when max-min <= 1 since no single move can improve the balance beyond that point - Track unseeded destinations: if a target server wasn't in the initial serverVolumeCounts, add it so subsequent iterations include it - Add TestDetection_UnseededDestinationDoesNotOverload * fix: handler force_move propagation, partial failure, deterministic dedupe - Propagate ForceMove from outer BalanceTaskParams to individual move TaskParams so batch moves respect the force_move flag - Fix partial failure: mark job successful if at least one move succeeded (succeeded > 0 \|\| failed == 0) to avoid re-running already-completed moves on retry - Use SHA-256 hash for deterministic dedupe key fallback instead of time.Now().UnixNano() which is non-deterministic - Remove unused successDetails variable - Extract maxProposalStringLength constant to replace magic number 200 * admin UI: use template literals in balance execution plan rendering * fix: integration test handles batch proposals from batched detection With batch_size=20, all moves are grouped into a single proposal containing BalanceParams.Moves instead of top-level Sources/Targets. Update assertions to handle both batch and single-move proposal formats. * fix: verify volume size on target before deleting source during balance Add a pre-delete safety check that reads the volume file status on both source and target, then compares .dat file size and file count. If they don't match, the move is aborted — leaving the source intact rather than risking irreversible data loss. Also removes the redundant mountVolume call since VolumeCopy already mounts the volume on the target server. * fix: clamp maxConcurrent, serialize progress sends, validate config as int64 - Clamp maxConcurrentMoves to defaultMaxConcurrentMoves before creating the semaphore so a stale or malicious job cannot request unbounded concurrent volume moves - Extend progressMu to cover sender.SendProgress calls since the underlying gRPC stream is not safe for concurrent writes - Perform bounds checks on max_concurrent_moves and batch_size in int64 space before casting to int, avoiding potential overflow on 32-bit * fix: check disk capacity in resolveBalanceDestination Skip disks where VolumeCount >= MaxVolumeCount so the detection loop does not propose moves to a full disk that would fail at execution time. * test: rename unseeded destination test to match actual behavior The test exercises a server with 0 volumes that IS seeded from topology (matching disk type), not an unseeded destination. Rename to TestDetection_ZeroVolumeServerIncludedInBalance and fix comments. * test: tighten integration test to assert exactly one batch proposal With default batch_size=20, all moves should be grouped into a single batch proposal. Assert len(proposals)==1 and require BalanceParams with Moves, removing the legacy single-move else branch. * fix: propagate ctx to RPCs and restore source writability on abort - All helper methods (markVolumeReadonly, copyVolume, tailVolume, readVolumeFileStatus, deleteVolume) now accept a context parameter instead of using context.Background(), so Execute's ctx propagates cancellation and timeouts into every volume server RPC - Add deferred cleanup that restores the source volume to writable if any step after markVolumeReadonly fails, preventing the source from being left permanently readonly on abort - Add markVolumeWritable helper using VolumeMarkWritableRequest * fix: deep-copy protobuf messages in test recording sender Use proto.Clone in recordingExecutionSender to store immutable snapshots of JobProgressUpdate and JobCompleted, preventing assertions from observing mutations if the handler reuses message pointers. * fix: add VolumeMarkWritable and ReadVolumeFileStatus to fake volume server The balance task now calls ReadVolumeFileStatus for pre-delete verification and VolumeMarkWritable to restore writability on abort. Add both RPCs to the test fake, and drop the mountCalls assertion since BalanceTask no longer calls VolumeMount directly (VolumeCopy handles it). * fix: use maxConcurrentMovesLimit (50) for clamp, not defaultMaxConcurrentMoves defaultMaxConcurrentMoves (5) is the fallback when the field is unset, not an upper bound. Clamping to it silently overrides valid config values like 10/20/50. Introduce maxConcurrentMovesLimit (50) matching the descriptor's MaxValue and clamp to that instead. * fix: cancel batch moves on progress stream failure Derive a cancellable batchCtx from the caller's ctx. If sender.SendProgress returns an error (client disconnect, context cancelled), capture it, skip further sends, and cancel batchCtx so in-flight moves abort via their propagated context rather than running blind to completion. * fix: bound cleanup timeout and validate batch move fields - Use a 30-second timeout for the deferred markVolumeWritable cleanup instead of context.Background() which can block indefinitely if the volume server is unreachable - Validate required fields (VolumeID, SourceNode, TargetNode) before appending moves to a batch proposal, skipping invalid entries - Fall back to a single-move proposal when filtering leaves only one valid move in a batch * fix: cancel task execution on SendProgress stream failure All handler progress callbacks previously ignored SendProgress errors, allowing tasks to continue executing after the client disconnected. Now each handler creates a derived cancellable context and cancels it on the first SendProgress error, stopping the in-flight task promptly. Handlers fixed: erasure_coding, vacuum, volume_balance (single-move), and admin_script (breaks command loop on send failure). * fix: validate batch moves before scheduling in executeBatchMoves Reject empty batches, enforce a hard upper bound (100 moves), and filter out nil or incomplete move specs (missing source/target/volume) before allocating progress tracking and launching goroutines. * test: add batch balance execution integration test Tests the batch move path with 3 volumes, max concurrency 2, using fake volume servers. Verifies all moves complete with correct readonly, copy, tail, and delete RPC counts. * test: add MarkWritableCount and ReadFileStatusCount accessors Expose the markWritableCalls and readFileStatusCalls counters on the fake volume server, following the existing MarkReadonlyCount pattern. * fix: oscillation guard uses global effective counts for heterogeneous capacity The oscillation guard (max-min <= 1) previously used maxServer/minServer which are determined by utilization ratio. With heterogeneous capacity, maxServer by utilization can have fewer raw volumes than minServer, producing a negative diff and incorrectly triggering the guard. Now scans all servers' effective counts to find the true global max/min volume counts, so the guard works correctly regardless of whether utilization-based or raw-count balancing is used. * fix: admin script handler breaks outer loop on SendProgress failure The break on SendProgress error inside the shell.Commands scan only exited the inner loop, letting the outer command loop continue executing commands on a broken stream. Use a sendBroken flag to propagate the break to the outer execCommands loop.	2026-03-09 19:30:08 -07:00
Chris Lu	992db11d2b	iam: add IAM group management (#8560 ) * iam: add Group message to protobuf schema Add Group message (name, members, policy_names, disabled) and add groups field to S3ApiConfiguration for IAM group management support (issue #7742). * iam: add group CRUD to CredentialStore interface and all backends Add group management methods (CreateGroup, GetGroup, DeleteGroup, ListGroups, UpdateGroup) to the CredentialStore interface with implementations for memory, filer_etc, postgres, and grpc stores. Wire group loading/saving into filer_etc LoadConfiguration and SaveConfiguration. * iam: add group IAM response types Add XML response types for group management IAM actions: CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy, ListAttachedGroupPolicies, ListGroupsForUser. * iam: add group management handlers to embedded IAM API Add CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy, ListAttachedGroupPolicies, and ListGroupsForUser handlers with dispatch in ExecuteAction. * iam: add group management handlers to standalone IAM API Add group handlers (CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy, ListAttachedGroupPolicies, ListGroupsForUser) and wire into DoActions dispatch. Also add helper functions for user/policy side effects. * iam: integrate group policies into authorization Add groups and userGroups reverse index to IdentityAccessManagement. Populate both maps during ReplaceS3ApiConfiguration and MergeS3ApiConfiguration. Modify evaluateIAMPolicies to evaluate policies from user's enabled groups in addition to user policies. Update VerifyActionPermission to consider group policies when checking hasAttachedPolicies. * iam: add group side effects on user deletion and rename When a user is deleted, remove them from all groups they belong to. When a user is renamed, update group membership references. Applied to both embedded and standalone IAM handlers. * iam: watch /etc/iam/groups directory for config changes Add groups directory to the filer subscription watcher so group file changes trigger IAM configuration reloads. * admin: add group management page to admin UI Add groups page with CRUD operations, member management, policy attachment, and enable/disable toggle. Register routes in admin handlers and add Groups entry to sidebar navigation. * test: add IAM group management integration tests Add comprehensive integration tests for group CRUD, membership, policy attachment, policy enforcement, disabled group behavior, user deletion side effects, and multi-group membership. Add "group" test type to CI matrix in s3-iam-tests workflow. * iam: address PR review comments for group management - Fix XSS vulnerability in groups.templ: replace innerHTML string concatenation with DOM APIs (createElement/textContent) for rendering member and policy lists - Use userGroups reverse index in embedded IAM ListGroupsForUser for O(1) lookup instead of iterating all groups - Add buildUserGroupsIndex helper in standalone IAM handlers; use it in ListGroupsForUser and removeUserFromAllGroups for efficient lookup - Add note about gRPC store load-modify-save race condition limitation * iam: add defensive copies, validation, and XSS fixes for group management - Memory store: clone groups on store/retrieve to prevent mutation - Admin dash: deep copy groups before mutation, validate user/policy exists - HTTP handlers: translate credential errors to proper HTTP status codes, use bool for Enabled field to distinguish missing vs false - Groups templ: use data attributes + event delegation instead of inline onclick for XSS safety, prevent stale async responses iam: add explicit group methods to PropagatingCredentialStore Add CreateGroup, GetGroup, DeleteGroup, ListGroups, and UpdateGroup methods instead of relying on embedded interface fallthrough. Group changes propagate via filer subscription so no RPC propagation needed. * iam: detect postgres unique constraint violation and add groups index Return ErrGroupAlreadyExists when INSERT hits SQLState 23505 instead of a generic error. Add index on groups(disabled) for filtered queries. * iam: add Marker field to group list response types Add Marker string field to GetGroupResult, ListGroupsResult, ListAttachedGroupPoliciesResult, and ListGroupsForUserResult to match AWS IAM pagination response format. * iam: check group attachment before policy deletion Reject DeletePolicy if the policy is attached to any group, matching AWS IAM behavior. Add PolicyArn to ListAttachedGroupPolicies response. * iam: include group policies in IAM authorization Merge policy names from user's enabled groups into the IAMIdentity used for authorization, so group-attached policies are evaluated alongside user-attached policies. * iam: check for name collision before renaming user in UpdateUser Scan identities and inline policies for newUserName before mutating, returning EntityAlreadyExists if a collision is found. Reuse the already-loaded policies instead of loading them again inside the loop. * test: use t.Cleanup for bucket cleanup in group policy test * iam: wrap ErrUserNotInGroup sentinel in RemoveGroupMember error Wrap credential.ErrUserNotInGroup so errors.Is works in groupErrorToHTTPStatus, returning proper 400 instead of 500. * admin: regenerate groups_templ.go with XSS-safe data attributes Regenerated from groups.templ which uses data-group-name attributes instead of inline onclick with string interpolation. * iam: add input validation and persist groups during migration - Validate nil/empty group name in CreateGroup and UpdateGroup - Save groups in migrateToMultiFile so they survive legacy migration * admin: use groupErrorToHTTPStatus in GetGroupMembers and GetGroupPolicies * iam: short-circuit UpdateUser when newUserName equals current name * iam: require empty PolicyNames before group deletion Reject DeleteGroup when group has attached policies, matching the existing members check. Also fix GetGroup error handling in DeletePolicy to only skip ErrGroupNotFound, not all errors. * ci: add weed/pb/** to S3 IAM test trigger paths * test: replace time.Sleep with require.Eventually for propagation waits Use polling with timeout instead of fixed sleeps to reduce flakiness in integration tests waiting for IAM policy propagation. * fix: use credentialManager.GetPolicy for AttachGroupPolicy validation Policies created via CreatePolicy through credentialManager are stored in the credential store, not in s3cfg.Policies (which only has static config policies). Change AttachGroupPolicy to use credentialManager.GetPolicy() for policy existence validation. * feat: add UpdateGroup handler to embedded IAM API Add UpdateGroup action to enable/disable groups and rename groups via the IAM API. This is a SeaweedFS extension (not in AWS SDK) used by tests to toggle group disabled status. * fix: authenticate raw IAM API calls in group tests The embedded IAM endpoint rejects anonymous requests. Replace callIAMAPI with callIAMAPIAuthenticated that uses JWT bearer token authentication via the test framework. * feat: add UpdateGroup handler to standalone IAM API Mirror the embedded IAM UpdateGroup handler in the standalone IAM API for parity. * fix: add omitempty to Marker XML tags in group responses Non-truncated responses should not emit an empty <Marker/> element. * fix: distinguish backend errors from missing policies in AttachGroupPolicy Return ServiceFailure for credential manager errors instead of masking them as NoSuchEntity. Also switch ListGroupsForUser to use s3cfg.Groups instead of in-memory reverse index to avoid stale data. Add duplicate name check to UpdateGroup rename. * fix: standalone IAM AttachGroupPolicy uses persisted policy store Check managed policies from GetPolicies() instead of s3cfg.Policies so dynamically created policies are found. Also add duplicate name check to UpdateGroup rename. * fix: rollback inline policies on UpdateUser PutPolicies failure If PutPolicies fails after moving inline policies to the new username, restore both the identity name and the inline policies map to their original state to avoid a partial-write window. * fix: correct test cleanup ordering for group tests Replace scattered defers with single ordered t.Cleanup in each test to ensure resources are torn down in reverse-creation order: remove membership, detach policies, delete access keys, delete users, delete groups, delete policies. Move bucket cleanup to parent test scope and delete objects before bucket. * fix: move identity nil check before map lookup and refine hasAttachedPolicies Move the nil check on identity before accessing identity.Name to prevent panic. Also refine hasAttachedPolicies to only consider groups that are enabled and have actual policies attached, so membership in a no-policy group doesn't incorrectly trigger IAM authorization. * fix: fail group reload on unreadable or corrupt group files Return errors instead of logging and continuing when group files cannot be read or unmarshaled. This prevents silently applying a partial IAM config with missing group memberships or policies. * fix: use errors.Is for sql.ErrNoRows comparison in postgres group store * docs: explain why group methods skip propagateChange Group changes propagate to S3 servers via filer subscription (watching /etc/iam/groups/) rather than gRPC RPCs, since there are no group-specific RPCs in the S3 cache protocol. * fix: remove unused policyNameFromArn and strings import * fix: update service account ParentUser on user rename When renaming a user via UpdateUser, also update ParentUser references in service accounts to prevent them from becoming orphaned after the next configuration reload. * fix: wrap DetachGroupPolicy error with ErrPolicyNotAttached sentinel Use credential.ErrPolicyNotAttached so groupErrorToHTTPStatus maps it to 400 instead of falling back to 500. * fix: use admin S3 client for bucket cleanup in enforcement test The user S3 client may lack permissions by cleanup time since the user is removed from the group in an earlier subtest. Use the admin S3 client to ensure bucket and object cleanup always succeeds. * fix: add nil guard for group param in propagating store log calls Prevent potential nil dereference when logging group.Name in CreateGroup and UpdateGroup of PropagatingCredentialStore. * fix: validate Disabled field in UpdateGroup handlers Reject values other than "true" or "false" with InvalidInputException instead of silently treating them as false. * fix: seed mergedGroups from existing groups in MergeS3ApiConfiguration Previously the merge started with empty group maps, dropping any static-file groups. Now seeds from existing iam.groups before overlaying dynamic config, and builds the reverse index after merging to avoid stale entries from overridden groups. * fix: use errors.Is for filer_pb.ErrNotFound comparison in group loading Replace direct equality (==) with errors.Is() to correctly match wrapped errors, consistent with the rest of the codebase. * fix: add ErrUserNotFound and ErrPolicyNotFound to groupErrorToHTTPStatus Map these sentinel errors to 404 so AddGroupMember and AttachGroupPolicy return proper HTTP status codes. * fix: log cleanup errors in group integration tests Replace fire-and-forget cleanup calls with error-checked versions that log failures via t.Logf for debugging visibility. * fix: prevent duplicate group test runs in CI matrix The basic lane's -run "TestIAM" regex also matched TestIAMGroup* tests, causing them to run in both the basic and group lanes. Replace with explicit test function names. * fix: add GIN index on groups.members JSONB for membership lookups Without this index, ListGroupsForUser and membership queries require full table scans on the groups table. * fix: handle cross-directory moves in IAM config subscription When a file is moved out of an IAM directory (e.g., /etc/iam/groups), the dir variable was overwritten with NewParentPath, causing the source directory change to be missed. Now also notifies handlers about the source directory for cross-directory moves. * fix: validate members/policies before deleting group in admin handler AdminServer.DeleteGroup now checks for attached members and policies before delegating to credentialManager, matching the IAM handler guards. * fix: merge groups by name instead of blind append during filer load Match the identity loader's merge behavior: find existing group by name and replace, only append when no match exists. Prevents duplicates when legacy and multi-file configs overlap. * fix: check DeleteEntry response error when cleaning obsolete group files Capture and log resp.Error from filer DeleteEntry calls during group file cleanup, matching the pattern used in deleteGroupFile. * fix: verify source user exists before no-op check in UpdateUser Reorder UpdateUser to find the source identity first and return NoSuchEntityException if not found, before checking if the rename is a no-op. Previously a non-existent user renamed to itself would incorrectly return success. * fix: update service account parent refs on user rename in embedded IAM The embedded IAM UpdateUser handler updated group membership but not service account ParentUser fields, unlike the standalone handler. * fix: replay source-side events for all handlers on cross-dir moves Pass nil newEntry to bucket, IAM, and circuit-breaker handlers for the source directory during cross-directory moves, so all watchers can clear caches for the moved-away resource. * fix: don't seed mergedGroups from existing iam.groups in merge Groups are always dynamic (from filer), never static (from s3.config). Seeding from iam.groups caused stale deleted groups to persist. Now only uses config.Groups from the dynamic filer config. * fix: add deferred user cleanup in TestIAMGroupUserDeletionSideEffect Register t.Cleanup for the created user so it gets cleaned up even if the test fails before the inline DeleteUser call. * fix: assert UpdateGroup HTTP status in disabled group tests Add require.Equal checks for 200 status after UpdateGroup calls so the test fails immediately on API errors rather than relying on the subsequent Eventually timeout. * fix: trim whitespace from group name in filer store operations Trim leading/trailing whitespace from group.Name before validation in CreateGroup and UpdateGroup to prevent whitespace-only filenames. Also merge groups by name during multi-file load to prevent duplicates. * fix: add nil/empty group validation in gRPC store Guard CreateGroup and UpdateGroup against nil group or empty name to prevent panics and invalid persistence. * fix: add nil/empty group validation in postgres store Guard CreateGroup and UpdateGroup against nil group or empty name to prevent panics from nil member access and empty-name row inserts. * fix: add name collision check in embedded IAM UpdateUser The embedded IAM handler renamed users without checking if the target name already existed, unlike the standalone handler. * fix: add ErrGroupNotEmpty sentinel and map to HTTP 409 AdminServer.DeleteGroup now wraps conflict errors with ErrGroupNotEmpty, and groupErrorToHTTPStatus maps it to 409 Conflict instead of 500. * fix: use appropriate error message in GetGroupDetails based on status Return "Group not found" only for 404, use "Failed to retrieve group" for other error statuses instead of always saying "Group not found". * fix: use backend-normalized group.Name in CreateGroup response After credentialManager.CreateGroup may normalize the name (e.g., trim whitespace), use group.Name instead of the raw input for the returned GroupData to ensure consistency. * fix: add nil/empty group validation in memory store Guard CreateGroup and UpdateGroup against nil group or empty name to prevent panics from nil pointer dereference on map access. * fix: reorder embedded IAM UpdateUser to verify source first Find the source identity before checking for collisions, matching the standalone handler's logic. Previously a non-existent user renamed to an existing name would get EntityAlreadyExists instead of NoSuchEntity. * fix: handle same-directory renames in metadata subscription Replay a delete event for the old entry name during same-directory renames so handlers like onBucketMetadataChange can clean up stale state for the old name. * fix: abort GetGroups on non-ErrGroupNotFound errors Only skip groups that return ErrGroupNotFound. Other errors (e.g., transient backend failures) now abort the handler and return the error to the caller instead of silently producing partial results. * fix: add aria-label and title to icon-only group action buttons Add accessible labels to View and Delete buttons so screen readers and tooltips provide meaningful context. * fix: validate group name in saveGroup to prevent invalid filenames Trim whitespace and reject empty names before writing group JSON files, preventing creation of files like ".json". * fix: add /etc/iam/groups to filer subscription watched directories The groups directory was missing from the watched directories list, so S3 servers in a cluster would not detect group changes made by other servers via filer. The onIamConfigChange handler already had code to handle group directory changes but it was never triggered. * add direct gRPC propagation for group changes to S3 servers Groups now have the same dual propagation as identities and policies: direct gRPC push via propagateChange + async filer subscription. - Add PutGroup/RemoveGroup proto messages and RPCs - Add PutGroup/RemoveGroup in-memory cache methods on IAM - Add PutGroup/RemoveGroup gRPC server handlers - Update PropagatingCredentialStore to call propagateChange on group mutations * reduce log verbosity for config load summary Change ReplaceS3ApiConfiguration log from Infof to V(1).Infof to avoid noisy output on every config reload. * admin: show user groups in view and edit user modals - Add Groups field to UserDetails and populate from credential manager - Show groups as badges in user details view modal - Add group management to edit user modal: display current groups, add to group via dropdown, remove from group via badge x button * fix: remove duplicate showAlert that broke modal-alerts.js admin.js defined showAlert(type, message) which overwrote the modal-alerts.js version showAlert(message, type), causing broken unstyled alert boxes. Remove the duplicate and swap all callers in admin.js to use the correct (message, type) argument order. * fix: unwrap groups API response in edit user modal The /api/groups endpoint returns {"groups": [...]}, not a bare array. * Update object_store_users_templ.go * test: assert AccessDenied error code in group denial tests Replace plain assert.Error checks with awserr.Error type assertion and AccessDenied code verification, matching the pattern used in other IAM integration tests. * fix: propagate GetGroups errors in ShowGroups handler getGroupsPageData was swallowing errors and returning an empty page with 200 status. Now returns the error so ShowGroups can respond with a proper error status. * fix: reject AttachGroupPolicy when credential manager is nil Previously skipped policy existence validation when credentialManager was nil, allowing attachment of nonexistent policies. Now returns a ServiceFailureException error. * fix: preserve groups during partial MergeS3ApiConfiguration updates UpsertIdentity calls MergeS3ApiConfiguration with a partial config containing only the updated identity (nil Groups). This was wiping all in-memory group state. Now only replaces groups when config.Groups is non-nil (full config reload). * fix: propagate errors from group lookup in GetObjectStoreUserDetails ListGroups and GetGroup errors were silently ignored, potentially showing incomplete group data in the UI. * fix: use DOM APIs for group badge remove button to prevent XSS Replace innerHTML with onclick string interpolation with DOM createElement + addEventListener pattern. Also add aria-label and title to the add-to-group button. * fix: snapshot group policies under RLock to prevent concurrent map access evaluateIAMPolicies was copying the map reference via groupMap := iam.groups under RLock then iterating after RUnlock, while PutGroup mutates the map in-place. Now copies the needed policy names into a slice while holding the lock. * fix: add nil IAM check to PutGroup and RemoveGroup gRPC handlers Match the nil guard pattern used by PutPolicy/DeletePolicy to prevent nil pointer dereference when IAM is not initialized.	2026-03-09 11:54:32 -07:00
Chris Lu	3f946fc0c0	mount: make metadata cache rebuilds snapshot-consistent (#8531 ) * filer: expose metadata events and list snapshots * mount: invalidate hot directory caches * mount: read hot directories directly from filer * mount: add sequenced metadata cache applier * mount: apply metadata responses through cache applier * mount: replay snapshot-consistent directory builds * mount: dedupe self metadata events * mount: factor directory build cleanup * mount: replace proto marshal dedup with composite key and ring buffer The dedup logic was doing a full deterministic proto.Marshal on every metadata event just to produce a dedup key. Replace with a cheap composite string key (TsNs\|Directory\|OldName\|NewName). Also replace the sliding-window slice (which leaked the backing array unboundedly) with a fixed-size ring buffer that reuses the same array. * filer: remove mutex and proto.Clone from request-scoped MetadataEventSink MetadataEventSink is created per-request and only accessed by the goroutine handling the gRPC call. The mutex and double proto.Clone (once in Record, once in Last) were unnecessary overhead on every filer write operation. Store the pointer directly instead. * mount: skip proto.Clone for caller-owned metadata events Add ApplyMetadataResponseOwned that takes ownership of the response without cloning. Local metadata events (mkdir, create, flush, etc.) are freshly constructed and never shared, so the clone is unnecessary. * filer: only populate MetadataEvent on successful DeleteEntry Avoid calling eventSink.Last() on error paths where the sink may contain a partial event from an intermediate child deletion during recursive deletes. * mount: avoid map allocation in collectDirectoryNotifications Replace the map with a fixed-size array and linear dedup. There are at most 3 directories to notify (old parent, new parent, new child if directory), so a 3-element array avoids the heap allocation on every metadata event. * mount: fix potential deadlock in enqueueApplyRequest Release applyStateMu before the blocking channel send. Previously, if the channel was full (cap 128), the send would block while holding the mutex, preventing Shutdown from acquiring it to set applyClosed. * mount: restore signature-based self-event filtering as fast path Re-add the signature check that was removed when content-based dedup was introduced. Checking signatures is O(1) on a small slice and avoids enqueuing and processing events that originated from this mount instance. The content-based dedup remains as a fallback. * filer: send snapshotTsNs only in first ListEntries response The snapshot timestamp is identical for every entry in a single ListEntries stream. Sending it in every response message wastes wire bandwidth for large directories. The client already reads it only from the first response. * mount: exit read-through mode after successful full directory listing MarkDirectoryRefreshed was defined but never called, so directories that entered read-through mode (hot invalidation threshold) stayed there permanently, hitting the filer on every readdir even when cold. Call it after a complete read-through listing finishes. * mount: include event shape and full paths in dedup key The previous dedup key only used Names, which could collapse distinct rename targets. Include the event shape (C/D/U/R), source directory, new parent path, and both entry names so structurally different events are never treated as duplicates. * mount: drain pending requests on shutdown in runApplyLoop After receiving the shutdown sentinel, drain any remaining requests from applyCh non-blockingly and signal each with errMetaCacheClosed so callers waiting on req.done are released. * mount: include IsDirectory in synthetic delete events metadataDeleteEvent now accepts an isDirectory parameter so the applier can distinguish directory deletes from file deletes. Rmdir passes true, Unlink passes false. * mount: fall back to synthetic event when MetadataEvent is nil In mknod and mkdir, if the filer response omits MetadataEvent (e.g. older filer without the field), synthesize an equivalent local metadata event so the cache is always updated. * mount: make Flush metadata apply best-effort after successful commit After filer_pb.CreateEntryWithResponse succeeds, the entry is persisted. Don't fail the Flush syscall if the local metadata cache apply fails — log and invalidate the directory cache instead. Also fall back to a synthetic event when MetadataEvent is nil. * mount: make Rename metadata apply best-effort The rename has already succeeded on the filer by the time we apply the local metadata event. Log failures instead of returning errors that would be dropped by the caller anyway. * mount: make saveEntry metadata apply best-effort with fallback After UpdateEntryWithResponse succeeds, treat local metadata apply as non-fatal. Log and invalidate the directory cache on failure. Also fall back to a synthetic event when MetadataEvent is nil. * filer_pb: preserve snapshotTsNs on error in ReadDirAllEntriesWithSnapshot Return the snapshot timestamp even when the first page fails, so callers receive the snapshot boundary when partial data was received. * filer: send snapshot token for empty directory listings When no entries are streamed, send a final ListEntriesResponse with only SnapshotTsNs so clients always receive the snapshot boundary. * mount: distinguish not-found vs transient errors in lookupEntry Return fuse.EIO for non-not-found filer errors instead of unconditionally returning ENOENT, so transient failures don't masquerade as missing entries. * mount: make CacheRemoteObject metadata apply best-effort The file content has already been cached successfully. Don't fail the read if the local metadata cache update fails. * mount: use consistent snapshot for readdir in direct mode Capture the SnapshotTsNs from the first loadDirectoryEntriesDirect call and store it on the DirectoryHandle. Subsequent batch loads pass this stored timestamp so all batches use the same snapshot. Also export DoSeaweedListWithSnapshot so mount can use it directly with snapshot passthrough. * filer_pb: fix test fake to send SnapshotTsNs only on first response Match the server behavior: only the first ListEntriesResponse in a page carries the snapshot timestamp, subsequent entries leave it zero. * Fix nil pointer dereference in ListEntries stream consumers Remove the empty-directory snapshot-only response from ListEntries that sent a ListEntriesResponse with Entry==nil, which crashed every raw stream consumer that assumed resp.Entry is always non-nil. Also add defensive nil checks for resp.Entry in all raw ListEntries stream consumers across: S3 listing, broker topic lookup, broker topic config, admin dashboard, topic retention, hybrid message scanner, Kafka integration, and consumer offset storage. * Add nil guards for resp.Entry in remaining ListEntries stream consumers Covers: S3 object lock check, MQ management dashboard (version/ partition/offset loops), and topic retention version loop. * Make applyLocalMetadataEvent best-effort in Link and Symlink The filer operations already succeeded; failing the syscall because the local cache apply failed is wrong. Log a warning and invalidate the parent directory cache instead. * Make applyLocalMetadataEvent best-effort in Mkdir/Rmdir/Mknod/Unlink The filer RPC already committed; don't fail the syscall when the local metadata cache apply fails. Log a warning and invalidate the parent directory cache to force a re-fetch on next access. * flushFileMetadata: add nil-fallback for metadata event and best-effort apply Synthesize a metadata event when resp.GetMetadataEvent() is nil (matching doFlush), and make the apply best-effort with cache invalidation on failure. * Prevent double-invocation of cleanupBuild in doEnsureVisited Add a cleanupDone guard so the deferred cleanup and inline error-path cleanup don't both call DeleteFolderChildren/AbortDirectoryBuild. * Fix comment: signature check is O(n) not O(1) * Prevent deferred cleanup after successful CompleteDirectoryBuild Set cleanupDone before returning from the success path so the deferred context-cancellation check cannot undo a published build. * Invalidate parent directory caches on rename metadata apply failure When applyLocalMetadataEvent fails during rename, invalidate the source and destination parent directory caches so subsequent accesses trigger a re-fetch from the filer. * Add event nil-fallback and cache invalidation to Link and Symlink Synthesize metadata events when the server doesn't return one, and invalidate parent directory caches on apply failure. * Match requested partition when scanning partition directories Parse the partition range format (NNNN-NNNN) and match against the requested partition parameter instead of using the first directory. * Preserve snapshot timestamp across empty directory listings Initialize actualSnapshotTsNs from the caller-requested value so it isn't lost when the server returns no entries. Re-add the server-side snapshot-only response for empty directories (all raw stream consumers now have nil guards for Entry). * Fix CreateEntry error wrapping to support errors.Is/errors.As Use errors.New + %w instead of %v for resp.Error so callers can unwrap the underlying error. * Fix object lock pagination: only advance on non-nil entries Move entriesReceived inside the nil check so nil entries don't cause repeated ListEntries calls with the same lastFileName. * Guard Attributes nil check before accessing Mtime in MQ management * Do not send nil-Entry response for empty directory listings The snapshot-only ListEntriesResponse (with Entry == nil) for empty directories breaks consumers that treat any received response as an entry (Java FilerClient, S3 listing). The Go client-side DoSeaweedListWithSnapshot already preserves the caller-requested snapshot via actualSnapshotTsNs initialization, so the server-side send is unnecessary. * Fix review findings: subscriber dedup, invalidation normalization, nil guards, shutdown race - Remove self-signature early-return in processEventFn so all events flow through the applier (directory-build buffering sees self-originated events that arrive after a snapshot) - Normalize NewParentPath in collectEntryInvalidations to avoid duplicate invalidations when NewParentPath is empty (same-directory update) - Guard resp.Entry.Attributes for nil in admin_server.go and topic_retention.go to prevent panics on entries without attributes - Fix enqueueApplyRequest race with shutdown by using select on both applyCh and applyDone, preventing sends after the apply loop exits - Add cleanupDone check to deferred cleanup in meta_cache_init.go for clarity alongside the existing guard in cleanupBuild - Add empty directory test case for snapshot consistency * Propagate authoritative metadata event from CacheRemoteObjectToLocalCluster and generate client-side snapshot for empty directories - Add metadata_event field to CacheRemoteObjectToLocalClusterResponse proto so the filer-emitted event is available to callers - Use WithMetadataEventSink in the server handler to capture the event from NotifyUpdateEvent and return it on the response - Update filehandle_read.go to prefer the RPC's metadata event over a locally fabricated one, falling back to metadataUpdateEvent when the server doesn't provide one (e.g., older filers) - Generate a client-side snapshot cutoff in DoSeaweedListWithSnapshot when the server sends no snapshot (empty directory), so callers like CompleteDirectoryBuild get a meaningful boundary for filtering buffered events * Skip directory notifications for dirs being built to prevent mid-build cache wipe When a metadata event is buffered during a directory build, applyMetadataSideEffects was still firing noteDirectoryUpdate for the building directory. If the directory accumulated enough updates to become "hot", markDirectoryReadThrough would call DeleteFolderChildren, wiping entries that EnsureVisited had already inserted. The build would then complete and mark the directory cached with incomplete data. Fix by using applyMetadataSideEffectsSkippingBuildingDirs for buffered events, which suppresses directory notifications for dirs currently in buildingDirs while still applying entry invalidations. * Add test for directory notification suppression during active build TestDirectoryNotificationsSuppressedDuringBuild verifies that metadata events targeting a directory under active EnsureVisited build do NOT fire onDirectoryUpdate for that directory. In production, this prevents markDirectoryReadThrough from calling DeleteFolderChildren mid-build, which would wipe entries already inserted by the listing. The test inserts an entry during a build, sends multiple metadata events for the building directory, asserts no notifications fired for it, verifies the entry survives, and confirms buffered events are replayed after CompleteDirectoryBuild. * Fix create invalidations, build guard, event shape, context, and snapshot error path - collectEntryInvalidations: invalidate FUSE kernel cache on pure create events (OldEntry==nil && NewEntry!=nil), not just updates and deletes - completeDirectoryBuildNow: only call markCachedFn when an active build existed (state != nil), preventing an unpopulated directory from being marked as cached - Add metadataCreateEvent helper that produces a create-shaped event (NewEntry only, no OldEntry) and use it in mkdir, mknod, symlink, and hardlink create fallback paths instead of metadataUpdateEvent which incorrectly set both OldEntry and NewEntry - applyMetadataResponseEnqueue: use context.Background() for the queued mutation so a cancelled caller context cannot abort the apply loop mid-write - DoSeaweedListWithSnapshot: move snapshot initialization before ListEntries call so the error path returns the preserved snapshot instead of 0 * Fix review findings: test loop, cache race, context safety, snapshot consistency - Fix build test loop starting at i=1 instead of i=0, missing new-0.txt verification - Re-check IsDirectoryCached after cache miss to avoid ENOENT race with markDirectoryReadThrough - Use context.Background() in enqueueAndWait so caller cancellation can't abort build/complete mid-way - Pass dh.snapshotTsNs in skip-batch loadDirectoryEntriesDirect for snapshot consistency - Prefer resp.MetadataEvent over fallback in Unlink event derivation - Add comment on MetadataEventSink.Record single-event assumption * Fix empty-directory snapshot clock skew and build cancellation race Empty-directory snapshot: Remove client-side time.Now() synthesis when the server returns no entries. Instead return snapshotTsNs=0, and in completeDirectoryBuildNow replay ALL buffered events when snapshot is 0. This eliminates the clock-skew bug where a client ahead of the filer would filter out legitimate post-list events. Build cancellation: Use context.Background() for BeginDirectoryBuild and CompleteDirectoryBuild calls in doEnsureVisited, so errgroup cancellation doesn't cause enqueueAndWait to return early and trigger cleanupBuild while the operation is still queued. * Add tests for empty-directory build replay and cancellation resilience TestEmptyDirectoryBuildReplaysAllBufferedEvents: verifies that when CompleteDirectoryBuild receives snapshotTsNs=0 (empty directory, no server snapshot), ALL buffered events are replayed regardless of their TsNs values — no clock-skew-sensitive filtering occurs. TestBuildCompletionSurvivesCallerCancellation: verifies that once CompleteDirectoryBuild is enqueued, a cancelled caller context does not prevent the build from completing. The apply loop runs with context.Background(), so the directory becomes cached and buffered events are replayed even when the caller gives up waiting. * Fix directory subtree cleanup, Link rollback, test robustness - applyMetadataResponseLocked: when a directory entry is deleted or moved, call DeleteFolderChildren on the old path so cached descendants don't leak as stale entries. - Link: save original HardLinkId/Counter before mutation. If CreateEntryWithResponse fails after the source was already updated, rollback the source entry to its original state via UpdateEntry. - TestBuildCompletionSurvivesCallerCancellation: replace fixed time.Sleep(50ms) with a deadline-based poll that checks IsDirectoryCached in a loop, failing only after 2s timeout. - TestReadDirAllEntriesWithSnapshotEmptyDirectory: assert that ListEntries was actually invoked on the mock client so the test exercises the RPC path. - newMetadataEvent: add early return when both oldEntry and newEntry are nil to avoid emitting events with empty Directory. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-07 09:19:40 -08:00
Chris Lu	b3620c7e14	admin: auto migrating master maintenance scripts to admin_script plugin config (#8509 ) * admin: seed admin_script plugin config from master maintenance scripts When the admin server starts, fetch the maintenance scripts configuration from the master via GetMasterConfiguration. If the admin_script plugin worker does not already have a saved config, use the master's scripts as the default value. This enables seamless migration from master.toml [master.maintenance] to the admin script plugin worker. Changes: - Add maintenance_scripts and maintenance_sleep_minutes fields to GetMasterConfigurationResponse in master.proto - Populate the new fields from viper config in master_grpc_server.go - On admin server startup, fetch the master config and seed the admin_script plugin config if no config exists yet - Strip lock/unlock commands from the master scripts since the admin script worker handles locking automatically Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: address review comments on admin_script seeding - Replace TOCTOU race (separate Load+Save) with atomic SaveJobTypeConfigIfNotExists on ConfigStore and Plugin - Replace ineffective polling loop with single GetMaster call using 30s context timeout, since GetMaster respects context cancellation - Add unit tests for SaveJobTypeConfigIfNotExists (in-memory + on-disk) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: apply maintenance script defaults in gRPC handler The gRPC handler for GetMasterConfiguration read maintenance scripts from viper without calling SetDefault, relying on startAdminScripts having run first. If the admin server calls GetMasterConfiguration before startAdminScripts sets the defaults, viper returns empty strings and the seeding is silently skipped. Apply SetDefault in the gRPC handler itself so it is self-contained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Revert "fix: apply maintenance script defaults in gRPC handler" This reverts commit 068a5063303f6bc34825a07bb681adfa67e6f9de. * fix: use atomic save in ensureJobTypeConfigFromDescriptor ensureJobTypeConfigFromDescriptor used a separate Load + Save, racing with seedAdminScriptFromMaster. If the descriptor defaults (empty script) were saved first, SaveJobTypeConfigIfNotExists in the seeding goroutine would see an existing config and skip, losing the master's maintenance scripts. Switch to SaveJobTypeConfigIfNotExists so both paths are atomic. Whichever wins, the other is a safe no-op. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: fetch master scripts inline during config bootstrap, not in goroutine Replace the seedAdminScriptFromMaster goroutine with a ConfigDefaultsProvider callback. When the plugin bootstraps admin_script defaults from the worker descriptor, it calls the provider which fetches maintenance scripts from the master synchronously. This eliminates the race between the seeding goroutine and the descriptor-based config bootstrap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * skip commented lock unlock Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com> * reduce grpc calls --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-04 22:11:07 -08:00
Racci	9e26d6f5dd	fix: port in SNI address when using domainName instead of IP for master (#8500 )	2026-03-04 07:05:45 -08:00
Chris Lu	18ccc9b773	Plugin scheduler: sequential iterations with max runtime (#8496 ) * pb: add job type max runtime setting * plugin: default job type max runtime * plugin: redesign scheduler loop * admin ui: update scheduler settings * plugin: fix scheduler loop state name * plugin scheduler: restore backlog skip * plugin scheduler: drop legacy detection helper * admin api: require scheduler config body * admin ui: preserve detection interval on save * plugin scheduler: use job context and drain cancels * plugin scheduler: respect detection intervals * plugin scheduler: gate runs and drain queue * ec test: reuse req/resp vars * ec test: add scheduler debug logs * Adjust scheduler idle sleep and initial run delay * Clear pending job queue before scheduler runs * Log next detection time in EC integration test * Improve plugin scheduler debug logging in EC test * Expose scheduler next detection time * Log scheduler next detection time in EC test * Wake scheduler on config or worker updates * Expose scheduler sleep interval in UI * Fix scheduler sleep save value selection * Set scheduler idle sleep default to 613s * Show scheduler next run time in plugin UI --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-03 23:09:49 -08:00
Chris Lu	f5c35240be	Add volume dir tags and EC placement priority (#8472 ) * Add volume dir tags to topology Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add preferred tag config for EC Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Prioritize EC destinations by tags Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add EC placement planner tag tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Refactor EC placement tests to reuse buildActiveTopology Remove buildActiveTopologyWithDiskTags helper function and consolidate tag setup inline in test cases. Tests now use UpdateTopology to apply tags after topology creation, reusing the existing buildActiveTopology function rather than duplicating its logic. All tag scenario tests pass: - TestECPlacementPlannerPrefersTaggedDisks - TestECPlacementPlannerFallsBackWhenTagsInsufficient Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Consolidate normalizeTagList into shared util package Extract normalizeTagList from three locations (volume.go, detection.go, erasure_coding_handler.go) into new weed/util/tag.go as exported NormalizeTagList function. Replace all duplicate implementations with imports and calls to util.NormalizeTagList. This improves code reuse and maintainability by centralizing tag normalization logic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add PreferredTags to EC config persistence Add preferred_tags field to ErasureCodingTaskConfig protobuf with field number 5. Update GetConfigSpec to include preferred_tags field in the UI configuration schema. Add PreferredTags to ToTaskPolicy to serialize config to protobuf. Add PreferredTags to FromTaskPolicy to deserialize from protobuf with defensive copy to prevent external mutation. This allows EC preferred tags to be persisted and restored across worker restarts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add defensive copy for Tags slice in DiskLocation Copy the incoming tags slice in NewDiskLocation instead of storing by reference. This prevents external callers from mutating the DiskLocation.Tags slice after construction, improving encapsulation and preventing unexpected changes to disk metadata. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add doc comment to buildCandidateSets method Document the tiered candidate selection and fallback behavior. Explain that for a planner with preferredTags, it accumulates disks matching each tag in order into progressively larger tiers, emits a candidate set once a tier reaches shardsNeeded, and finally falls back to the full candidates set if preferred-tag tiers are insufficient. This clarifies the intended semantics for future maintainers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Apply final PR review fixes 1. Update parseVolumeTags to replicate single tag entry to all folders instead of leaving some folders with nil tags. This prevents nil pointer dereferences when processing folders without explicit tags. 2. Add defensive copy in ToTaskPolicy for PreferredTags slice to match the pattern used in FromTaskPolicy, preventing external mutation of the returned TaskPolicy. 3. Add clarifying comment in buildCandidateSets explaining that the shardsNeeded <= 0 branch is a defensive check for direct callers, since selectDestinations guarantees shardsNeeded > 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix nil pointer dereference in parseVolumeTags Ensure all folder tags are initialized to either normalized tags or empty slices, not nil. When multiple tag entries are provided and there are more folders than entries, remaining folders now get empty slices instead of nil, preventing nil pointer dereference in downstream code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix NormalizeTagList to return empty slice instead of nil Change NormalizeTagList to always return a non-nil slice. When all tags are empty or whitespace after normalization, return an empty slice instead of nil. This prevents nil pointer dereferences in downstream code that expects a valid (possibly empty) slice. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add nil safety check for v.tags pointer Add a safety check to handle the case where v.tags might be nil, preventing a nil pointer dereference. If v.tags is nil, use an empty string instead. This is defensive programming to prevent panics in edge cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add volume.tags flag to weed server and weed mini commands Add the volume.tags CLI option to both the 'weed server' and 'weed mini' commands. This allows users to specify disk tags when running the combined server modes, just like they can with 'weed volume'. The flag uses the same format and description as the volume command: comma-separated tag groups per data dir with ':' separators (e.g. fast:ssd,archive). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-01 10:22:00 -08:00
Chris Lu	c73e65ad5e	Add customizable plugin display names and weights (#8459 ) * feat: add customizable plugin display names and weights - Add weight field to JobTypeCapability proto message - Modify ListKnownJobTypes() to return JobTypeInfo with display names and weights - Modify ListPluginJobTypes() to return JobTypeInfo instead of string - Sort plugins by weight (descending) then alphabetically - Update admin API to return enriched job type metadata - Update plugin UI template to display names instead of IDs - Consolidate API by reusing existing function names instead of suffixed variants * perf: optimize plugin job type capability lookup and add null-safe parsing - Pre-calculate job type capabilities in a map to reduce O(nm) nested loops to O(n+m) lookup time in ListKnownJobTypes() - Add parseJobTypeItem() helper function for null-safe job type item parsing - Refactor plugin.templ to use parseJobTypeItem() in all job type access points (hasJobType, applyInitialNavigation, ensureActiveNavigation, renderTopTabs) - Deterministic capability resolution by using first worker's capability templ * refactor: use parseJobTypeItem helper consistently in plugin.templ Replace duplicated job type extraction logic at line 1296-1298 with parseJobTypeItem() helper function for consistency and maintainability. * improve: prefer richer capability metadata and add null-safety checks - Improve capability selection in ListKnownJobTypes() to prefer capabilities with non-empty DisplayName and higher Weight across all workers instead of first-wins approach. Handles mixed-version clusters better. - Add defensive null checks in renderJobTypeSummary() to safely access parseJobTypeItem() result before property access - Ensures malformed or missing entries won't break the rendering pipeline * fix: preserve existing DisplayName when merging capabilities Fix capability merge logic to respect existing DisplayName values: - If existing has DisplayName but candidate doesn't, preserve existing - If existing doesn't have DisplayName but candidate does, use candidate - Only use Weight comparison if DisplayName status is equal - Prevents higher-weight capabilities with empty DisplayName from overriding capabilities with non-empty DisplayName	2026-02-26 19:20:48 -08:00
Chris Lu	a2005cb2a6	fix: resolve gRPC DNS resolution issues in Kubernetes #8384 (#8387 ) * fix: resolve gRPC DNS resolution issues in Kubernetes #8384 - Replace direct `grpc.NewClient` calls with `pb.GrpcDial` for consistent connection establishment - Fix async DNS resolution behavior in K8s with `ndots:5` - Ensure high-level components use established helper for reliable networking * refactor: refine gRPC DNS fix and add documentation - Use instance's grpcDialOption in BrokerClient.ConfigureTopic - Add detailed comments to GrpcDial explaining Kubernetes DNS resolution rationale * fix: ensure proper context propagation in broker_client gRPC calls - Pass the provided `ctx` to `pb.GrpcDial` in `ConfigureTopic` and `GetUnflushedMessages` - Ensures that timeouts and cancellations are correctly honored during connection establishment * docs: refine gRPC resolver documentation and cleanup dead code - Enhanced documentation for `GrpcDial` with explicit warnings about global state mutation when using `resolver.SetDefaultScheme("passthrough")`. - Recommended `passthrough:///` prefix as the primary migration path for `grpc.NewClient`. - Removed dead commented-out code for `grpc.WithBlock()` and `grpc.WithTimeout()`.	2026-02-19 15:46:02 -08:00
Chris Lu	8ec9ff4a12	Refactor plugin system and migrate worker runtime (#8369 ) * admin: add plugin runtime UI page and route wiring * pb: add plugin gRPC contract and generated bindings * admin/plugin: implement worker registry, runtime, monitoring, and config store * admin/dash: wire plugin runtime and expose plugin workflow APIs * command: add flags to enable plugin runtime * admin: rename remaining plugin v2 wording to plugin * admin/plugin: add detectable job type registry helper * admin/plugin: add scheduled detection and dispatch orchestration * admin/plugin: prefetch job type descriptors when workers connect * admin/plugin: add known job type discovery API and UI * admin/plugin: refresh design doc to match current implementation * admin/plugin: enforce per-worker scheduler concurrency limits * admin/plugin: use descriptor runtime defaults for scheduler policy * admin/ui: auto-load first known plugin job type on page open * admin/plugin: bootstrap persisted config from descriptor defaults * admin/plugin: dedupe scheduled proposals by dedupe key * admin/ui: add job type and state filters for plugin monitoring * admin/ui: add per-job-type plugin activity summary * admin/plugin: split descriptor read API from schema refresh * admin/ui: keep plugin summary metrics global while tables are filtered * admin/plugin: retry executor reservation before timing out * admin/plugin: expose scheduler states for monitoring * admin/ui: show per-job-type scheduler states in plugin monitor * pb/plugin: rename protobuf package to plugin * admin/plugin: rename pluginRuntime wiring to plugin * admin/plugin: remove runtime naming from plugin APIs and UI * admin/plugin: rename runtime files to plugin naming * admin/plugin: persist jobs and activities for monitor recovery * admin/plugin: lease one detector worker per job type * admin/ui: show worker load from plugin heartbeats * admin/plugin: skip stale workers for detector and executor picks * plugin/worker: add plugin worker command and stream runtime scaffold * plugin/worker: implement vacuum detect and execute handlers * admin/plugin: document external vacuum plugin worker starter * command: update plugin.worker help to reflect implemented flow * command/admin: drop legacy Plugin V2 label * plugin/worker: validate vacuum job type and respect min interval * plugin/worker: test no-op detect when min interval not elapsed * command/admin: document plugin.worker external process * plugin/worker: advertise configured concurrency in hello * command/plugin.worker: add jobType handler selection * command/plugin.worker: test handler selection by job type * command/plugin.worker: persist worker id in workingDir * admin/plugin: document plugin.worker jobType and workingDir flags * plugin/worker: support cancel request for in-flight work * plugin/worker: test cancel request acknowledgements * command/plugin.worker: document workingDir and jobType behavior * plugin/worker: emit executor activity events for monitor * plugin/worker: test executor activity builder * admin/plugin: send last successful run in detection request * admin/plugin: send cancel request when detect or execute context ends * admin/plugin: document worker cancel request responsibility * admin/handlers: expose plugin scheduler states API in no-auth mode * admin/handlers: test plugin scheduler states route registration * admin/plugin: keep worker id on worker-generated activity records * admin/plugin: test worker id propagation in monitor activities * admin/dash: always initialize plugin service * command/admin: remove plugin enable flags and default to enabled * admin/dash: drop pluginEnabled constructor parameter * admin/plugin UI: stop checking plugin enabled state * admin/plugin: remove docs for plugin enable flags * admin/dash: remove unused plugin enabled check method * admin/dash: fallback to in-memory plugin init when dataDir fails * admin/plugin API: expose worker gRPC port in status * command/plugin.worker: resolve admin gRPC port via plugin status * split plugin UI into overview/configuration/monitoring pages * Update layout_templ.go * add volume_balance plugin worker handler * wire plugin.worker CLI for volume_balance job type * add erasure_coding plugin worker handler * wire plugin.worker CLI for erasure_coding job type * support multi-job handlers in plugin worker runtime * allow plugin.worker jobType as comma-separated list * admin/plugin UI: rename to Workers and simplify config view * plugin worker: queue detection requests instead of capacity reject * Update plugin_worker.go * plugin volume_balance: remove force_move/timeout from worker config UI * plugin erasure_coding: enforce local working dir and cleanup * admin/plugin UI: rename admin settings to job scheduling * admin/plugin UI: persist and robustly render detection results * admin/plugin: record and return detection trace metadata * admin/plugin UI: show detection process and decision trace * plugin: surface detector decision trace as activities * mini: start a plugin worker by default * admin/plugin UI: split monitoring into detection and execution tabs * plugin worker: emit detection decision trace for EC and balance * admin workers UI: split monitoring into detection and execution pages * plugin scheduler: skip proposals for active assigned/running jobs * admin workers UI: add job queue tab * plugin worker: add dummy stress detector and executor job type * admin workers UI: reorder tabs to detection queue execution * admin workers UI: regenerate plugin template * plugin defaults: include dummy stress and add stress tests * plugin dummy stress: rotate detection selections across runs * plugin scheduler: remove cross-run proposal dedupe * plugin queue: track pending scheduled jobs * plugin scheduler: wait for executor capacity before dispatch * plugin scheduler: skip detection when waiting backlog is high * plugin: add disk-backed job detail API and persistence * admin ui: show plugin job detail modal from job id links * plugin: generate unique job ids instead of reusing proposal ids * plugin worker: emit heartbeats on work state changes * plugin registry: round-robin tied executor and detector picks * add temporary EC overnight stress runner * plugin job details: persist and render EC execution plans * ec volume details: color data and parity shard badges * shard labels: keep parity ids numeric and color-only distinction * admin: remove legacy maintenance UI routes and templates * admin: remove dead maintenance endpoint helpers * Update layout_templ.go * remove dummy_stress worker and command support * refactor plugin UI to job-type top tabs and sub-tabs * migrate weed worker command to plugin runtime * remove plugin.worker command and keep worker runtime with metrics * update helm worker args for jobType and execution flags * set plugin scheduling defaults to global 16 and per-worker 4 * stress: fix RPC context reuse and remove redundant variables in ec_stress_runner * admin/plugin: fix lifecycle races, safe channel operations, and terminal state constants * admin/dash: randomize job IDs and fix priority zero-value overwrite in plugin API * admin/handlers: implement buffered rendering to prevent response corruption * admin/plugin: implement debounced persistence flusher and optimize BuildJobDetail memory lookups * admin/plugin: fix priority overwrite and implement bounded wait in scheduler reserve * admin/plugin: implement atomic file writes and fix run record side effects * admin/plugin: use P prefix for parity shard labels in execution plans * admin/plugin: enable parallel execution for cancellation tests * admin: refactor time.Time fields to pointers for better JSON omitempty support * admin/plugin: implement pointer-safe time assignments and comparisons in plugin core * admin/plugin: fix time assignment and sorting logic in plugin monitor after pointer refactor * admin/plugin: update scheduler activity tracking to use time pointers * admin/plugin: fix time-based run history trimming after pointer refactor * admin/dash: fix JobSpec struct literal in plugin API after pointer refactor * admin/view: add D/P prefixes to EC shard badges for UI consistency * admin/plugin: use lifecycle-aware context for schema prefetching * Update ec_volume_details_templ.go * admin/stress: fix proposal sorting and log volume cleanup errors * stress: refine ec stress runner with math/rand and collection name - Added Collection field to VolumeEcShardsDeleteRequest for correct filename construction. - Replaced crypto/rand with seeded math/rand PRNG for bulk payloads. - Added documentation for EcMinAge zero-value behavior. - Added logging for ignored errors in volume/shard deletion. * admin: return internal server error for plugin store failures Changed error status code from 400 Bad Request to 500 Internal Server Error for failures in GetPluginJobDetail to correctly reflect server-side errors. * admin: implement safe channel sends and graceful shutdown sync - Added sync.WaitGroup to Plugin struct to manage background goroutines. - Implemented safeSendCh helper using recover() to prevent panics on closed channels. - Ensured Shutdown() waits for all background operations to complete. * admin: robustify plugin monitor with nil-safe time and record init - Standardized nil-safe assignment for time.Time pointers (CreatedAt, UpdatedAt, CompletedAt). - Ensured persistJobDetailSnapshot initializes new records correctly if they don't exist on disk. - Fixed debounced persistence to trigger immediate write on job completion. admin: improve scheduler shutdown behavior and logic guards - Replaced brittle error string matching with explicit r.shutdownCh selection for shutdown detection. - Removed redundant nil guard in buildScheduledJobSpec. - Standardized WaitGroup usage for schedulerLoop. * admin: implement deep copy for job parameters and atomic write fixes - Implemented deepCopyGenericValue and used it in cloneTrackedJob to prevent shared state. - Ensured atomicWriteFile creates parent directories before writing. * admin: remove unreachable branch in shard classification Removed an unreachable 'totalShards <= 0' check in classifyShardID as dataShards and parityShards are already guarded. * admin: secure UI links and use canonical shard constants - Added rel="noopener noreferrer" to external links for security. - Replaced magic number 14 with erasure_coding.TotalShardsCount. - Used renderEcShardBadge for missing shard list consistency. * admin: stabilize plugin tests and fix regressions - Composed a robust plugin_monitor_test.go to handle asynchronous persistence. - Updated all time.Time literals to use timeToPtr helper. - Added explicit Shutdown() calls in tests to synchronize with debounced writes. - Fixed syntax errors and orphaned struct literals in tests. * Potential fix for code scanning alert no. 278: Slice memory allocation with excessive size value Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * Potential fix for code scanning alert no. 283: Uncontrolled data used in path expression Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * admin: finalize refinements for error handling, scheduler, and race fixes - Standardized HTTP 500 status codes for store failures in plugin_api.go. - Tracked scheduled detection goroutines with sync.WaitGroup for safe shutdown. - Fixed race condition in safeSendDetectionComplete by extracting channel under lock. - Implemented deep copy for JobActivity details. - Used defaultDirPerm constant in atomicWriteFile. * test(ec): migrate admin dockertest to plugin APIs * admin/plugin_api: fix RunPluginJobTypeAPI to return 500 for server-side detection/filter errors * admin/plugin_api: fix ExecutePluginJobAPI to return 500 for job execution failures * admin/plugin_api: limit parseProtoJSONBody request body to 1MB to prevent unbounded memory usage * admin/plugin: consolidate regex to package-level validJobTypePattern; add char validation to sanitizeJobID * admin/plugin: fix racy Shutdown channel close with sync.Once * admin/plugin: track sendLoop and recv goroutines in WorkerStream with r.wg * admin/plugin: document writeProtoFiles atomicity — .pb is source of truth, .json is human-readable only * admin/plugin: extract activityLess helper to deduplicate nil-safe OccurredAt sort comparators * test/ec: check http.NewRequest errors to prevent nil req panics * test/ec: replace deprecated ioutil/math/rand, fix stale step comment 5.1→3.1 * plugin(ec): raise default detection and scheduling throughput limits * topology: include empty disks in volume list and EC capacity fallback * topology: remove hard 10-task cap for detection planning * Update ec_volume_details_templ.go * adjust default * fix tests --------- Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>	2026-02-18 13:42:41 -08:00
Chris Lu	f44e25b422	fix(iam): ensure access key status is persisted and defaulted to Active (#8341 ) * Fix master leader election startup issue Fixes #error-log-leader-not-selected-yet * not useful test * fix(iam): ensure access key status is persisted and defaulted to Active * make pb * update tests * using constants	2026-02-13 20:28:41 -08:00
Chris Lu	b08bb8237c	Fix master leader election startup issue (#8340 ) * Fix master leader election startup issue Fixes #error-log-leader-not-selected-yet * Fix master leader election startup issue This change improves server address comparison using the 'Equals' method and handles recursion in topology leader lookup, resolving the 'leader not selected yet' error during master startup. * Merge user improvements: use MaybeLeader for non-blocking checks * not useful test * Address code review: optimize Equals, fix deadlock in IsLeader, safe access in Leader	2026-02-13 15:39:39 -08:00
Chris Lu	1e4f30c56f	pb: fix IPv6 double brackets in ServerAddress formatting (#8329 ) * pb: fix IPv6 double brackets in ServerAddress formatting * pb: refactor IPv6 tests into table-driven test * util: add JoinHostPortStr and use it in pb to avoid unsafe port parsing	2026-02-12 18:11:03 -08:00
Lisandro Pin	1a5679a5eb	Implement a `VolumeEcStatus()` RPC for volume servers. (#8006 ) Just like `VolumeStatus()`, this call allows inspecting details for a given EC volume - including number of files and their total size.	2026-02-09 11:52:08 -08:00
Chris Lu	a3b83f8808	test: add Trino Iceberg catalog integration test (#8228 ) * test: add Trino Iceberg catalog integration test - Create test/s3/catalog_trino/trino_catalog_test.go with TestTrinoIcebergCatalog - Tests integration between Trino SQL engine and SeaweedFS Iceberg REST catalog - Starts weed mini with all services and Trino in Docker container - Validates Iceberg catalog schema creation and listing operations - Uses native S3 filesystem support in Trino with path-style access - Add workflow job to s3-tables-tests.yml for CI execution * fix: preserve AWS environment credentials when replacing S3 configuration When S3 configuration is loaded from filer/db, it replaces the identities list and inadvertently removes AWS_ACCESS_KEY_ID credentials that were added from environment variables. This caused auth to remain disabled even though valid credentials were present. Fix by preserving environment-based identities when replacing the configuration and re-adding them after the replacement. This ensures environment credentials persist across configuration reloads and properly enable authentication. * fix: use correct ServerAddress format with gRPC port encoding The admin server couldn't connect to master because the master address was missing the gRPC port information. Use pb.NewServerAddress() which properly encodes both HTTP and gRPC ports in the address string. Changes: - weed/command/mini.go: Use pb.NewServerAddress for master address in admin - test/s3/policy/policy_test.go: Store and use gRPC ports for master/filer addresses This fix applies to: 1. Admin server connection to master (mini.go) 2. Test shell commands that need master/filer addresses (policy_test.go) * move * move * fix: always include gRPC port in server address encoding The NewServerAddress() function was omitting the gRPC port from the address string when it matched the port+10000 convention. However, gRPC port allocation doesn't always follow this convention - when the calculated port is busy, an alternative port is allocated. This caused a bug where: 1. Master's gRPC port was allocated as 50661 (sequential, not port+10000) 2. Address was encoded as '192.168.1.66:50660' (gRPC port omitted) 3. Admin client called ToGrpcAddress() which assumed port+10000 offset 4. Admin tried to connect to 60660 but master was on 50661 → connection failed Fix: Always include explicit gRPC port in address format (host:httpPort.grpcPort) unless gRPC port is 0. This makes addresses unambiguous and works regardless of the port allocation strategy used. Impacts: All server-to-server gRPC connections now use properly formatted addresses. * test: fix Iceberg REST API readiness check The Iceberg REST API endpoints require authentication. When checked without credentials, the API returns 403 Forbidden (not 401 Unauthorized). The readiness check now accepts both auth error codes (401/403) as indicators that the service is up and ready, it just needs credentials. This fixes the 'Iceberg REST API did not become ready' test failure. * Fix AWS SigV4 signature verification for base64-encoded payload hashes AWS SigV4 canonical requests must use hex-encoded SHA256 hashes, but the X-Amz-Content-Sha256 header may be transmitted as base64. Changes: - Added normalizePayloadHash() function to convert base64 to hex - Call normalizePayloadHash() in extractV4AuthInfoFromHeader() - Added encoding/base64 import Fixes 403 Forbidden errors on POST requests to Iceberg REST API when clients send base64-encoded content hashes in the header. Impacted services: Iceberg REST API, S3Tables * Fix AWS SigV4 signature verification for base64-encoded payload hashes AWS SigV4 canonical requests must use hex-encoded SHA256 hashes, but the X-Amz-Content-Sha256 header may be transmitted as base64. Changes: - Added normalizePayloadHash() function to convert base64 to hex - Call normalizePayloadHash() in extractV4AuthInfoFromHeader() - Added encoding/base64 import - Removed unused fmt import Fixes 403 Forbidden errors on POST requests to Iceberg REST API when clients send base64-encoded content hashes in the header. Impacted services: Iceberg REST API, S3Tables * pass sigv4 * s3api: fix identity preservation and logging levels - Ensure environment-based identities are preserved during config replacement - Update accessKeyIdent and nameToIdentity maps correctly - Downgrade informational logs to V(2) to reduce noise * test: fix trino integration test and s3 policy test - Pin Trino image version to 479 - Fix port binding to 0.0.0.0 for Docker connectivity - Fix S3 policy test hang by correctly assigning MiniClusterCtx - Improve port finding robustness in policy tests * ci: pre-pull trino image to avoid timeouts - Pull trinodb/trino:479 after Docker setup - Ensure image is ready before integration tests start * iceberg: remove unused checkAuth and improve logging - Remove unused checkAuth method - Downgrade informational logs to V(2) - Ensure loggingMiddleware uses a status writer for accurate reported codes - Narrow catch-all route to avoid interfering with other subsystems * iceberg: fix build failure by removing unused s3api import * Update iceberg.go * use warehouse * Update trino_catalog_test.go	2026-02-06 13:12:25 -08:00
Lisandro Pin	2cda4289f4	Add a version token on RPCs to read/update volume server states. (#8191 ) * Add a version token on `GetState()`/`SetState()` RPCs for volume server states. * Make state version a property ov `VolumeServerState` instead of an in-memory counter. Also extend state atomicity to reads, instead of just writes.	2026-02-06 10:58:43 -08:00
Lisandro Pin	9d751a7b61	Contrib/volume scrub local (#8226 )	2026-02-05 14:44:12 -08:00
Lisandro Pin	ff5a8f0579	Implement RPC skeleton for regular/EC volumes scrubbing. (#8187 ) * Implement RPC skeleton for regular/EC volumes scrubbing. See https://github.com/seaweedfs/seaweedfs/issues/8018 for details. * Minor proto improvements for `ScrubVolume()`, `ScrubEcVolume()`: - Add fields for scrubbing details in `ScrubVolumeResponse` and `ScrubEcVolumeResponse`, instead of reporting these through RPC errors. - Return a list of broken shards when scrubbing EC volumes, via `EcShardInfo'.	2026-02-02 17:55:04 -08:00
Lisandro Pin	345ac950b6	Add volume server RPCs to read and update state flags. (#8186 ) * Boostrap persistent state for volume servers. This PR implements logic load/save persistent state information for storages associated with volume servers, and reporting state changes back to masters via heartbeat messages. More work ensues! See https://github.com/seaweedfs/seaweedfs/issues/7977 for details. * Add volume server RPCs to read and update state flags.	2026-02-02 16:22:17 -08:00
Lisandro Pin	9638d37fe2	Block RPC write operations on volume servers when maintenance mode is enabled (#8115 ) * Boostrap persistent state for volume servers. This PR implements logic load/save persistent state information for storages associated with volume servers, and reporting state changes back to masters via heartbeat messages. More work ensues! See https://github.com/seaweedfs/seaweedfs/issues/7977 for details. * Block RPC operations writing to volume servers when maintenance mode is on.	2026-02-02 13:21:02 -08:00
Chris Lu	551a31e156	Implement IAM propagation to S3 servers (#8130 ) * Implement IAM propagation to S3 servers - Add PropagatingCredentialStore to propagate IAM changes to S3 servers via gRPC - Add Policy management RPCs to S3 proto and S3ApiServer - Update CredentialManager to use PropagatingCredentialStore when MasterClient is available - Wire FilerServer to enable propagation * Implement parallel IAM propagation and fix S3 cluster registration - Parallelized IAM change propagation with 10s timeout. - Refined context usage in PropagatingCredentialStore. - Added S3Type support to cluster node management. - Enabled S3 servers to register with gRPC address to the master. - Ensured IAM configuration reload after policy updates via gRPC. * Optimize IAM propagation with direct in-memory cache updates * Secure IAM propagation: Use metadata to skip persistence only on propagation * pb: refactor IAM and S3 services for unidirectional IAM propagation - Move SeaweedS3IamCache service from iam.proto to s3.proto. - Remove legacy IAM management RPCs and empty SeaweedS3 service from s3.proto. - Enforce that S3 servers only use the synchronization interface. * pb: regenerate Go code for IAM and S3 services Updated generated code following the proto refactoring of IAM synchronization services. * s3api: implement read-only mode for Embedded IAM API - Add readOnly flag to EmbeddedIamApi to reject write operations via HTTP. - Enable read-only mode by default in S3ApiServer. - Handle AccessDenied error in writeIamErrorResponse. - Embed SeaweedS3IamCacheServer in S3ApiServer. * credential: refactor PropagatingCredentialStore for unidirectional IAM flow - Update to use s3_pb.SeaweedS3IamCacheClient for propagation to S3 servers. - Propagate full Identity object via PutIdentity for consistency. - Remove redundant propagation of specific user/account/policy management RPCs. - Add timeout context for propagation calls. * s3api: implement SeaweedS3IamCacheServer for unidirectional sync - Update S3ApiServer to implement the cache synchronization gRPC interface. - Methods (PutIdentity, RemoveIdentity, etc.) now perform direct in-memory cache updates. - Register SeaweedS3IamCacheServer in command/s3.go. - Remove registration for the legacy and now empty SeaweedS3 service. * s3api: update tests for read-only IAM and propagation - Added TestEmbeddedIamReadOnly to verify rejection of write operations in read-only mode. - Update test setup to pass readOnly=false to NewEmbeddedIamApi in routing tests. - Updated EmbeddedIamApiForTest helper with read-only checks matching production behavior. * s3api: add back temporary debug logs for IAM updates Log IAM updates received via: - gRPC propagation (PutIdentity, PutPolicy, etc.) - Metadata configuration reloads (LoadS3ApiConfigurationFromCredentialManager) - Core identity management (UpsertIdentity, RemoveIdentity) * IAM: finalize propagation fix with reduced logging and clarified architecture * Allow configuring IAM read-only mode for S3 server integration tests * s3api: add defensive validation to UpsertIdentity * s3api: fix log message to reference correct IAM read-only flag * test/s3/iam: ensure WaitForS3Service checks for IAM write permissions * test: enable writable IAM in Makefile for integration tests * IAM: add GetPolicy/ListPolicies RPCs to s3.proto * S3: add GetBucketPolicy and ListBucketPolicies helpers * S3: support storing generic IAM policies in IdentityAccessManagement * S3: implement IAM policy RPCs using IdentityAccessManagement * IAM: fix stale user identity on rename propagation	2026-01-26 22:59:43 -08:00
Chris Lu	43229b05ce	Explicit IAM gRPC APIs for S3 Server (#8126 ) * Update IAM and S3 protobuf definitions for explicit IAM gRPC APIs * Refactor s3api: Extract generic ExecuteAction method for IAM operations * Implement explicit IAM gRPC APIs in S3 server * iam: remove deprecated GetConfiguration and PutConfiguration RPCs * iamapi: refactor handlers to use CredentialManager directly * s3api: refactor embedded IAM to use CredentialManager directly * server: remove deprecated configuration gRPC handlers * credential/grpc: refactor configuration calls to return error * shell: update s3.configure to list users instead of full config * s3api: fix CreateServiceAccount gRPC handler to map required fields * s3api: fix UpdateServiceAccount gRPC handler to map fields and safe status * s3api: enforce UserName in embedded IAM ListAccessKeys * test: fix test_config.json structure to match proto definition * Revert "credential/grpc: refactor configuration calls to return error" This reverts commit cde707dd8b88c7d1bd730271518542eceb5ed069. * Revert "server: remove deprecated configuration gRPC handlers" This reverts commit 7307e205a083c8315cf84ddc2614b3e50eda2e33. * Revert "s3api: enforce UserName in embedded IAM ListAccessKeys" This reverts commit adf727ba52b4f3ffb911f0d0df85db858412ff83. * Revert "s3api: fix UpdateServiceAccount gRPC handler to map fields and safe status" This reverts commit 6a4be3314d43b6c8fda8d5e0558e83e87a19df3f. * Revert "s3api: fix CreateServiceAccount gRPC handler to map required fields" This reverts commit 9bb4425f07fbad38fb68d33e5c0aa573d8912a37. * Revert "shell: update s3.configure to list users instead of full config" This reverts commit f3304ead537b3e6be03d46df4cb55983ab931726. * Revert "s3api: refactor embedded IAM to use CredentialManager directly" This reverts commit 9012f27af82d11f0e824877712a5ae2505a65f86. * Revert "iamapi: refactor handlers to use CredentialManager directly" This reverts commit 3a148212236576b0a3aa4d991c2abb014fb46091. * Revert "iam: remove deprecated GetConfiguration and PutConfiguration RPCs" This reverts commit e16e08aa0099699338d3155bc7428e1051ce0a6a. * s3api: address IAM code review comments (error handling, logging, gRPC response mapping) * s3api: add robustness to startup by retrying KEK and IAM config loading from Filer * s3api: address IAM gRPC code review comments (safety, validation, status logic) * fix return	2026-01-26 13:38:15 -08:00
Chris Lu	5a7c74feac	migrate IAM policies to multi-file storage (#8114 ) * Add IAM gRPC service definition - Add GetConfiguration/PutConfiguration for config management - Add CreateUser/GetUser/UpdateUser/DeleteUser/ListUsers for user management - Add CreateAccessKey/DeleteAccessKey/GetUserByAccessKey for access key management - Methods mirror existing IAM HTTP API functionality * Add IAM gRPC handlers on filer server - Implement IamGrpcServer with CredentialManager integration - Handle configuration get/put operations - Handle user CRUD operations - Handle access key create/delete operations - All methods delegate to CredentialManager for actual storage * Wire IAM gRPC service to filer server - Add CredentialManager field to FilerOption and FilerServer - Import credential store implementations in filer command - Initialize CredentialManager from credential.toml if available - Register IAM gRPC service on filer gRPC server - Enable credential management via gRPC alongside existing filer services * Regenerate IAM protobuf with gRPC service methods * fix: compilation error in DeleteUser * fix: address code review comments for IAM migration * feat: migrate policies to multi-file layout and fix identity duplicated content * refactor: remove configuration.json and migrate Service Accounts to multi-file layout * refactor: standardize Service Accounts as distinct store entities and fix Admin Server persistence * config: set ServiceAccountsDirectory to /etc/iam/service_accounts * Fix Chrome dialog auto-dismiss with Bootstrap modals - Add modal-alerts.js library with Bootstrap modal replacements - Replace all 15 confirm() calls with showConfirm/showDeleteConfirm - Auto-override window.alert() for all alert() calls - Fixes Chrome 132+ aggressively blocking native dialogs * Upgrade Bootstrap from 5.3.2 to 5.3.8 * Fix syntax error in object_store_users.templ - remove duplicate closing braces * create policy * display errors * migrate to multi-file policies * address PR feedback: use showDeleteConfirm and showErrorMessage in policies.templ, refine migration check * Update policies_templ.go * add service account to iam grpc * iam: fix potential path traversal in policy names by validating name pattern * iam: add GetServiceAccountByAccessKey to CredentialStore interface * iam: implement service account support for PostgresStore Includes full CRUD operations and efficient lookup by access key. * iam: implement GetServiceAccountByAccessKey for filer_etc, grpc, and memory stores Provides efficient lookup of service accounts by access key where possible, with linear scan fallbacks for file-based stores. * iam: remove filer_multiple support Deleted its implementation and references in imports, scaffold config, and core interface constants. Redundant with filer_etc. * clear comment * dash: robustify service account construction - Guard against nil sa.Credential when constructing responses - Fix Expiration logic to only set if > 0, avoiding Unix epoch 1970 - Ensure consistency across Get, Create, and Update handlers * credential/filer_etc: improve error propagation in configuration handlers - Return error from loadServiceAccountsFromMultiFile to callers - Ensure listEntries errors in SaveConfiguration (cleanup logic) are propagated unless they are "not found" failures. - Fixes potential silent failures during IAM configuration sync. * credential/filer_etc: add existence check to CreateServiceAccount Ensures consistency with other stores by preventing accidental overwrite of existing service accounts during creation. * credential/memory: improve store robustness and Reset logic - Enforce ID immutability in UpdateServiceAccount to prevent orphans - Update Reset() to also clear the policies map, ensuring full state cleanup for tests. * dash: improve service account robustness and policy docs - Wrap parent user lookup errors to preserve context - Strictly validate Status field in UpdateServiceAccount - Add deprecation comments to legacy policy management methods * credential/filer_etc: protect against path traversal in service accounts Implemented ID validation (alphanumeric, underscores, hyphens) and applied it to Get, Save, and Delete operations to ensure no directory traversal via saId.json filenames. * credential/postgres: improve robustness and cleanup comments - Removed brainstorming comments in GetServiceAccountByAccessKey - Added missing rows.Err() check during iteration - Properly propagate Scan and Unmarshal errors instead of swallowing them * admin: unify UI alerts and confirmations using Bootstrap modals - Updated modal-alerts.js with improved automated alert type detection - Replaced native alert() and confirm() with showAlert(), showConfirm(), and showDeleteConfirm() across various Templ components - Improved UX for delete operations by providing better context and styling - Ensured consistent error reporting across IAM and Maintenance views * admin: additional UI consistency fixes for alerts and confirmations - Replaced native alert() and confirm() with Bootstrap modals in: - EC volumes (repair flow) - Collection details (repair flow) - File browser (properties and delete) - Maintenance config schema (save and reset) - Improved delete confirmation in file browser with item context - Ensured consistent success/error/info styling for all feedbacks * make * iam: add GetServiceAccountByAccessKey RPC and update GetConfiguration * iam: implement GetServiceAccountByAccessKey on server and client * iam: centralize policy and service account validation * iam: optimize MemoryStore service account lookups with indexing * iam: fix postgres service_accounts table and optimize lookups * admin: refactor modal alerts and clean up dashboard logic * admin: fix EC shards table layout mismatch * admin: URL-encode IAM path parameters for safety * admin: implement pauseWorker logic in maintenance view * iam: add rows.Err() check to postgres ListServiceAccounts * iam: standardize ErrServiceAccountNotFound across credential stores * iam: map ErrServiceAccountNotFound to codes.NotFound in DeleteServiceAccount * iam: refine service account store logic, errors and schema * iam: add validation to GetServiceAccountByAccessKey * admin: refine modal titles and ensure URL safety * admin: address bot review comments for alerts and async usage * iam: fix syntax error by restoring missing function declaration * [FilerEtcStore] improve error handling in CreateServiceAccount Refine error handling to provide clearer messages when checking for existing service accounts. * [PostgresStore] add nil guards and validation to service account methods Ensure input parameters are not nil and required IDs are present to prevent runtime panics and ensure data integrity. * [JS] add shared IAM utility script Consolidate common IAM operations like deleteUser and deleteAccessKey into a shared utility script for better maintainability. * [View] include shared IAM utilities in layout Include iam-utils.js in the main layout to make IAM functions available across all administrative pages. * [View] refactor IAM logic and restore async in EC Shards view Remove redundant local IAM functions and ensure that delete confirmation callbacks are properly marked as async. * [View] consolidate IAM logic in Object Store Users view Remove redundant local definitions of deleteUser and deleteAccessKey, relying on the shared utilities instead. * [View] update generated templ files for UI consistency * credential/postgres: remove redundant name column from service_accounts table The id is already used as the unique identifier and was being copied to the name column. This removes the name column from the schema and updates the INSERT/UPDATE queries. * credential/filer_etc: improve logging for policy migration failures Added Errorf log if AtomicRenameEntry fails during migration to ensure visibility of common failure points. * credential: allow uppercase characters in service account ID username Updated ServiceAccountIdPattern to allow [A-Za-z0-9_-]+ for the username component, matching the actual service account creation logic which uses the parent user name directly. * Update object_store_users_templ.go * admin: fix ec_shards pagination to handle numeric page arguments Updated goToPage in cluster_ec_shards.templ to accept either an Event or a numeric page argument. This prevents errors when goToPage(1) is called directly. Corrected both the .templ source and generated Go code. * credential/filer_etc: improve service account storage robustness Added nil guard to saveServiceAccount, updated GetServiceAccount to return ErrServiceAccountNotFound for empty data, and improved deleteServiceAccount to handle response-level Filer errors.	2026-01-26 11:28:23 -08:00
Chris Lu	6bf088cec9	IAM Policy Management via gRPC (#8109 ) * Add IAM gRPC service definition - Add GetConfiguration/PutConfiguration for config management - Add CreateUser/GetUser/UpdateUser/DeleteUser/ListUsers for user management - Add CreateAccessKey/DeleteAccessKey/GetUserByAccessKey for access key management - Methods mirror existing IAM HTTP API functionality * Add IAM gRPC handlers on filer server - Implement IamGrpcServer with CredentialManager integration - Handle configuration get/put operations - Handle user CRUD operations - Handle access key create/delete operations - All methods delegate to CredentialManager for actual storage * Wire IAM gRPC service to filer server - Add CredentialManager field to FilerOption and FilerServer - Import credential store implementations in filer command - Initialize CredentialManager from credential.toml if available - Register IAM gRPC service on filer gRPC server - Enable credential management via gRPC alongside existing filer services * Regenerate IAM protobuf with gRPC service methods * iam_pb: add Policy Management to protobuf definitions * credential: implement PolicyManager in credential stores * filer: implement IAM Policy Management RPCs * shell: add s3.policy command * test: add integration test for s3.policy * test: fix compilation errors in policy_test * pb * fmt * test * weed shell: add -policies flag to s3.configure This allows linking/unlinking IAM policies to/from identities directly from the s3.configure command. * test: verify s3.configure policy linking and fix port allocation - Added test case for linking policies to users via s3.configure - Implemented findAvailablePortPair to ensure HTTP and gRPC ports are both available, avoiding conflicts with randomized port assignments. - Updated assertion to match jsonpb output (policyNames) * credential: add StoreTypeGrpc constant * credential: add IAM gRPC store boilerplate * credential: implement identity methods in gRPC store * credential: implement policy methods in gRPC store * admin: use gRPC credential store for AdminServer This ensures that all IAM and policy changes made through the Admin UI are persisted via the Filer's IAM gRPC service instead of direct file manipulation. * shell: s3.configure use granular IAM gRPC APIs instead of full config patching * shell: s3.configure use granular IAM gRPC APIs * shell: replace deprecated ioutil with os in s3.policy * filer: use gRPC FailedPrecondition for unconfigured credential manager * test: improve s3.policy integration tests and fix error checks * ci: add s3 policy shell integration tests to github workflow * filer: fix LoadCredentialConfiguration error handling * credential/grpc: propagate unmarshal errors in GetPolicies * filer/grpc: improve error handling and validation * shell: use gRPC status codes in s3.configure * credential: document PutPolicy as create-or-replace * credential/postgres: reuse CreatePolicy in PutPolicy to deduplicate logic * shell: add timeout context and strictly enforce flags in s3.policy * iam: standardize policy content field naming in gRPC and proto * shell: extract slice helper functions in s3.configure * filer: map credential store errors to gRPC status codes * filer: add input validation for UpdateUser and CreateAccessKey * iam: improve validation in policy and config handlers * filer: ensure IAM service registration by defaulting credential manager * credential: add GetStoreName method to manager * test: verify policy deletion in integration test	2026-01-25 13:39:30 -08:00
Lisandro Pin	59d40f7186	Return volume server state flags via `VolumeServerStatus()` RPCs. (#8016 )	2026-01-24 21:45:23 -08:00
Chris Lu	753e1db096	Prevent split-brain: Persistent ClusterID and Join Validation (#8022 ) * Prevent split-brain: Persistent ClusterID and Join Validation - Persist ClusterId in Raft store to survive restarts. - Validate ClusterId on Raft command application (piggybacked on MaxVolumeId). - Prevent masters with conflicting ClusterIds from joining/operating together. - Update Telemetry to report the persistent ClusterId. * Refine ClusterID validation based on feedback - Improved error message in cluster_commands.go. - Added ClusterId mismatch check in RaftServer.Recovery. * Handle Raft errors and support Hashicorp Raft for ClusterId - Check for errors when persisting ClusterId in legacy Raft. - Implement ClusterId generation and persistence for Hashicorp Raft leader changes. - Ensure consistent error logging. * Refactor ClusterId validation - Centralize ClusterId mismatch check in Topology.SetClusterId. - Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId. * Fix goroutine leak and add timeout - Handle channel closure in Hashicorp Raft leader listener. - Add timeout to Raft Apply call to prevent blocking. * Fix deadlock in legacy Raft listener - Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock). * Rename ClusterId to SystemId - Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry). - Regenerated telemetry.pb.go with new field. * Rename SystemId to TopologyId - Rename to SystemId was intermediate step. - Final name is TopologyId for the persistent cluster identifier. - Updated protobuf, topology, raft server, master server, and telemetry. * Optimize Hashicorp Raft listener - Integrated TopologyId generation into existing monitorLeaderLoop. - Removed extra goroutine in master_server.go. * Fix optimistic TopologyId update - Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go. - State is now solely updated via the Raft state machine Apply/Restore methods after consensus. * Add explicit log for recovered TopologyId - Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup. * Add Raft barrier to prevent TopologyId race condition - Implement ensureTopologyId helper method - Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId - Ensures persisted TopologyId is recovered before generating new one - Prevents race where generation happens during log replay * Serialize TopologyId generation with mutex - Add topologyIdGenLock mutex to MasterServer struct - Wrap ensureTopologyId method with lock to prevent concurrent generation - Fixes race where event listener and manual leadership check both generate IDs - Second caller waits for first to complete and sees the generated ID * Add TopologyId recovery logging to Apply method - Change log level from V(1) to V(0) for visibility - Log 'Recovered TopologyId' when applying from Raft log - Ensures recovery is visible whether from snapshot or log replay - Matches Recovery() method logging for consistency * Fix Raft barrier timing issue - Add 100ms delay after barrier command to ensure log application completes - Add debug logging to track barrier execution and TopologyId state - Return early if barrier command fails - Prevents TopologyId generation before old logs are fully applied * ensure leader * address comments * address comments * redundant * clean up * double check * refactoring * comment	2026-01-18 14:02:34 -08:00
Jaehoon Kim	f2e7af257d	Fix volume.fsck -forcePurging -reallyDeleteFromVolume to fail fast on filer traversal errors (#8015 ) * Add TraverseBfsWithContext and fix race conditions in error handling - Add TraverseBfsWithContext function to support context cancellation - Fix race condition in doTraverseBfsAndSaving using atomic.Bool and sync.Once - Improve error handling with fail-fast behavior and proper error propagation - Update command_volume_fsck to use error-returning saveFn callback - Enhance error messages in readFilerFileIdFile with detailed context * refactoring * fix error format * atomic * filer_pb: make enqueue return void * shell: simplify fs.meta.save error handling * filer_pb: handle enqueue return value * Revert "atomic" This reverts commit 712648bc354b186d6654fdb8a46fd4848fdc4e00. * shell: refine fs.meta.save logic --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-01-14 21:37:50 -08:00
Lisandro Pin	2af293ce60	Boostrap persistent state for volume servers. (#7984 ) This PR implements logic load/save persistent state information for storages associated with volume servers, and reporting state changes back to masters via heartbeat messages. More work ensues! See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.	2026-01-12 10:49:59 -08:00
promalert	9012069bd7	chore: execute goimports to format the code (#7983 ) * chore: execute goimports to format the code Signed-off-by: promalert <promalert@outlook.com> * goimports -w . --------- Signed-off-by: promalert <promalert@outlook.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-01-07 13:06:08 -08:00
Chris Lu	e67973dc53	Support Policy Attachment for Object Store Users (#7981 ) * Implement Policy Attachment support for Object Store Users - Added policy_names field to iam.proto and regenerated protos. - Updated S3 API and IAM integration to support direct policy evaluation for users. - Enhanced Admin UI to allow attaching policies to users via modals. - Renamed 'policies' to 'policy_names' to clarify that it stores identifiers. - Fixed syntax error in user_management.go. * Fix policy dropdown not populating The API returns {policies: [...]} but JavaScript was treating response as direct array. Updated loadPolicies() to correctly access data.policies property. * Add null safety checks for policy dropdowns Added checks to prevent "undefined" errors when: - Policy select elements don't exist - Policy dropdowns haven't loaded yet - User is being edited before policies are loaded * Fix policy dropdown by using correct JSON field name JSON response has lowercase 'name' field but JavaScript was accessing 'Name'. Changed policy.Name to policy.name to match the IAMPolicy JSON structure. * Fix policy names not being saved on user update Changed condition from len(req.PolicyNames) > 0 to req.PolicyNames != nil to ensure policy names are always updated when present in the request, even if it's an empty array (to allow clearing policies). * Add debug logging for policy names update flow Added console.log in frontend and glog in backend to trace policy_names data through the update process. * Temporarily disable auto-reload for debugging Commented out window.location.reload() so console logs are visible when updating a user. * Add detailed debug logging and alert for policy selection Added console.log for each step and an alert to show policy_names value to help diagnose why it's not being included in the request. * Regenerate templ files for object_store_users Ran templ generate to ensure _templ.go files are up to date with the latest .templ changes including debug logging. * Remove debug logging and restore normal functionality Cleaned up temporary debug code (console.log and alert statements) and re-enabled automatic page reload after user update. * Add step-by-step alert debugging for policy update Added 5 alert checkpoints to trace policy data through the update flow: 1. Check if policiesSelect element exists 2. Show selected policy values 3. Show userData.policy_names 4. Show full request body 5. Confirm server response Temporarily disabled auto-reload to see alerts. * Add version check alert on page load Added alert on DOMContentLoaded to verify new JavaScript is being executed and not cached by the browser. * Compile templates using make Ran make to compile all template files and install the weed binary. * Add button click detection and make handleUpdateUser global - Added inline alert on button click to verify click is detected - Made handleUpdateUser a window-level function to ensure it's accessible - Added alert at start of handleUpdateUser function * Fix handleUpdateUser scope issue - remove duplicate definition Removed duplicate function definition that was inside DOMContentLoaded. Now handleUpdateUser is defined only once in global scope (line 383) making it accessible when button onclick fires. * Remove all duplicate handleUpdateUser definitions Now handleUpdateUser is defined only once at the very top of the script block (line 352), before DOMContentLoaded, ensuring it's available when the button onclick fires. * Add function existence check and error catching Added alerts to check if handleUpdateUser is defined and wrapped the function call in try-catch to capture any JavaScript errors. Also added console.log statements to verify function definition. * Simplify handleUpdateUser to non-async for testing Removed async/await and added early return to test if function can be called at all. This will help identify if async is causing the issue. * Add cache-control headers to prevent browser caching Added no-cache headers to ShowObjectStoreUsers handler to prevent aggressive browser caching of inline JavaScript in the HTML page. * Fix syntax error - make handleUpdateUser async Changed function back to async to fix 'await is only valid in async functions' error. The cache-control headers are working - browser is now loading new code. * Update version check to v3 to verify cache busting Changed version alert to 'v3 - WITH EARLY RETURN' to confirm the new code with early return statement is being loaded. * Remove all debug code - clean implementation Removed all alerts, console.logs, and test code. Implemented clean policy update functionality with proper error handling. * Add ETag header for cache-busting and update walkthrough * Fix policy pre-selection in Edit User modal - Updated admin.js editUser function to pre-select policies - Root cause: duplicate editUser in admin.js overwrote inline version - Added policy pre-selection logic to match inline template - Verified working in browser: policies now pre-select correctly * Fix policy persistence in handleUpdateUser - Added policy_names field to userData payload in handleUpdateUser - Policies were being lost because handleUpdateUser only sent email and actions - Now collects selected policies from editPolicies dropdown - Verified working: policies persist correctly across updates * Fix XSS vulnerability in access keys display - Escape HTML in access key display using escapeHtml utility - Replace inline onclick handlers with data attributes - Add event delegation for delete access key buttons - Prevents script injection via malicious access key values * Fix additional XSS vulnerabilities in user details display - Escape HTML in actions badges (line 626) - Escape HTML in policy_names badges (line 636) - Prevents script injection via malicious action or policy names * Fix XSS vulnerability in loadPolicies function - Replace innerHTML string concatenation with DOM API - Use createElement and textContent for safe policy name insertion - Prevents script injection via malicious policy names - Apply same pattern to both create and edit select elements * Remove debug logging from UpdateObjectStoreUser - Removed glog.V(0) debug statements - Clean up temporary debugging code before production * Remove duplicate handleUpdateUser function - Removed inline handleUpdateUser that duplicated admin.js logic - Removed debug console.log statement - admin.js version is now the single source of truth - Eliminates maintenance burden of keeping two versions in sync * Refine user management and address code review feedback - Preserve PolicyNames in UpdateUserPolicies - Allow clearing actions in UpdateObjectStoreUser by checking for nil - Remove version comment from object_store_users.templ - Refactor loadPolicies for DRYness using cloneNode while keeping DOM API security * IAM Authorization for Static Access Keys * verified XSS Fixes in Templates * fix div	2026-01-06 21:53:28 -08:00
Chris Lu	31a4f57cd9	Fix: Add -admin.grpc flag to worker for explicit gRPC port (#7926 ) (#7927 ) * Fix: Add -admin.grpc flag to worker for explicit gRPC port configuration * Fix(helm): Add adminGrpcServer to worker configuration * Refactor: Support host:port.grpcPort address format, revert -admin.grpc flag * Helm: Conditionally append grpcPort to worker admin address * weed/admin: fix "send on closed channel" panic in worker gRPC server Make unregisterWorker connection-aware to prevent closing channels belonging to newer connections. * weed/worker: improve gRPC client stability and logging - Fix goroutine leak in reconnection logic - Refactor reconnection loop to exit on success and prevent busy-waiting - Add session identification and enhanced logging to client handlers - Use constant for internal reset action and remove unused variables * weed/worker: fix worker state initialization and add lifecycle logs - Revert workerState to use running boolean correctly - Prevent handleStart failing by checking running state instead of startTime - Add more detailed logs for worker startup events	2025-12-31 11:55:09 -08:00
Chris Lu	ae9a943ef6	IAM: Add Service Account Support (#7744 ) (#7901 ) * iam: add ServiceAccount protobuf schema Add ServiceAccount message type to iam.proto with support for: - Unique ID and parent user linkage - Optional expiration timestamp - Separate credentials (access key/secret) - Action restrictions (subset of parent) - Enable/disable status This is the first step toward implementing issue #7744 (IAM Service Account Support). * iam: add service account response types Add IAM API response types for service account operations: - ServiceAccountInfo struct for marshaling account details - CreateServiceAccountResponse - DeleteServiceAccountResponse - ListServiceAccountsResponse - GetServiceAccountResponse - UpdateServiceAccountResponse Also add type aliases in iamapi package for backwards compatibility. Part of issue #7744 (IAM Service Account Support). * iam: implement service account API handlers Add CRUD operations for service accounts: - CreateServiceAccount: Creates service account with ABIA key prefix - DeleteServiceAccount: Removes service account and parent linkage - ListServiceAccounts: Lists all or filtered by parent user - GetServiceAccount: Retrieves service account details - UpdateServiceAccount: Modifies status, description, expiration Service accounts inherit parent user's actions by default and support optional expiration timestamps. Part of issue #7744 (IAM Service Account Support). * sts: add AssumeRoleWithWebIdentity HTTP endpoint Add STS API HTTP endpoint for AWS SDK compatibility: - Create s3api_sts.go with HTTP handlers matching AWS STS spec - Support AssumeRoleWithWebIdentity action with JWT token - Return XML response with temporary credentials (AccessKeyId, SecretAccessKey, SessionToken) matching AWS format - Register STS route at POST /?Action=AssumeRoleWithWebIdentity This enables AWS SDKs (boto3, AWS CLI, etc.) to obtain temporary S3 credentials using OIDC/JWT tokens. Part of issue #7744 (IAM Service Account Support). * test: add service account and STS integration tests Add integration tests for new IAM features: s3_service_account_test.go: - TestServiceAccountLifecycle: Create, Get, List, Update, Delete - TestServiceAccountValidation: Error handling for missing params s3_sts_test.go: - TestAssumeRoleWithWebIdentityValidation: Parameter validation - TestAssumeRoleWithWebIdentityWithMockJWT: JWT token handling Tests skip gracefully when SeaweedFS is not running or when IAM features are not configured. Part of issue #7744 (IAM Service Account Support). * iam: address code review comments - Add constants for service account ID and key lengths - Use strconv.ParseInt instead of fmt.Sscanf for better error handling - Allow clearing descriptions by checking key existence in url.Values - Replace magic numbers (12, 20, 40) with named constants Addresses review comments from gemini-code-assist[bot] * test: add proper error handling in service account tests Use require.NoError(t, err) for io.ReadAll and xml.Unmarshal to prevent silent failures and ensure test reliability. Addresses review comment from gemini-code-assist[bot] * test: add proper error handling in STS tests Use require.NoError(t, err) for io.ReadAll and xml.Unmarshal to prevent silent failures and ensure test reliability. Repeated this fix throughout the file. Addresses review comment from gemini-code-assist[bot] in PR #7901. * iam: address additional code review comments - Specific error code mapping for STS service errors - Distinguish between Sender and Receiver error types in STS responses - Add nil checks for credentials in List/GetServiceAccount - Validate expiration date is in the future - Improve integration test error messages (include response body) - Add credential verification step in service account tests Addresses remaining review comments from gemini-code-assist[bot] across multiple files. * iam: fix shared slice reference in service account creation Copy parent's actions to create an independent slice for the service account instead of sharing the underlying array. This prevents unexpected mutations when the parent's actions are modified later. Addresses review comment from coderabbitai[bot] in PR #7901. * iam: remove duplicate unused constant Removed redundant iamServiceAccountKeyPrefix as ServiceAccountKeyPrefix is already defined and used. Addresses remaining cleanup task. * sts: document limitation of string-based error mapping Added TODO comment explaining that the current string-based error mapping approach is fragile and should be replaced with typed errors from the STS service in a future refactoring. This addresses the architectural concern raised in code review while deferring the actual implementation to a separate PR to avoid scope creep in the current service account feature addition. * iam: fix remaining review issues - Add future-date validation for expiration in UpdateServiceAccount - Reorder tests so credential verification happens before deletion - Fix compilation error by using correct JWT generation methods Addresses final review comments from coderabbitai[bot]. * iam: fix service account access key length The access key IDs were incorrectly generated with 24 characters instead of the AWS-standard 20 characters. This was caused by generating 20 random characters and then prepending the 4-character ABIA prefix. Fixed by subtracting the prefix length from AccessKeyLength, so the final key is: ABIA (4 chars) + random (16 chars) = 20 chars total. This ensures compatibility with S3 clients that validate key length. * test: add comprehensive service account security tests Added comprehensive integration tests for service account functionality: - TestServiceAccountS3Access: Verify SA credentials work for S3 operations - TestServiceAccountExpiration: Test expiration date validation and enforcement - TestServiceAccountInheritedPermissions: Verify parent-child relationship - TestServiceAccountAccessKeyFormat: Validate AWS-compatible key format (ABIA prefix, 20 char length) These tests ensure SeaweedFS service accounts are compatible with AWS conventions and provide robust security coverage. * iam: remove unused UserAccessKeyPrefix constant Code cleanup to remove unused constants. * iam: remove unused iamCommonResponse type alias Code cleanup to remove unused type aliases. * iam: restore and use UserAccessKeyPrefix constant Restored UserAccessKeyPrefix constant and updated s3api tests to use it instead of hardcoded strings for better maintainability and consistency. * test: improve error handling in service account security tests Added explicit error checking for io.ReadAll and xml.Unmarshal in TestServiceAccountExpiration to ensure failures are reported correctly and cleanup is performed only when appropriate. Also added logging for failed responses. * test: use t.Cleanup for reliable resource cleanup Replaced defer with t.Cleanup to ensure service account cleanup runs even when require.NoError fails. Also switched from manual error checking to require.NoError for more idiomatic testify usage. * iam: add CreatedBy field and optimize identity lookups - Added createdBy parameter to CreateServiceAccount to track who created each service account - Extract creator identity from request context using GetIdentityNameFromContext - Populate created_by field in ServiceAccount protobuf - Added findIdentityByName helper function to optimize identity lookups - Replaced nested loops with O(n) helper function calls in CreateServiceAccount and DeleteServiceAccount This addresses code review feedback for better auditing and performance. * iam: prevent user deletion when service accounts exist Following AWS IAM behavior, prevent deletion of users that have active service accounts. This ensures explicit cleanup and prevents orphaned service account resources with invalid ParentUser references. Users must delete all associated service accounts before deleting the parent user, providing safer resource management. * sts: enhance TODO with typed error implementation guidance Updated TODO comment with detailed implementation approach for replacing string-based error matching with typed errors using errors.Is(). This provides a clear roadmap for a follow-up PR to improve error handling robustness and maintainability. * iam: add operational limits for service account creation Added AWS IAM-compatible safeguards to prevent resource exhaustion: - Maximum 100 service accounts per user (LimitExceededException) - Maximum 1000 character description length (InvalidInputException) These limits prevent accidental or malicious resource exhaustion while not impacting legitimate use cases. * iam: add missing operational limit constants Added MaxServiceAccountsPerUser and MaxDescriptionLength constants that were referenced in the previous commit but not defined. * iam: enforce service account expiration during authentication CRITICAL SECURITY FIX: Expired service account credentials were not being rejected during authentication, allowing continued access after expiration. Changes: - Added Expiration field to Credential struct - Populate expiration when loading service accounts from configuration - Check expiration in all authentication paths (V2 and V4 signatures) - Return ErrExpiredToken for expired credentials This ensures expired service accounts are properly rejected at authentication time, matching AWS IAM behavior and preventing unauthorized access. * iam: fix error code for expired service account credentials Use ErrAccessDenied instead of non-existent ErrExpiredToken for expired service account credentials. This provides appropriate access denial for expired credentials while maintaining AWS-compatible error responses. * iam: fix remaining ErrExpiredToken references Replace all remaining instances of non-existent ErrExpiredToken with ErrAccessDenied for expired service account credentials. * iam: apply AWS-standard key format to user access keys Updated CreateAccessKey to generate AWS-standard 20-character access keys with AKIA prefix for regular users, matching the format used for service accounts. This ensures consistency across all access key types and full AWS compatibility. - Access keys: AKIA + 16 random chars = 20 total (was 21 chars, no prefix) - Secret keys: 40 random chars (was 42, now matches AWS standard) - Uses AccessKeyLength and UserAccessKeyPrefix constants * sts: replace fragile string-based error matching with typed errors Implemented robust error handling using typed errors and errors.Is() instead of fragile strings.Contains() matching. This decouples the HTTP layer from service implementation details and prevents errors from being miscategorized if error messages change. Changes: - Added typed error variables to weed/iam/sts/constants.go: * ErrTypedTokenExpired * ErrTypedInvalidToken * ErrTypedInvalidIssuer * ErrTypedInvalidAudience * ErrTypedMissingClaims - Updated STS service to wrap provider authentication errors with typed errors - Replaced strings.Contains() with errors.Is() in HTTP layer for error checking - Removed TODO comment as the improvement is now implemented This makes error handling more maintainable and reliable. * sts: eliminate all string-based error matching with provider-level typed errors Completed the typed error implementation by adding provider-level typed errors and updating provider implementations to return them. This eliminates ALL fragile string matching throughout the entire error handling stack. Changes: - Added typed error definitions to weed/iam/providers/errors.go: * ErrProviderTokenExpired * ErrProviderInvalidToken * ErrProviderInvalidIssuer * ErrProviderInvalidAudience * ErrProviderMissingClaims - Updated OIDC provider to wrap JWT validation errors with typed provider errors - Replaced strings.Contains() with errors.Is() in STS service for error mapping - Complete error chain: Provider -> STS -> HTTP layer, all using errors.Is() This provides: - Reliable error classification independent of error message content - Type-safe error checking throughout the stack - No order-dependent string matching - Maintainable error handling that won't break with message changes * oidc: use jwt.ErrTokenExpired instead of string matching Replaced the last remaining string-based error check with the JWT library's exported typed error. This makes the error detection independent of error message content and more robust against library updates. Changed from: strings.Contains(errMsg, "expired") To: errors.Is(err, jwt.ErrTokenExpired) This completes the elimination of ALL string-based error matching throughout the entire authentication stack. * iam: add description length validation to UpdateServiceAccount Fixed inconsistency where UpdateServiceAccount didn't validate description length against MaxDescriptionLength, allowing operational limits to be bypassed during updates. Now validates that updated descriptions don't exceed 1000 characters, matching the validation in CreateServiceAccount. * iam: refactor expiration check into helper method Extracted duplicated credential expiration check logic into a helper method to reduce code duplication and improve maintainability. Added Credential.isCredentialExpired() method and replaced 5 instances of inline expiration checks across auth_signature_v2.go and auth_signature_v4.go. * iam: address critical Copilot security and consistency feedback Fixed three critical issues identified by Copilot code review: 1. SECURITY: Prevent loading disabled service account credentials - Added check to skip disabled service accounts during credential loading - Disabled accounts can no longer authenticate 2. Add DurationSeconds validation for STS AssumeRoleWithWebIdentity - Enforce AWS-compatible range: 900-43200 seconds (15 min - 12 hours) - Returns proper error for out-of-range values 3. Fix expiration update consistency in UpdateServiceAccount - Added key existence check like Description field - Allows explicit clearing of expiration by setting to empty string - Distinguishes between "not updating" and "clearing expiration" * sts: remove unused durationSecondsStr variable Fixed build error from unused variable after refactoring duration parsing. * iam: address remaining Copilot feedback and remove dead code Completed remaining Copilot code review items: 1. Remove unused getPermission() method (dead code) - Method was defined but never called anywhere 2. Improve slice modification safety in DeleteServiceAccount - Replaced append-with-slice-operations with filter pattern - Avoids potential issues from mutating slice during iteration 3. Fix route registration order - Moved STS route registration BEFORE IAM route - Prevents IAM route from intercepting STS requests - More specific route (with query parameter) now registered first * iam: improve expiration validation and test cleanup robustness Addressed additional Copilot feedback: 1. Make expiration validation more explicit - Added explicit check for negative values - Added comment clarifying that 0 is allowed to clear expiration - Improves code readability and intent 2. Fix test cleanup order in s3_service_account_test.go - Track created service accounts in a slice - Delete all service accounts before deleting parent user - Prevents DeleteConflictException during cleanup - More robust cleanup even if test fails mid-execution Note: s3_service_account_security_test.go already had correct cleanup order due to LIFO defer execution. * test: remove redundant variable assignments Removed duplicate assignments of createdSAId, createdAccessKeyId, and createdSecretAccessKey on lines 148-150 that were already assigned on lines 132-134.	2025-12-29 20:17:23 -08:00
Chris Lu	56ebd9236e	Fix: relax gRPC server keepalive enforcement to 20s (#7898 ) * Fix: relax gRPC server keepalive enforcement to 20s to prevent 'too_many_pings' errors * Refactor: introduce GrpcKeepAliveMinimumTime constant for clarity	2025-12-28 12:03:21 -08:00
Chris Lu	95716a2f87	less verbose	2025-12-26 12:55:19 -08:00
Chris Lu	5aa111708d	grpc: reduce client idle pings to avoid ENHANCE_YOUR_CALM (#7885 ) * grpc: reduce client idle pings to avoid ENHANCE_YOUR_CALM (too_many_pings) * test: use context.WithTimeout and pb constants for keepalive * test(kafka): use separate dial and client contexts in NewDirectBrokerClient * test(kafka): fix client context usage in NewDirectBrokerClient	2025-12-26 10:58:18 -08:00
Chris Lu	2567be8040	refactor: remove unused gRPC connection age parameters (#7852 ) The GrpcMaxConnectionAge and GrpcMaxConnectionAgeGrace constants have a troubled history - they were removed in 2022 due to gRPC issues, reverted later, and recently re-added. However, they are not essential to the core worker reconnection fix which was solved through proper goroutine ordering. The Docker Swarm DNS handling mentioned in the comments is not critical, and these parameters have caused problems in the past. Removing them simplifies the configuration without losing functionality.	2025-12-22 18:25:21 -08:00

1 2 3 4 5 ...

650 Commits