seaweedFS

Author	SHA1	Message	Date
Chris Lu	ba624f1f34	Rust volume server implementation with CI (#8539 ) * Match Go gRPC client transport defaults * Honor Go HTTP idle timeout * Honor maintenanceMBps during volume copy * Honor images.fix.orientation on uploads * Honor cpuprofile when pprof is disabled * Match Go memory status payloads * Propagate request IDs across gRPC calls * Format pending Rust source updates * Match Go stats endpoint payloads * Serve Go volume server UI assets * Enforce Go HTTP whitelist guards * Align Rust metrics admin-port test with Go behavior * Format pending Rust server updates * Honor access.ui without per-request JWT checks * Honor keepLocalDatFile in tier upload shortcut * Honor Go remote volume write mode * Load tier backends from master config * Check master config before loading volumes * Remove vif files on volume destroy * Delete remote tier data on volume destroy * Honor vif version defaults and overrides * Reject mismatched vif bytes offsets * Load remote-only tiered volumes * Report Go tail offsets in sync status * Stream remote dat in incremental copy * Honor collection vif for EC shard config * Persist EC expireAtSec in vif metadata * Stream remote volume reads through HTTP * Serve HTTP ranges from backend source * Match Go ReadAllNeedles scan order * Match Go CopyFile zero-stop metadata * Delete EC volumes with collection cleanup * Drop deleted collection metrics * Match Go tombstone ReadNeedleMeta * Match Go TTL parsing: all-digit default to minutes, two-pass fit algorithm * Match Go needle ID/cookie formatting and name size computation * Match Go image ext checks: webp resize only, no crop; empty healthz body * Match Go Prometheus metric names and add missing handler counter constants * Match Go ReplicaPlacement short string parsing with zero-padding * Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS * Add walk_ecx_stats for accurate EC volume file counts and size * Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision * 
Match Go needle map: unconditional delete counter, fix redb idx walk offset * Add CompactMapSegment overflow panic guard matching Go * Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback * Match Go 304 Not Modified: return bare status with no headers * Match Go JWT error message: use "wrong jwt" instead of detailed error * Match Go read handler bare 400, delete error prefix, download throttle timeout * Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix * Match Go heartbeat: keep is_heartbeating on error, add EC shard identification * Match Go needle ReadBytes V2: tolerate EOF on truncated body * Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard * Match Go DeleteCollection: propagate destroy errors * Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code * Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding * Match Go Guard isWriteActive: combine whitelist and signing key check * Match Go DeleteCollectionMetrics: use partial label matching * Match Go heartbeat: send state-only delta on volume state changes * Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking * Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs * Add read_ec_shard_needle for full needle reconstruction from local EC shards * Make heartbeat master config helpers pub for VolumeCopy preallocation * Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording * Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error * Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling * Add rebuild_ecx_file for EC index reconstruction from data shards * 
Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation * Add EC volume read and delete support in HTTP handlers * Add per-shard EC mount/unmount, location predicate search, idx directory for EC * Add CheckVolumeDataIntegrity on volume load matching Go * Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type * Match Go heartbeat metric accounting * Match Go duplicate UUID heartbeat retries * Delete expired EC volumes during heartbeat * Match Go volume heartbeat pruning * Honor master preallocate in volume max * Report remote storage info in heartbeats * Emit EC heartbeat deltas on shard changes * Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space * Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns * Match Go VolumeUnmount: idempotent success when volume not found * Match Go TTL Display: return empty string when unit is Empty Go checks `t.Unit == Empty` separately and returns "" for TTLs with nonzero count but Empty unit. Rust only checked is_empty() (count==0 && unit==0), so count>0 with unit=0 would format as "5 " instead of "". * Match Go error behavior for truncated needle data in read_body_v2 Go's readNeedleDataVersion2 returns "index out of range %d" errors (indices 1-7) when needle body or metadata fields are truncated. Rust was silently tolerating truncation and returning Ok. Now returns NeedleError::IndexOutOfRange with the matching index for each field. 
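The TTL Display fix above can be illustrated with a minimal sketch. The `Unit` variants and their suffixes are hypothetical stand-ins, not the real encoding; only the rule that a nonzero count with an Empty unit renders as an empty string comes from the commit text.

```rust
use std::fmt;

/// Hypothetical unit set for illustration; the real type has its own encoding.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Unit {
    Empty,
    Minute,
    Hour,
    Day,
}

#[derive(Clone, Copy, Debug)]
pub struct Ttl {
    pub count: u8,
    pub unit: Unit,
}

impl fmt::Display for Ttl {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // A zero count and a nonzero count with an Empty unit both render as "".
        if self.count == 0 || self.unit == Unit::Empty {
            return Ok(());
        }
        let suffix = match self.unit {
            Unit::Minute => "m",
            Unit::Hour => "h",
            Unit::Day => "d",
            Unit::Empty => unreachable!("handled by the early return"),
        };
        write!(f, "{}{}", self.count, suffix)
    }
}
```

With this shape, `Ttl { count: 5, unit: Unit::Empty }` formats as an empty string rather than as a count followed by a stray space.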
* Match Go download throttle: return JSON error instead of plain text * Match Go crop params: default x1/y1 to 0 when not provided * Match Go ScrubEcVolume: accumulate total_files from EC shards * Match Go ScrubVolume: count total_files even on scrub error * Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif * Match Go VolumeTailSender: send needle_header on every chunk * Match Go read_super_block: apply replication override from .vif * Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption * Match Go WriteNeedleBlob: dedup check before writing during replication * handlers: use meta-only reads for HEAD * handlers: align range parsing and responses with Go * handlers: align upload parsing with Go * deps: enable webp support * Make 5bytes the default feature for idx entry compatibility * Match Go TTL: preserve original unit when count fits in byte * Fix EC locate_needle: use get_actual_size for full needle size * Fix raw body POST: only parse multipart when Content-Type contains form-data * Match Go ReceiveFile: return protocol errors in response body, not gRPC status * add docs * Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating * Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing * Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection * Match Go privateStoreHandler: use proper JSON error for unsupported methods * Match Go Destroy: add only_empty parameter to reject non-empty volume deletion * Fix compilation: set_read_only_persist and set_writable return () These methods fire-and-forget save_vif internally, so gRPC callers should not try to chain .map_err() on the unit return type. 
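The raw body POST fix above amounts to gating multipart parsing on the Content-Type header. A minimal sketch, assuming a hypothetical helper name and a plain substring check (header case handling is not addressed here):

```rust
/// Hypothetical helper: returns true only when the upload body should be
/// parsed as multipart; anything else is treated as a raw body.
pub fn is_multipart_form(content_type: Option<&str>) -> bool {
    match content_type {
        Some(ct) => ct.contains("form-data"),
        None => false,
    }
}
```

A request with `application/octet-stream`, or with no Content-Type at all, would then be stored as raw bytes instead of being run through the multipart parser.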
* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif * Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth The gRPC VolumeDelete handler had a pre-check for only_empty but then passed false to store.delete_volume(), bypassing the store-level check. Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same for defense in depth against TOCTOU races (though the store write lock makes this unlikely). * Match Go ProcessRangeRequest: return full content for empty/oversized ranges Go returns nil from ProcessRangeRequest when ranges are empty or total range size exceeds content length, causing the caller to serve the full content as a normal 200 response. Rust was returning an empty 200 body. * Match Go Query: quote JSON keys in output records Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}. Rust was producing invalid JSON with unquoted keys like {name:"Alice"}. * Match Go VolumeCopy: reject when no suitable disk location exists Go returns ErrVolumeNoSpaceLeft when no location matches the disk type and has sufficient space. Rust had an unsafe fallback that silently picked the first location regardless of type or available space. * Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete Go checks v.noWriteOrDelete before proceeding with needle deletion, returning "volume is read only" if true. Rust was skipping this check. * Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors Two fixes: (1) Go prefers HardDriveType disk location for EC volumes, falling back to first location. Returns "no storage location available" when no locations exist. (2) Write failures are now response-level errors (in response body) instead of gRPC status errors, matching Go. * Match Go CopyFile: sync EC volume journal to disk before copying Go calls ecVolume.Sync() before copying EC volume files to ensure the .ecj journal is flushed to disk. 
Added sync_to_disk() to EcVolume and call it in the CopyFile EC branch. * Match Go readSuperBlock: propagate replication parse errors Go returns an error when parsing the replication string from the .vif file fails. Rust was silently ignoring the parse failure and using the super block's replication as-is. * Match Go TTL expiry: remove append_at_ns > 0 guard Go computes TTL expiry from AppendAtNs without guarding against zero. When append_at_ns is 0, the expiry is epoch + TTL which is in the past, correctly returning NotFound. Rust's extra guard skipped the check, incorrectly returning success for such needles. * Match Go delete_collection: skip volumes with compaction in progress Go checks !v.isCompactionInProgress.Load() before destroying a volume during collection deletion, skipping compacting volumes. Also changed destroy errors to log instead of aborting the entire collection delete. * Match Go MarkReadonly/MarkWritable: always notify master even on local error Go always notifies the master regardless of whether the local set_read_only_persist or set_writable step fails. The Rust code was using `?` which short-circuited on error, skipping the final master notification. Save the result and defer the `?` until after the notify call. * Match Go PostHandler: return 500 for all write errors Go returns 500 (InternalServerError) for all write failures. Rust was returning 404 for volume-not-found and 403 for read-only volumes. * Match Go makeupDiff: validate .cpd compaction revision is old + 1 Go reads the new .cpd file's super block and verifies the compaction revision is exactly old + 1. Rust only validated the old revision. * Match Go VolumeStatus: check data backend before returning status Go checks v.DataBackend != nil before building the status response, returning an error if missing. Rust was silently returning size 0. * Match Go PostHandler: always include mime field in upload response JSON Go always serializes the mime field even when empty ("mime":""). 
Rust was omitting it when empty due to Option<String> with skip_serializing_if. * Match Go FindFreeLocation: account for EC shards in free slot calculation Go subtracts EC shard equivalents when computing available volume slots. Rust was only comparing volume count, potentially over-counting free slots on locations with many EC shards. * Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods Go records the method as INVALID in metrics for unsupported HTTP methods. Rust was using the actual method name. * Match Go volume: add commit_compact guard and scrub data size validation Two fixes: (1) commit_compact now checks/sets is_compacting flag to prevent concurrent commits, matching Go's CompareAndSwap guard. (2) scrub now validates total needle sizes against .dat file size. * Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check Three fixes: (1) VolumeTailSender now propagates binary search errors instead of silently falling back to start. (2) VolumeEcShardsInfo returns entries for all shard slots including unmounted. (3) VolumeEcShardRead checks .ecx index for deletions instead of .ecj. * Match Go metrics: add BuildInfo gauge and connection tracking functions Go exposes a BuildInfo Prometheus metric with version labels, and tracks open connections via stats.ConnectionOpen/Close. Added both to Rust. * Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid() Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles with size==0 (live but anomalous) can still be deleted. The Rust code was using is_valid() which returns false for size==0, preventing deletion of such needles. * Match Go fitTtlCount: always normalize TTL to coarsest unit Go's fitTtlCount always converts to seconds first, then finds the coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an early return for count<=255 that skipped normalization, producing different binary encodings for the same duration. 
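The fitTtlCount normalization described above can be sketched as follows. This is a simplified illustration, not the actual implementation: the `Unit` enum, the 30-day month and 365-day year are assumptions, and the fallback pass for durations that do not divide evenly is omitted (the sketch returns `None` there).

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Unit {
    Minute,
    Hour,
    Day,
    Week,
    Month,
    Year,
}

impl Unit {
    /// Seconds per unit. Month = 30 days and Year = 365 days are assumptions.
    const fn seconds(self) -> u64 {
        match self {
            Unit::Minute => 60,
            Unit::Hour => 3_600,
            Unit::Day => 86_400,
            Unit::Week => 7 * 86_400,
            Unit::Month => 30 * 86_400,
            Unit::Year => 365 * 86_400,
        }
    }
}

/// Convert to seconds, then pick the coarsest unit that divides the duration
/// exactly and whose count fits in one byte. `None` stands for the empty TTL
/// (zero seconds) or for "no exact fit" in this simplified sketch.
pub fn fit_ttl_count(count: u32, unit: Unit) -> Option<(u8, Unit)> {
    let seconds = u64::from(count) * unit.seconds();
    if seconds == 0 {
        return None;
    }
    const COARSE_TO_FINE: [Unit; 6] = [
        Unit::Year,
        Unit::Month,
        Unit::Week,
        Unit::Day,
        Unit::Hour,
        Unit::Minute,
    ];
    for &candidate in COARSE_TO_FINE.iter() {
        let per = candidate.seconds();
        if seconds % per == 0 && seconds / per <= 255 {
            return Some(((seconds / per) as u8, candidate));
        }
    }
    None
}
```

Because the conversion always goes through seconds, `120m` and `2h` end up with the same `(count, unit)` pair, which is the property the binary encoding relies on.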
* Match Go BuildInfo metric: correct name and add missing labels Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build, Name=info) with labels [version, commit, sizelimit, goos, goarch]. Rust had SeaweedFS_volumeServer_buildInfo with only [version]. * Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag - UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty) - UploadResult.contentMd5: only include when request provided Content-MD5 header - Content-MD5 response header: only set when request provided it - DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType) to match Go's protobuf JSON marshaling - Chunk manifest: preserve needle ETag in expanded response headers * Match Go volume: fix version(), integrity check, scrub, and commit_compact - version(): use self.version() instead of self.super_block.version in read_all_needles, check_volume_data_integrity, scan_raw_needles_from to respect volumeInfo.version override - check_volume_data_integrity: initialize healthy_index_size to idx_size (matching Go) and continue on EOF instead of returning error - scrub(): count deleted needles in total_read since they still occupy space in the .dat file (matches Go's totalRead += actualSize for deleted) - commit_compact: clean up .cpd/.cpx files on makeup_diff failure (matches Go's error path cleanup) * Match Go write queue: add 4MB batch byte limit Go's startWorker breaks the batch at either 128 requests or 4MB of accumulated write data. Rust only had the 128-request limit, allowing large writes to accumulate unbounded latency. * Add TTL normalization tests for Go parity verification Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even when count fits in a byte, matching Go's fitTtlCount behavior. * Match Go FindFreeLocation: account for EC shards in free slot calculation Go's free volume count subtracts both regular volumes and EC volumes from max_volume_count. 
Rust was only counting regular volumes, which could over-report available slots when EC shards are mounted. * Match Go EC volume: mark deletions in .ecx and replay .ecj at startup Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj entries into .ecx on startup, then removes the .ecj file. Rust was only appending to .ecj without marking .ecx, which meant deleted EC needles remained readable via .ecx binary search. This fix: - Opens .ecx in read/write mode (was read-only) - Adds mark_needle_deleted_in_ecx: binary search + in-place write - Calls it from journal_delete before appending to .ecj - Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup * Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14 Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by default, but custom EC configs via .vif can have more shards (up to MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files are checked regardless of EC configuration. * Match Go EC locate: subtract 1 from shard size and use datFileSize override Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to LocateData (shards are padded, -1 avoids overcounting large block rows). When datFileSize is known, Go uses datFileSize/DataShards instead. Rust was passing the raw shard file size without adjustment. * Fix TTL parsing and DiskStatus field names to match Go exactly TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d, not 1w) and errors on count > 255. The previous normalization change was incorrect — Go only normalizes internally via fitTtlCount, not during string parsing. DiskStatus: Go uses encoding/json on protobuf structs, which reads the json struct tags (snake_case: percent_free, percent_used, disk_type), not the protobuf JSON names (camelCase). 
Revert to snake_case to match Go's actual output. * Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first Match Go's volume_grpc_client_to_master.go behavior: 1. Only trigger leader redirect when the leader address differs from the current master (prevents unnecessary reconnect loops when master confirms its own address). 2. Process duplicated_uuids before leader redirect check, matching Go's ordering where duplicate UUID detection takes priority. * Remove SetState version check to match Go behavior Go's SetState unconditionally applies the state without any version mismatch check. The Rust version had an extra optimistic concurrency check that would reject valid requests from Go clients that don't track versions. * Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w, 24h->1d). The Rust version was preserving the original unit, producing different binary encodings on disk and in heartbeat messages. * Always return Content-MD5 header and JSON field on successful writes Go always sets Content-MD5 in the response regardless of whether the request included it. The Rust version was conditionally including it only when the request provided Content-MD5. * Include name and size in UploadResult JSON even when empty/zero Go's encoding/json always includes empty strings and zero values in the upload response. The Rust version was using skip_serializing_if to omit them, causing JSON structure differences. * Include deleted needles in scan_raw_needles_from to match Go Go's ScanVolumeFileFrom visits ALL needles including deleted ones. Skipping deleted entries during incremental copy would cause tombstones to not be propagated, making deleted files reappear on the receiving side. 
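The heartbeat ordering from the fix above (duplicate UUIDs first, then a redirect only when the leader differs from the current master) can be sketched as a small decision function. The struct, enum, and field names are hypothetical; the real response is a protobuf message.

```rust
/// Hypothetical, trimmed-down view of a master heartbeat response.
#[derive(Debug, Default)]
pub struct HeartbeatResponse {
    pub duplicated_uuids: Vec<String>,
    pub leader: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    HandleDuplicateUuids(Vec<String>),
    RedirectTo(String),
    Continue,
}

pub fn next_step(resp: &HeartbeatResponse, current_master: &str) -> NextStep {
    // 1. Duplicate UUID detection takes priority over any redirect.
    if !resp.duplicated_uuids.is_empty() {
        return NextStep::HandleDuplicateUuids(resp.duplicated_uuids.clone());
    }
    // 2. Redirect only when the reported leader is a different address;
    //    a master confirming its own address must not trigger a reconnect.
    if !resp.leader.is_empty() && resp.leader != current_master {
        return NextStep::RedirectTo(resp.leader.clone());
    }
    NextStep::Continue
}
```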
* Match Go NeedleMap.Delete: always write tombstone to idx file Go's NeedleMap.Delete unconditionally writes a tombstone entry to the idx file and updates metrics, even if the needle doesn't exist or is already deleted. This is important for replication where every delete operation must produce an idx write. The Rust version was skipping the tombstone write for non-existent or already-deleted needles. * Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest * Title-case Seaweed-* pair keys to match Go HTTP header canonicalization * Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType * Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw() * Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount * Include disk-space-low in Volume.is_read_only() matching Go * Log error on CIDR parse failure in whitelist matching Go's glog.Errorf * Log cookie mismatch in gRPC Query matching Go's V(0).Infof * Fix is_expired volume_size comparison to use < matching Go Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than), but Rust used `<=`. This meant Rust would fail to expire a volume that is exactly SUPER_BLOCK_SIZE bytes. * Apply Go's JWT expiry defaults: 10s write, 60s read Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust defaulted to 0 for both, which meant tokens would never expire when security.toml has a signing key but omits expires_after_seconds. * Stop [grpc.volume].ca from overriding [grpc].ca matching Go Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e. the [grpc] section. The [grpc.volume] section only provides cert and key. Rust was also reading ca from [grpc.volume] which would silently override the [grpc].ca value when both were present. * Fix free_volume_count to use EC shard count matching Go Was counting EC volumes instead of EC shards, which underestimates EC space usage. 
One EC volume with 14 shards uses ~1.4 volume slots, not 1. Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount. * Include preallocate in compaction space check matching Go Go uses max(preallocate, estimatedCompactSize) for the free space check. Rust was only using the estimated volume size, which could start a compaction that fails mid-way if preallocate exceeds the volume size. * Check gzip magic bytes before setting Content-Encoding matching Go Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent (data starts with 0x1f 0x8b) before setting Content-Encoding: gzip. Rust only checked Accept-Encoding, which could incorrectly declare gzip encoding for non-gzip compressed data. * Only set upload response name when needle HasName matching Go Go checks reqNeedle.HasName() before setting ret.Name. Rust always set the name from the filename variable, which could return the fid portion of the path as the name for raw PUT requests without a filename. * Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount is 0, treating it as unlimited. Rust was computing effective_free as <= 0 for max==0, rejecting the location. This could fail volume creation during early startup before the first heartbeat adjusts max. * Read lastAppendAtNs from deleted V3 entries in integrity check Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries (verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity). Rust was skipping deleted entries, which could result in a stale last_append_at_ns if the last index entry is a deletion. * Return empty body for empty/oversized range requests matching Go Go's ProcessRangeRequest returns nil (empty body, 200 OK) when parsed ranges are empty or combined range size exceeds total content size. The Rust buffered path incorrectly returned the full file data for both cases. 
The streaming path already handled this correctly. * Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex() (ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls vs.store.ScrubEcVolume(). Rust was ignoring the mode and always running verify_ec_shards. Now INDEX mode checks ecx index integrity (sorted overlap detection + file size validation) without shard I/O, while LOCAL/FULL modes run the existing shard verification. * Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week} which displays as "1w". Both Go and Rust normalize identically. * Add version mismatch check to SetState matching Go's State.Update Go's State.Update compares the incoming version with the stored version and returns "version mismatch" error if they differ. This provides optimistic concurrency control. The Rust implementation was accepting any version unconditionally. * Use unquoted keys in Query JSON output matching Go's json.ToJson Go's json.ToJson produces records with unquoted keys like {score:12} not {"score":12}. This is a custom format used internally by SeaweedFS for query results. * Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount which converts 7 days to 1 week. * Include ETag header in 304 Not Modified responses matching Go behavior Go sets ETag on the response writer (via SetEtag) before the If-Modified-Since and If-None-Match conditional checks, so both 304 response paths include the ETag header. The Rust implementation was only adding ETag to 200 responses. * Remove needle-name fallback in chunk manifest filename resolution Go's tryHandleChunkedFile only falls back from URL filename to manifest name. 
Rust had an extra fallback to needle.name that Go does not perform, which could produce different Content-Disposition filenames for chunk manifests. * Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5 Go's jwt.ParseWithClaims validates the nbf claim when present, rejecting tokens whose nbf is in the future. The Rust jsonwebtoken crate defaults validate_nbf to false, so tokens with future nbf were incorrectly accepted. * Set isHeartbeating to true at startup matching Go's VolumeServer init Go unconditionally sets isHeartbeating: true in the VolumeServer struct literal. Rust was starting with false when masters are configured, causing /healthz to return 503 until the first heartbeat succeeds. * Call store.close() on shutdown matching Go's Shutdown() Go's Shutdown() calls vs.store.Close() which closes all volumes and flushes file handles. The Rust server was relying on process exit for cleanup, which could leave data unflushed. * Include server ID in maintenance mode error matching Go's format Go returns "volume server %s is in maintenance mode" with the store ID. Rust was returning a generic "maintenance mode" message. * Fix DiskType test: use HardDrive variant matching Go's HddType="" Go maps both "" and "hdd" to HardDriveType (empty string). The Rust enum variant is HardDrive, not Hdd. The test referenced a nonexistent Hdd variant causing compilation failure. * Do not include ETag in 304 responses matching Go's GetOrHeadHandler Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match 304 return paths, so Go's 304 responses do not include the ETag header. The Rust code was incorrectly including ETag in both 304 response paths. * Return 400 on malformed query strings in PostHandler matching Go's ParseForm Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when the query string is malformed. Rust was silently falling back to empty query params via unwrap_or_default(). 
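The DiskType mapping mentioned above (both "" and "hdd" resolve to the single HardDrive variant, whose string form is empty) can be sketched like this. The `Other` catch-all variant and the lowercase normalization are assumptions made for the example.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DiskType {
    HardDrive,
    /// Hypothetical catch-all for any other disk type label.
    Other(String),
}

impl DiskType {
    pub fn parse(raw: &str) -> DiskType {
        // Lowercasing and trimming are assumptions of this sketch.
        let normalized = raw.trim().to_ascii_lowercase();
        match normalized.as_str() {
            "" | "hdd" => DiskType::HardDrive,
            _ => DiskType::Other(normalized),
        }
    }

    /// HardDrive serializes as the empty string.
    pub fn as_str(&self) -> &str {
        match self {
            DiskType::HardDrive => "",
            DiskType::Other(label) => label.as_str(),
        }
    }
}
```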
* Load EC volume version from .vif matching Go's NewEcVolume Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif file. Rust was always using Version::current() (V3), which would produce wrong needle actual size calculations for volumes created with V1 or V2. * Sync .ecx file before close matching Go's EcVolume.Close Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion marks are flushed to disk. Without this, deletion marks written via MarkNeedleDeleted could be lost on crash. * Validate SuperBlock extra data size matching Go's Bytes() guard Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent corrupt super block headers. Rust was silently truncating via u16 cast, which would write an incorrect extra_size field. * Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98 Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable to unauthenticated remote DoS via panic in QUIC transport parameter parsing. * Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server Go's r.FormValue() cannot read multipart text fields after r.MultipartReader() consumes the body, so ts/ttl sent as multipart form fields only work with the Rust volume server. Skip this test when VOLUME_SERVER_IMPL != "rust" to fix CI failure. * Flush .ecx in EC volume sync_to_disk matching Go's Sync() Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index to disk. The Rust version only flushed .ecj, leaving in-place deletion marks in .ecx unpersisted until close(). This could cause data inconsistency if the server crashes after marking a needle deleted in .ecx but before close(). * Remove .vif file in EC volume destroy matching Go's Destroy() Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust version only removed .ecx and .ecj, leaving orphaned .vif files on disk after EC volume destruction (e.g., after TTL expiry).
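The SuperBlock extra data guard above can be sketched as a checked conversion, assuming the limit is 256*256-2 (65534) bytes. The function name is hypothetical, and where Go aborts via glog.Fatalf this sketch returns an error instead.

```rust
/// Largest extra-data length accepted by the guard (256*256-2 = 65534).
pub const MAX_EXTRA_SIZE: usize = 256 * 256 - 2;

/// Hypothetical helper: validate the length before narrowing it to u16,
/// instead of letting an `as u16` cast silently wrap.
pub fn checked_extra_size(extra: &[u8]) -> Result<u16, String> {
    if extra.len() > MAX_EXTRA_SIZE {
        return Err(format!(
            "super block extra data too large: {} bytes (max {})",
            extra.len(),
            MAX_EXTRA_SIZE
        ));
    }
    // Safe: the length is now known to fit in 16 bits.
    Ok(extra.len() as u16)
}
```

Without the check, a 70000-byte payload cast with `as u16` would wrap around and record a much smaller size in the header than the data actually written.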
* Fix is_expired to use <= for SuperBlockSize check matching Go Go checks contentSize <= SuperBlockSize to detect empty volumes (no needles). Rust used < which would incorrectly allow a volume with exactly SuperBlockSize bytes (header only, no data) to proceed to the TTL expiry check and potentially be marked as expired. * Fix read_append_at_ns to read timestamps from tombstone entries Go reads the full needle body for all entries including tombstones (deleted needles with size=0) to extract the actual AppendAtNs timestamp. The Rust version returned 0 early for size <= 0 entries, which would cause the binary search in incremental copy to produce incorrect results for positions containing deleted needles. Now uses get_actual_size to compute the on-disk size (which handles tombstones correctly) and only returns 0 when the actual size is 0. * Add X-Request-Id response header matching Go's requestIDMiddleware Go sets both X-Request-Id and x-amz-request-id response headers. The Rust server only set x-amz-request-id, missing X-Request-Id. * Add skip_serializing_if for UploadResult name and size fields Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty", omitting these fields from JSON when they are zero values (empty string / 0). The Rust struct always serialized them, producing "name":"" and "size":0 where Go would omit them. * Support JSONP/pretty-print for write success responses Go's writeJsonQuiet checks for callback (JSONP) and pretty query parameters on all JSON responses including write success. The Rust write success path used axum::Json directly, bypassing JSONP and pretty-print support. Now uses json_result_with_query to match Go. * Include actual limit in file size limit error message Go returns "file over the limited %d bytes" with the actual limit value included. Rust returned a generic "file size limit exceeded" without the limit value, making it harder to debug. 
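The JSONP handling for write success responses described above boils down to wrapping the serialized JSON in the callback name when a `callback` query parameter is present. A minimal sketch with a hypothetical helper name; pretty-printing and response headers are left out.

```rust
/// Hypothetical helper: wrap an already-serialized JSON body for JSONP when
/// a non-empty callback name was supplied, otherwise return it unchanged.
pub fn wrap_jsonp(json: &str, callback: Option<&str>) -> String {
    match callback {
        Some(name) if !name.is_empty() => format!("{}({})", name, json),
        _ => json.to_string(),
    }
}
```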
* Extract extension from 2-segment URL paths for image operations Go's parseURLPath extracts the file extension from all URL formats including 2-segment paths like /vid,fid.jpg. The Rust version only handled 3-segment paths (/vid/fid/filename.ext), so extensions in 2-segment paths were lost. This caused image resize/crop operations requested via query params to be silently skipped for those paths. * Add size_hint to TrackedBody so throttled downloads get Content-Length TrackedBody (used for download throttling) did not implement size_hint(), causing HTTP/1.1 to fall back to chunked transfer encoding instead of setting Content-Length. Go always sets Content-Length explicitly for non-range responses. * Add Last-Modified, pairs, and S3 headers to chunk manifest responses Go sets Last-Modified, needle pairs, and S3 pass-through headers on the response writer BEFORE calling tryHandleChunkedFile. Since the Rust chunk manifest handler created fresh response headers and returned early, these headers were missing from chunk manifest responses. Now passes last_modified_str into the chunk manifest handler and applies pairs and S3 pass-through query params (response-cache-control, response-content-encoding, etc.) to the chunk manifest response headers. * Fix multipart fallback to use first part data when no filename Go reads the first part's data unconditionally, then looks for a part with a filename. If none found, Go uses the first part's data (with empty filename). Rust only captured parts with filenames, so when no part had a filename it fell back to the raw multipart body bytes (including boundary delimiters), producing corrupt needle data. * Set HasName and HasMime flags for empty values matching Go Go's CreateNeedleFromRequest sets HasName and HasMime flags even when the filename or MIME type is empty (len < 256 is true for len 0). 
Rust skipped empty values, causing the on-disk needle format to differ: Go-written needles include extra bytes for the empty name/mime size fields, changing the serialized needle size in the idx entry. This ensures binary format compatibility between Go and Rust servers. * Add is_stopping guard to vacuum_volume_commit matching Go Go's CommitCompactVolume (store_vacuum.go L53-54) checks s.isStopping before committing compaction to prevent file swaps during shutdown. The Rust handler was missing this check, which could allow compaction commits while the server is stopping. * Remove disk_type from required status fields since Go omits it Go's default DiskType is "" (HardDriveType), and protobuf's omitempty tag causes empty strings to be dropped from JSON output. * test: honor rust env in dual volume harness * grpc: notify master after volume lifecycle changes * http: proxy to replicas before download-limit timeout * test: pass readMode to rust volume harnesses * fix store free-location predicate selection * fix volume copy disk placement and heartbeat notification * fix chunk manifest delete replication * fix write replication to survive client disconnects * fix download limit proxy and wait flow * fix crop gating for streamed reads * fix upload limit wait counter behavior * fix chunk manifest image transforms * fix has_resize_ops to check width/height > 0 instead of is_some() Go's shouldResizeImages condition is `width > 0 \|\| height > 0`, so `?width=0` correctly evaluates to false. Rust was using `is_some()` which made `?width=0` evaluate to true, unnecessarily disabling streaming reads for those requests. * fix Content-MD5 to only compute and return when provided by client Go only computes the MD5 of uncompressed data when a Content-MD5 header or multipart field is provided. Rust was always computing and returning it. Also fix the mismatch error message to include size, matching Go's format. 
* fix save_vif to compute ExpireAtSec from TTL Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds when the volume has a TTL. The save_vif path (used by set_read_only and set_writable) was missing this computation, causing .vif files to be written without the correct expiration timestamp for TTL volumes. * fix set_writable to not modify no_write_can_delete Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists. Rust was additionally setting no_write_can_delete=has_remote_file, which could incorrectly change the write mode for remote-file volumes when the master explicitly asks to make the volume writable. * fix write_needle_blob_and_index to error on too-small V3 blob Go returns an error when the needle blob is too small for timestamp patching. Rust was silently skipping the patch and writing the blob with a stale/zero timestamp, which could cause data integrity issues during incremental replication that relies on AppendAtNs ordering. * fix VolumeEcShardsToVolume to validate dataShards range Go validates that dataShards is > 0 and <= MaxShardCount before proceeding with EC-to-volume reconstruction. Without this check, a zero or excessively large data_shards value could cause confusing downstream failures. * fix destroy to use VolumeError::NotEmpty instead of generic Io error The dedicated NotEmpty variant exists in the enum but was not being used. This makes error matching consistent with Go's ErrVolumeNotEmpty. * fix SetState to persist state to disk with rollback on failure Go's State.Update saves VolumeServerState to a state.pb file after each SetState call, and rolls back the in-memory state if persistence fails. Rust was only updating in-memory atomics, so maintenance mode would be lost on server restart. Now saves protobuf-encoded state.pb and loads it on startup. 
* fix VolumeTierMoveDatToRemote to close local dat backend after upload Go calls v.LoadRemoteFile() after saving volume info, which closes the local DataBackend before transitioning to remote storage. Without this, the volume holds a stale file handle to the deleted local .dat file, causing reads to fail until server restart. * fix VolumeTierMoveDatFromRemote to close remote dat backend after download Go calls v.DataBackend.Close() and sets DataBackend=nil after removing the remote file reference. Without this, the stale remote backend state lingers and reads may not discover the newly downloaded local .dat file until server restart. * fix redirect to use internal url instead of public_url Go's proxyReqToTargetServer builds the redirect Location header from loc.Url (the internal URL), not publicUrl. Using public_url could cause redirect failures when internal and external URLs differ. * fix redirect test and add state_file_path to integration test Update redirect unit test to expect internal url (matching the previous fix). Add missing state_file_path field to the integration test VolumeServerState constructor. * fix FetchAndWriteNeedle to await all writes before checking errors Go uses a WaitGroup to await all writes (local + replicas) before checking errors. Rust was short-circuiting on local write failure, which could leave replica writes in-flight without waiting for completion. * fix shutdown to send deregister heartbeat before pre_stop delay Go's StopHeartbeat() closes stopChan immediately on interrupt, causing the heartbeat goroutine to send the deregister heartbeat right away, before the preStopSeconds delay. Rust was only setting is_stopping=true without waking the heartbeat loop, so the deregister was delayed until after the pre_stop sleep. Now we call volume_state_notify.notify_one() to wake the heartbeat immediately. 
* fix heartbeat response ordering to check duplicate UUIDs first Go processes heartbeat responses in this order: DuplicatedUuids first, then volume options (prealloc/size limit), then leader redirect. Rust was applying volume options before checking for duplicate UUIDs, which meant volume option changes would take effect even when the response contained a duplicate UUID error that should cause an immediate return. * the test thread was blocked * fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check logic error). * fix: match Go Content-MD5 mismatch error message format Go uses "Content-MD5 did not match md5 of file data expected [X] received [Y] size Z" while Rust had a shorter format. Match the exact Go error string so clients see identical messages. * fix: match Go Bearer token length check (> 7, not >= 7) Go requires len(bearer) > 7 ensuring at least one char after "Bearer ". Rust used >= 7 which would accept an empty token. * fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4 aws-sdk-s3's default "rustls" feature enables tls-rustls in aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21 → rustls-webpki 0.101.7, moderate CRL advisory). Replace with explicit default-https-client which uses only rustls 0.23 / rustls-webpki 0.103.9. * fix: use uploaded filename for auto-compression extension detection Go extracts the file extension from pu.FileName (the uploaded filename) for auto-compression decisions. Rust was using the URL path, which typically has no extension for SeaweedFS file IDs. * fix: add CRC legacy Value() backward-compat check on needle read Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) != crc.Value(). 
The Value() path is a deprecated transform for compat with seaweed versions prior to commit `056c480eb`. Rust had the legacy_value() method but wasn't using it in validation. * fix: remove /stats/* endpoints to match Go (commented out since L130) Go's volume_server.go has the /stats/counter, /stats/memory, and /stats/disk endpoints commented out (lines 130-134). Remove them from the Rust router along with the now-unused whitelist_guard middleware. * fix: filter application/octet-stream MIME for chunk manifests Go's tryHandleChunkedFile (L334) filters out application/octet-stream from chunk manifest MIME types, falling back to extension-based detection. Rust was returning the stored MIME as-is for manifests. * fix: VolumeMarkWritable returns error before notifying master Go returns early at L200 if MarkVolumeWritable fails, before reaching the master notification at L206. Rust was notifying master even on failure, creating inconsistent state where master thinks the volume is writable but local marking failed. * fix: check volume existence before maintenance in MarkReadonly/Writable Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255) look up the volume first, then call makeVolumeReadonly/Writable which checks maintenance. Rust was checking maintenance first, returning "maintenance mode" instead of "not found" for missing volumes. * feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360) Add the mark_broken_volumes_readonly flag from PR #8360: - Sync proto field (tag 3) to local volume_server.proto - After scrubbing, if flag is set, call makeVolumeReadonly on each broken volume (notify master, mark local readonly, notify again) - Collect errors via joined error semantics matching Go's errors.Join - Factor out make_volume_readonly helper reused by both VolumeMarkReadonly and ScrubVolume Also refactors VolumeMarkReadonly to use the shared helper. 
* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4) CRL Distribution Point matching logic fix for moderate severity advisory about CRLs not considered authoritative. * test: update integration tests for removed /stats/* endpoints Replace tests that expected /stats/* routes to return 200/401 with tests confirming they now fall through to the store handler (400), matching Go's commented-out stats endpoints. * docs: fix misleading comment about default offset feature The comment said "4-byte offsets unless explicitly built with 5-byte support" but the default feature enables 5bytes. This is intentional for production parity with Go -tags 5BytesOffset builds. Fix the comment to match reality.	2026-03-26 17:24:35 -07:00
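The Bearer token length fix in the commit above (`> 7`, not `>= 7`) can be illustrated with a minimal Go sketch. The function name `bearerToken` is hypothetical and this is not the SeaweedFS implementation; it only shows why the strict comparison rejects a header that carries nothing after the `Bearer ` prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an Authorization header value.
// Hypothetical helper for illustration only.
// The length check is strictly greater than len("Bearer ") (7 bytes), so a
// header that is exactly "Bearer " yields no token instead of an empty one.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if len(header) > len(prefix) && strings.EqualFold(header[:len(prefix)], prefix) {
		return header[len(prefix):], true
	}
	return "", false
}

func main() {
	for _, h := range []string{"Bearer abc", "Bearer ", "Basic xyz"} {
		tok, ok := bearerToken(h)
		fmt.Printf("%q -> token=%q ok=%v\n", h, tok, ok)
	}
}
```

With `>=` in place of `>`, the second header would be accepted with an empty token, which is the behavior the commit removes.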
Chris Lu	5fa5507234	Add Prometheus metric to count upload errors (#8788) Add Prometheus metric to count upload errors (#8775) Add SeaweedFS_upload_error_total counter labeled by HTTP status code, so operators can alert on write/replication failures. Code "0" indicates a transport error (no HTTP response received). Also add an "Upload Errors" panel to the Grafana dashboard.	2026-03-26 16:58:05 -07:00
Chris Lu	17028fbf59	fix: serialize SSE-KMS metadata when bucket default encryption applies KMS (#8780 ) * fix: serialize SSE-KMS metadata when bucket default encryption applies KMS When a bucket has default SSE-KMS encryption enabled and a file is uploaded without explicit SSE headers, the encryption was applied correctly but the SSE-KMS metadata (x-seaweedfs-sse-kms-key) was not serialized. This caused downloads to fail with "empty SSE-KMS metadata" because the entry's Extended map stored an empty byte slice. The existing code already handled this for SSE-S3 bucket defaults (SerializeSSES3Metadata) but was missing the equivalent call to SerializeSSEKMSMetadata for the KMS path. Fixes seaweedfs/seaweedfs#8776 * ci: add KMS integration tests to GitHub Actions Add a kms-tests.yml workflow that runs on changes to KMS/SSE code with two jobs: 1. KMS provider tests: starts OpenBao via Docker, runs Go integration tests in test/kms/ against a real KMS backend 2. S3 KMS e2e tests: starts OpenBao + weed mini built from source, runs test_s3_kms.sh which covers bucket-default SSE-KMS upload/download (the exact scenario from #8776) Supporting changes: - test/kms/Makefile: add CI targets (test-provider-ci, test-s3-kms-ci) that manage OpenBao via plain Docker and run weed from source - test/kms/s3-config-openbao-template.json: S3 config template with OpenBao KMS provider for weed mini * refactor: combine SSE-S3 and SSE-KMS metadata serialization into else-if SSE-S3 and SSE-KMS bucket default encryption are mutually exclusive, so use a single if/else-if block instead of two independent if blocks. * Update .github/workflows/kms-tests.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * fix(ci): start weed mini from data dir to avoid Docker filer.toml weed mini reads filer.toml from the current working directory first. 
When running from test/kms/, it picked up the Docker-targeted filer.toml which has dir="/data/filerdb" (a path that doesn't exist in CI), causing a fatal crash at filer store initialization. Fix by cd-ing to the data directory before starting weed mini. Also improve log visibility on failure. --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2026-03-26 14:07:01 -07:00
Chris Lu	3a3fff1399	Fix TUS chunked upload and resume failures (#8783) (#8786) * Fix TUS chunked upload and resume failures caused by request context cancellation (#8783) The filer's TCP connections use a 10-second inactivity timeout (net_timeout.go). After the TUS PATCH request body is fully consumed, internal operations (assigning file IDs via gRPC to the master, uploading data to volume servers, completing uploads) do not generate any activity on the client connection, so the inactivity timer fires and Go's HTTP server cancels the request context. This caused HTTP 500 errors on PATCH requests where body reading + internal processing exceeded the timeout. Fix by using context.WithoutCancel in TUS create and patch handlers, matching the existing pattern used by assignNewFileInfo. This ensures internal operations complete regardless of client connection state. Fixes seaweedfs/seaweedfs#8783 * Add comment to tusCreateHandler explaining context.WithoutCancel rationale * Run TUS integration tests on all PRs, not just TUS file changes The previous path filter meant these tests only ran when TUS-specific files changed. This allowed regressions from changes to shared infrastructure (net_timeout.go, upload paths, gRPC) to go undetected, which is exactly how the context cancellation bug in #8783 was missed. Matches the pattern used by s3-go-tests.yml.	2026-03-26 14:06:21 -07:00
dependabot[bot]	77e4b92432	build(deps): bump io.netty:netty-codec-http2 from 4.1.129.Final to 4.1.132.Final in /test/java/spark (#8785) build(deps): bump io.netty:netty-codec-http2 in /test/java/spark Bumps [io.netty:netty-codec-http2](https://github.com/netty/netty) from 4.1.129.Final to 4.1.132.Final. - [Release notes](https://github.com/netty/netty/releases) - [Commits](https://github.com/netty/netty/compare/netty-4.1.129.Final...netty-4.1.132.Final) --- updated-dependencies: - dependency-name: io.netty:netty-codec-http2 dependency-version: 4.1.132.Final dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-26 13:00:18 -07:00
dependabot[bot]	8558c586a0	build(deps): bump io.netty:netty-codec-http from 4.1.129.Final to 4.1.132.Final in /test/java/spark (#8784) build(deps): bump io.netty:netty-codec-http in /test/java/spark Bumps [io.netty:netty-codec-http](https://github.com/netty/netty) from 4.1.129.Final to 4.1.132.Final. - [Release notes](https://github.com/netty/netty/releases) - [Commits](https://github.com/netty/netty/compare/netty-4.1.129.Final...netty-4.1.132.Final) --- updated-dependencies: - dependency-name: io.netty:netty-codec-http dependency-version: 4.1.132.Final dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-26 13:00:03 -07:00
Lars Lehtonen	e8888765a1	fix(weed/filer/store_test): fix dropped errors (#8782)	2026-03-26 12:07:48 -07:00
Chris Lu	92c2fc0d52	Add insecure_skip_verify option for HTTPS client in security.toml (#8781) * Add -insecureSkipVerify flag and config option for filer.sync HTTPS connections When using filer.sync between clusters with different CAs (e.g., separate OpenShift clusters), TLS certificate verification fails with "x509: certificate signed by unknown authority". This adds two ways to skip TLS certificate verification: 1. CLI flag: `weed filer.sync -insecureSkipVerify ...` 2. Config option: `insecure_skip_verify = true` under [https.client] in security.toml Closes #8778 * Add insecure_skip_verify option for HTTPS client in security.toml When using filer.sync between clusters with different CAs (e.g., separate OpenShift clusters), TLS certificate verification fails. Adding insecure_skip_verify = true under [https.client] in security.toml allows skipping TLS certificate verification. The option is read during global HTTP client initialization so it applies to all HTTPS connections including filer.sync proxy reads and writes. Closes #8778 --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-26 11:42:47 -07:00
Chris Lu	aa12b51cbf	test: restore coverage removed in PR #8360 (#8779) * test: restore maintenance mode coverage in TestVolumeMarkReadonlyWritableErrorPaths PR #8360 removed the maintenance mode assertions because the refactored check ordering (volume lookup before maintenance check) caused the original test to hit "not found" instead of "maintenance mode": the test used a non-existent volume ID. Restore coverage by allocating a real volume, then verifying: - existing volume in maintenance mode returns "maintenance mode" - non-existent volume in maintenance mode still returns "not found" (validating the new check ordering) * test: add coverage for ScrubVolume MarkBrokenVolumesReadonly flag PR #8360 added the mark_broken_volumes_readonly field to ScrubVolumeRequest but no tests exercised the new logic paths. Add three integration tests: - HealthyVolume: flag is a no-op when scrub finds no broken volumes - CorruptVolume: corrupted .idx triggers broken detection; without the flag the volume stays writable, with the flag it becomes read-only - MaintenanceMode: makeVolumeReadonly fails under maintenance and ScrubVolume propagates the error via errors.Join * refactor: extract CorruptIndexFile and EnableMaintenanceMode test helpers Move duplicated idx corruption and maintenance mode setup into framework.CorruptIndexFile() and framework.EnableMaintenanceMode() helpers. Use defer for file close in the corruption helper.	2026-03-26 10:52:37 -07:00
Lisandro Pin	e5cf2d2a19	Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. (#8360) * Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. Also exposes this option in the shell `volume.scrub` command. * Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`. 417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`, so calling them for invalid volume IDs will actually yield that error, instead of checking maintenance mode first.	2026-03-26 10:20:57 -07:00
Jaehoon Kim	6cf34f2376	Add -filerExcludePathPattern flag and fix nil panic in -filerExcludeFileName (#8756 ) * Fix filerExcludeFileName to support directory names and path components The original implementation only matched excludeFileName against message.NewEntry.Name, which caused two issues: 1. Nil pointer panic on delete events (NewEntry is nil) 2. Files inside excluded directories were still backed up because the parent directory name was not checked This patch: - Checks all path components in resp.Directory against the regexp - Adds nil guard for message.NewEntry before accessing .Name - Also checks message.OldEntry.Name for rename/delete events Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add -filerExcludePathPattern flag and fix nil panic in filerExcludeFileName Separate concerns between two exclude mechanisms: - filerExcludeFileName: matches entry name only (leaf node) - filerExcludePathPattern (NEW): matches any path component via regexp, so files inside matched directories are also excluded Also fixes nil pointer panic when filerExcludeFileName encounters delete events where NewEntry is nil. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Refactor exclude logic: per-side exclusion for rename events, reduce duplication - Extract isEntryExcluded() to compute exclusion per old/new side, so rename events crossing an exclude boundary are handled as delete + create instead of being entirely skipped - Extract compileExcludePattern() to deduplicate regexp compilation - Replace strings.Split with allocation-free pathContainsMatch() - Check message.NewParentPath (not just resp.Directory) for new side Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Move regexp compilation out of retry loop to fail fast on config errors compileExcludePattern for -filerExcludeFileName and -filerExcludePathPattern are configuration-time validations that will never succeed on retry. 
Move them to runFilerBackup before the reconnect loop and use glog.Fatalf on failure, so invalid patterns are caught immediately at startup instead of being retried every 1.7 seconds indefinitely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add wildcard matching helpers for path and filename exclusion * Replace regexp exclude patterns with wildcard-based flags, deprecate -filerExcludeFileName Add -filerExcludeFileNames and -filerExcludePathPatterns flags that accept comma-separated wildcard patterns (*, ?) using the existing wildcard library. Mark -filerExcludeFileName as deprecated but keep its regexp behavior. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-03-26 10:04:06 -07:00
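The path-component exclusion described in the commit above can be sketched in Go. This is a rough illustration only: `splitPatterns` and `isPathExcluded` are hypothetical names, the standard library's `path.Match` is swapped in for the project's own wildcard library, and `strings.Split` is used for brevity where the commit describes an allocation-free walk.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitPatterns parses a comma-separated flag value into wildcard patterns,
// trimming whitespace and dropping empty items. Hypothetical helper.
func splitPatterns(flagValue string) []string {
	var patterns []string
	for _, p := range strings.Split(flagValue, ",") {
		if p = strings.TrimSpace(p); p != "" {
			patterns = append(patterns, p)
		}
	}
	return patterns
}

// isPathExcluded reports whether any component of fullPath matches any
// pattern, so files inside a matched directory are excluded as well.
// path.Match stands in for the wildcard library (supports * and ?).
func isPathExcluded(fullPath string, patterns []string) bool {
	for _, component := range strings.Split(fullPath, "/") {
		if component == "" {
			continue
		}
		for _, pattern := range patterns {
			if ok, err := path.Match(pattern, component); err == nil && ok {
				return true
			}
		}
	}
	return false
}

func main() {
	patterns := splitPatterns(".git, node_*")
	for _, p := range []string{"/home/u/.git/objects", "/a/node_modules/x", "/home/u/src/main.go"} {
		fmt.Printf("%s excluded=%v\n", p, isPathExcluded(p, patterns))
	}
}
```

Parsing the patterns once at startup, as the commit does for its compile step, keeps configuration errors out of the retry loop.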
Chris Lu	ccc662b90b	shell: add s3.bucket.access command for anonymous access policy (#8774 ) * shell: add s3.bucket.access command for anonymous access policy (#7738) Add a new weed shell command to view or change the anonymous access policy of an S3 bucket without external tools. Usage: s3.bucket.access -name <bucket> -access read,list s3.bucket.access -name <bucket> -access none Supported permissions: read, write, list. The command writes a standard bucket policy with Principal "" and warns if no anonymous IAM identity exists. shell: fix anonymous identity hint in s3.bucket.access warning The anonymous identity doesn't need IAM actions — the bucket policy controls what anonymous users can do. * shell: only warn about anonymous identity when write access is set Read and list operations use AuthWithPublicRead which evaluates bucket policies directly without requiring the anonymous identity. Only write operations go through the normal auth flow that needs it. * shell: rewrite s3.bucket.access to use IAM actions instead of bucket policies Replace the bucket policy approach with direct IAM identity actions, matching the s3.configure pattern. The user is auto-created if it does not exist. Usage: s3.bucket.access -name <bucket> -user anonymous -access Read,List s3.bucket.access -name <bucket> -user anonymous -access none s3.bucket.access -name <bucket> -user anonymous Actions are stored as "Action:bucket" on the identity, same as s3.configure -actions=Read -buckets=my-bucket. * shell: return flag parse errors instead of swallowing them * shell: normalize action names case-insensitively in s3.bucket.access Accept actions in any case (read, READ, Read) and normalize to canonical form (Read, Write, List, etc.) before storing. This matches the case-insensitive handling of "none" and avoids confusing rejections.	2026-03-25 23:09:53 -07:00
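The case-insensitive action normalization and the "Action:bucket" storage form from the commit above can be sketched as follows. The function names and the three-entry action table are assumptions for illustration (the commit lists "Read, Write, List, etc." without giving the full set); this is not the shell command's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalActions maps lower-cased input to canonical action names.
// Assumed subset: only the three names mentioned in the commit message.
var canonicalActions = map[string]string{
	"read":  "Read",
	"write": "Write",
	"list":  "List",
}

// normalizeActions converts a comma-separated list such as "read,LIST"
// into canonical names. "none" (in any case) yields an empty list.
func normalizeActions(input string) ([]string, error) {
	if strings.EqualFold(strings.TrimSpace(input), "none") {
		return nil, nil
	}
	var actions []string
	for _, raw := range strings.Split(input, ",") {
		canonical, ok := canonicalActions[strings.ToLower(strings.TrimSpace(raw))]
		if !ok {
			return nil, fmt.Errorf("unknown action %q", raw)
		}
		actions = append(actions, canonical)
	}
	return actions, nil
}

// bucketActions renders each action in the "Action:bucket" form.
func bucketActions(bucket string, actions []string) []string {
	out := make([]string, 0, len(actions))
	for _, a := range actions {
		out = append(out, a+":"+bucket)
	}
	return out
}

func main() {
	actions, err := normalizeActions("read, LIST")
	if err != nil {
		panic(err)
	}
	fmt.Println(bucketActions("my-bucket", actions))
}
```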
Chris Lu	67a551fd62	admin UI: add anonymous user creation checkbox (#8773) Add an "Anonymous" checkbox next to the username field in the Create User modal. When checked, the username is set to "anonymous" and the credential generation checkbox is disabled since anonymous users do not need keys. The checkbox is only shown when no anonymous user exists yet. The manage-access-keys button in the users table is hidden for the anonymous user.	2026-03-25 21:24:10 -07:00
Chris Lu	94bfa2b340	mount: stream all filer mutations over single ordered gRPC stream (#8770 ) * filer: add StreamMutateEntry bidi streaming RPC Add a bidirectional streaming RPC that carries all filer mutation types (create, update, delete, rename) over a single ordered stream. This eliminates per-request connection overhead for pipelined operations and guarantees mutation ordering within a stream. The server handler delegates each request to the existing unary handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy stream adapter for rename operations to reuse StreamRenameEntry logic. The is_last field signals completion for multi-response operations (rename sends multiple events per request; create/update/delete always send exactly one response with is_last=true). * mount: add streaming mutation multiplexer (streamMutateMux) Implement a client-side multiplexer that routes all filer mutation RPCs (create, update, delete, rename) over a single bidirectional gRPC stream. Multiple goroutines submit requests through a send channel; a dedicated sendLoop serializes them on the stream; a recvLoop dispatches responses to waiting callers via per-request channels. Key features: - Lazy stream opening on first use - Automatic reconnection on stream failure - Permanent fallback to unary RPCs if filer returns Unimplemented - Monotonic request_id for response correlation - Multi-response support for rename operations (is_last signaling) The mux is initialized on WFS and closed during unmount cleanup. No call sites use it yet — wiring comes in subsequent commits. * mount: route CreateEntry and UpdateEntry through streaming mux Wire all CreateEntry call sites to use wfs.streamCreateEntry() which routes through the StreamMutateEntry stream when available, falling back to unary RPCs otherwise. Also wire Link's UpdateEntry calls through wfs.streamUpdateEntry(). 
Updated call sites: - flushMetadataToFiler (file flush after write) - Mkdir (directory creation) - Symlink (symbolic link creation) - createRegularFile non-deferred path (Mknod) - flushFileMetadata (periodic metadata flush) - Link (hard link: update source + create link + rollback) * mount: route UpdateEntry and DeleteEntry through streaming mux Wire remaining mutation call sites through the streaming mux: - saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry - Unlink → streamDeleteEntry (replaces RemoveWithResponse) - Rmdir → streamDeleteEntry (replaces RemoveWithResponse) All filer mutations except Rename now go through StreamMutateEntry when the filer supports it, with automatic unary RPC fallback. * mount: route Rename through streaming mux Wire Rename to use streamMutate.Rename() when available, with fallback to the existing StreamRenameEntry unary stream. The streaming mux sends rename as a StreamRenameEntryRequest oneof variant. The server processes it through the existing rename logic and sends multiple StreamRenameEntryResponse events (one per moved entry), with is_last=true on the final response. All filer mutations now go through a single ordered stream. * mount: fix stream mux connection ownership WithGrpcClient(streamingMode=true) closes the gRPC connection when the callback returns, destroying the stream. Own the connection directly via pb.GrpcDial so it stays alive for the stream's lifetime. Close it explicitly in recvLoop on stream failure and in Close on shutdown. * mount: fix rename failure for deferred-create files Three fixes for rename operations over the streaming mux: 1. lookupEntry: fall back to local metadata store when filer returns "not found" for entries in uncached directories. Files created with deferFilerCreate=true exist only in the local leveldb store until flushed; lookupEntry skipped the local store when the parent directory had never been readdir'd, causing rename to fail with ENOENT. 2. 
Rename: wait for pending async flushes and force synchronous flush of dirty metadata before sending rename to the filer. Covers the writebackCache case where close() defers the flush to a background worker that may not complete before rename fires. 3. StreamMutateEntry: propagate rename errors from server to client. Add error/errno fields to StreamMutateEntryResponse so the mount can map filer errors to correct FUSE status codes instead of silently returning OK. Also fix the existing Rename error handler which could return fuse.OK on unrecognized errors. * mount: fix streaming mux error handling, sendLoop lifecycle, and fallback Address PR review comments: 1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for create/update/delete errors, not just rename. Previously update errors were silently dropped and create/delete errors were only in nested response fields that the client didn't check. 2. Client: check nested error fields in CreateEntry (ErrorCode, Error) and DeleteEntry (Error) responses, matching CreateEntryWithResponse behavior. 3. Fix sendLoop lifecycle: give each stream generation a stopSend channel. recvLoop closes it on error to stop the paired sendLoop. Previously a reconnect left the old sendLoop draining sendCh, breaking ordering. 4. Transparent fallback: stream helpers and doRename fall back to unary RPCs on transport errors (ErrStreamTransport), including the first Unimplemented from ensureStream. Previously the first call failed instead of degrading. 5. Filer rotation in openStream: try all filer addresses on dial failure, matching WithFilerClient behavior. Stop early on Unimplemented. 6. Pass metadata-bearing context to StreamMutateEntry RPC call so sw-client-id header is actually sent. 7. Gate lookupEntry local-cache fallback on open dirty handle or pending async flush to avoid resurrecting deleted/renamed entries. 8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil). 9. 
Use string matching for rename error-to-errno mapping in the mount to stay portable across Linux/macOS (numeric errno values differ). * mount: make failAllPending idempotent with delete-before-close Change failAllPending to collect pending entries into a local slice (deleting from the sync.Map first) before closing channels. This prevents double-close panics if called concurrently. Also remove the unused err parameter. * mount: add stream generation tracking and teardownStream Introduce a generation counter on streamMutateMux that increments each time a new stream is created. Requests carry the generation they were enqueued for so sendLoop can reject stale requests after reconnect. Add teardownStream(gen) which is idempotent (only acts when gen matches current generation and stream is non-nil). Both sendLoop and recvLoop call it on error, replacing the inline cleanup in recvLoop. sendLoop now actively triggers teardown on send errors instead of silently exiting. ensureStream waits for the prior generation's recvDone before creating a new stream, ensuring all old pending waiters are failed before reconnect. recvLoop now takes the stream, generation, and recvDone channel as parameters to avoid accessing shared fields without the lock. * mount: harden Close to prevent races with teardownStream Nil out stream, cancel, and grpcConn under the lock so that any concurrent teardownStream call from recvLoop/sendLoop becomes a no-op. Call failAllPending before closing sendCh to unblock waiters promptly. Guard recvDone with a nil check for the case where Close is called before any stream was ever opened. * mount: make errCh receive ctx-aware in doUnary and Rename Replace the blocking <-sendReq.errCh with a select that also observes ctx.Done(). If sendLoop exits via stopSend without consuming a buffered request, the caller now returns ctx.Err() instead of blocking forever. The buffered errCh (capacity 1) ensures late acknowledgements from sendLoop don't block the sender. 
* mount: fix sendLoop/Close race and recvLoop/teardown pending channel race Three related fixes: 1. Stop closing sendCh in Close(). Closing the shared producer channel races with callers who passed ensureStream() but haven't sent yet, causing send-on-closed-channel panics. sendCh is now left open; ensureStream checks m.closed to reject new callers. 2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh() on exit so buffered requests get an ErrStreamTransport on their errCh instead of blocking forever. Close() drains again for any stragglers enqueued between sendLoop's drain and the final shutdown. 3. Move failAllPending from teardownStream into recvLoop's defer. teardownStream (called from sendLoop on send error) was closing pending response channels while recvLoop could be between pending.Load and the channel send — a send-on-closed-channel panic. recvLoop is now the sole closer of pending channels, eliminating the race. Close() waits on recvDone (with cancel() to guarantee Recv unblocks) so pending cleanup always completes. * filer/mount: add debug logging for hardlink lifecycle Add V(0) logging at every point where a HardLinkId is created, stored, read, or deleted to trace orphaned hardlink references. Logging covers: - gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId - FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId - handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count - setHardLink: KvPut with blob size - maybeReadHardLink: V(1) on read attempt and successful decode - DeleteHardLink: counter decrement/deletion events - Mount Link(): when NewHardLinkId is generated and link is created This helps diagnose how a git pack .rev file ended up with a HardLinkId during a clone (no hard links should be involved). * test: add git clone/pull integration test for FUSE mount Shell script that exercises git operations on a SeaweedFS mount: 1. Creates a bare repo on the mount 2. 
Clones locally, makes 3 commits, pushes to mount 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (files, content, commit hashes) 5. Pushes 2 more commits with renames and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real changes 8. Verifies file content, renames, deletes after pull 9. Checks git log integrity and clean status 27 assertions covering file existence, content, commit hashes, file counts, renames, deletes, and git status. Run against any existing mount: bash test-git-on-mount.sh /path/to/mount * test: add git clone/pull FUSE integration test to CI suite Add TestGitOperations to the existing fuse_integration test framework. The test exercises git's full file operation surface on the mount: 1. Creates a bare repo on the mount (acts as remote) 2. Clones locally, makes 3 commits (files, bulk data, renames), pushes 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (content, commit hash, file count) 5. Pushes 2 more commits with new files, renames, and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real fast-forward changes 8. Verifies post-pull state: content, renames, deletes, file counts 9. Checks git log integrity (5 commits) and clean status Runs automatically in the existing fuse-integration.yml CI workflow. * mount: fix permission check with uid/gid mapping The permission checks in createRegularFile() and Access() compared the caller's local uid/gid against the entry's filer-side uid/gid without applying the uid/gid mapper. With -map.uid 501:0, a directory created as uid 0 on the filer would not match the local caller uid 501, causing hasAccess() to fall through to "other" permission bits and reject write access (0755 → other has r-x, no w). Fix: map entry uid/gid from filer-space to local-space before the hasAccess() call so both sides are in the same namespace. 
This fixes rsync -a failing with "Permission denied" on mkstempat when using uid/gid mapping. * mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to restore local uid/gid, but `outputPbEntry` writes the kernel response before the function returns — so the kernel received filer-side uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent child operations (mkdir, create) because the caller uid (501) doesn't match the directory owner (0), and "other" bits (0755 → r-x) lack write permission. Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal before outputPbEntry, so the kernel gets local uid/gid. Also add nil guards for UidGidMapper in Access and createRegularFile to prevent panics in tests that don't configure a mapper. This fixes rsync -a "Permission denied" on mkpathat for nested directories when using uid/gid mapping. * mount: fix Link outputting filer-side uid/gid to kernel, add nil guards Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink: the kernel received filer-side uid/gid because the defer hadn't run yet when outputPbEntry wrote the response. Also add nil guards for UidGidMapper in Access and createRegularFile so tests without a mapper don't panic. 
Audit of all outputPbEntry/outputFilerEntry call sites: - Mkdir: fixed in prior commit (explicit map before output) - Symlink: fixed in prior commit (explicit map before output) - Link: fixed here (explicit map before output) - Create (existing file): entry from maybeLoadEntry (already mapped) - Create (deferred): entry has local uid/gid (never mapped to filer) - Create (non-deferred): createRegularFile defer runs before return - Mknod: createRegularFile defer runs before return - Lookup: entry from lookupEntry (already mapped) - GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped) - readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped) - saveEntry: no kernel output - flushMetadataToFiler: no kernel output - flushFileMetadata: no kernel output * test: fix git test for same-filesystem FUSE clone When both the bare repo and working clone live on the same FUSE mount, git's local transport uses hardlinks and cross-repo stat calls that fail on FUSE. Fix: - Use --no-local on clone to disable local transport optimizations - Use reset --hard instead of checkout to stay on branch - Use fetch + reset --hard origin/<branch> instead of git pull to avoid local transport stat failures during fetch * adjust logging * test: use plain git clone/pull to exercise real FUSE behavior Remove --no-local and fetch+reset workarounds. The test should use the same git commands users run (clone, reset --hard, pull) so it reveals real FUSE issues rather than hiding them. 
* test: enable V(1) logging for filer/mount and collect logs on failure - Run filer and mount with -v=1 so hardlink lifecycle logs (V(0): create/delete/insert, V(1): read attempts) are captured - On test failure, automatically dump last 16KB of all process logs (master, volume, filer, mount) to test output - Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload - Update CI workflow to upload SeaweedFS process logs alongside test output * mount: clone entry for filer flush to prevent uid/gid race flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which returns the file handle's live proto entry pointer, then mutated it in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable) could observe filer-side uid/gid (e.g., 0:0) on the file handle entry and return it to the kernel. The kernel caches these attributes, so subsequent opens by the local user (uid 501) fail with EACCES. Fix: proto.Clone the entry before mapping uid/gid for the filer request. The file handle's live entry is never mutated, so concurrent Lookup always sees local uid/gid. This fixes the intermittent "Permission denied" on .git/FETCH_HEAD after the first git pull on a mount with uid/gid mapping. * mount: add debug logging for stale lock file investigation Add V(0) logging to trace the HEAD.lock recreation issue: - Create: log when O_EXCL fails (file already exists) with uid/gid/mode - completeAsyncFlush: log resolved path, saved path, dirtyMetadata, isDeleted at entry to trace whether async flush fires after rename - flushMetadataToFiler: log the dir/name/fullpath being flushed This will show whether the async flush is recreating the lock file after git renames HEAD.lock → HEAD. 
* mount: prevent async flush from recreating renamed .lock files When git renames HEAD.lock → HEAD, the async flush from the prior close() can run AFTER the rename and re-insert HEAD.lock into the meta cache via its CreateEntryRequest response event. The next git pull then sees HEAD.lock and fails with "File exists". Fix: add isRenamed flag on FileHandle, set by Rename before waiting for the pending async flush. The async flush checks this flag and skips the metadata flush for renamed files (same pattern as isDeleted for unlinked files). The data pages still flush normally. The Rename handler flushes deferred metadata synchronously (Case 1) before setting isRenamed, ensuring the entry exists on the filer for the rename to proceed. For already-released handles (Case 2), the entry was created by a prior flush. * mount: also mark renamed inodes via entry.Attributes.Inode fallback When GetInode fails (Forget already removed the inode mapping), the Rename handler couldn't find the pending async flush to set isRenamed. The async flush then recreated the .lock file on the filer. Fix: fall back to oldEntry.Attributes.Inode to find the pending async flush when the inode-to-path mapping is gone. Also extract MarkInodeRenamed into a method on FileHandleToInode for clarity. * mount: skip async metadata flush when saved path no longer maps to inode The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock because neither GetInode nor oldEntry.Attributes.Inode could find the inode (Forget already evicted the mapping, and the entry's stored inode was 0). Add a direct check in completeAsyncFlush: before flushing metadata, verify that the saved path still maps to this inode in the inode-to-path table. If the path was renamed or removed (inode mismatch or not found), skip the metadata flush to avoid recreating a stale entry. This catches all rename cases regardless of whether the Rename handler could set the isRenamed flag. 
* mount: wait for pending async flush in Unlink before filer delete Unlink was deleting the filer entry first, then marking the draining async-flush handle as deleted. The async flush worker could race between these two operations and recreate the just-unlinked entry on the filer. This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to persist after git pull, breaking subsequent git operations. Move the isDeleted marking and add waitForPendingAsyncFlush() before the filer delete so any in-flight flush completes first. Even if the worker raced past the isDeleted check, the wait ensures it finishes before the filer delete cleans up any recreated entry. * mount: reduce async flush and metadata flush log verbosity Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for every file close with writebackCache and are too noisy for normal use. * filer: reduce hardlink debug log verbosity from V(0) to V(4) HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and filer_grpc_server fire on every hardlinked file operation (git pack files use hardlinks extensively) and produce excessive noise. * mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check - weedfs_link.go: hardlink creation logs V(0) → V(4) - weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1) - empty_folder_cleaner.go: "not empty" check log V(0) → V(4) * filer: handle missing hardlink KV as expected, not error A "kv: not found" on hardlink read is normal when the link blob was already cleaned up but a stale entry still references it. Log at V(1) for not-found; keep Error level for actual KV failures. * test: add waitForDir before git pull in FUSE git operations test After git reset --hard, the FUSE mount's metadata cache may need a moment to settle on slow CI. The git pull subprocess (unpack-objects) could fail to stat the working directory. Poll for up to 5s. 
* Update git_operations_test.go * wait * test: simplify FUSE test framework to use weed mini Replace the 4-process setup (master + volume + filer + mount) with 2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies startup, reduces port allocation, and is faster on CI. * test: fix mini flag -admin → -admin.ui	2026-03-25 20:06:34 -07:00
Chris Lu	29bdbb3c48	filer.sync: replace O(n) conflict check with O(depth) index lookups (#8772 ) * filer.sync: replace O(n) conflict check with O(depth) index lookups The MetadataProcessor.conflictsWith() scanned all active jobs linearly for every new event dispatch. At high concurrency (256-1024), this O(n) scan under the activeJobsLock became a bottleneck that throttled the event dispatch pipeline, negating the benefit of higher -concurrency values. Replace the linear scan with three index maps: - activeFilePaths: O(1) exact file path lookup - activeDirPaths: O(1) directory path lookup per ancestor - descendantCount: O(1) check for active jobs under a directory Conflict check is now O(depth) where depth is the path depth (typically 3-6 levels), constant regardless of active job count. Benchmark confirms ~81ns per check whether there are 32 or 1024 active jobs. Also replace the O(n) watermark scan with minActiveTs tracking so non-oldest job completions are O(1). Ref: #8771 * filer.sync: replace O(n) watermark rescan with min-heap lazy deletion Address review feedback: - Replace minActiveTs O(n) rescan with a tsMinHeap using lazy deletion. Each TsNs is pushed once and popped once, giving O(log n) amortized watermark tracking regardless of completion order. - Fix benchmark to consume conflictsWith result via package-level sink variable to prevent compiler elision. The watermark advancement semantics (conservative, sets to completing job's TsNs) are unchanged from the original code. This is intentionally safe for idempotent replay on restart.	2026-03-25 15:43:25 -07:00
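The O(depth) conflict check described in the commit above can be sketched roughly as follows. The three map names come from the commit message; the `pathIndex` type, the helper functions, and the exact conflict rules (same path, active ancestor directory, or a directory with active descendants) are assumptions made for illustration and may differ from MetadataProcessor's real semantics.

```go
package main

import (
	"fmt"
	"strings"
)

// pathIndex is a hypothetical container for the index maps named in the commit.
type pathIndex struct {
	activeFilePaths map[string]int // active jobs on exactly this file path
	activeDirPaths  map[string]int // active jobs on exactly this directory path
	descendantCount map[string]int // active jobs strictly below this directory
}

func newPathIndex() *pathIndex {
	return &pathIndex{map[string]int{}, map[string]int{}, map[string]int{}}
}

// ancestors returns every proper ancestor of p: "/a/b/c" -> "/a/b", "/a", "/".
func ancestors(p string) []string {
	var out []string
	for p != "/" && p != "" {
		i := strings.LastIndex(p, "/")
		if i <= 0 {
			out = append(out, "/")
			break
		}
		p = p[:i]
		out = append(out, p)
	}
	return out
}

// bump adjusts a counter and drops the key once it reaches zero.
func bump(m map[string]int, k string, delta int) {
	m[k] += delta
	if m[k] <= 0 {
		delete(m, k)
	}
}

// add registers (delta=+1) or completes (delta=-1) a job in O(depth).
func (x *pathIndex) add(p string, isDir bool, delta int) {
	if isDir {
		bump(x.activeDirPaths, p, delta)
	} else {
		bump(x.activeFilePaths, p, delta)
	}
	for _, a := range ancestors(p) {
		bump(x.descendantCount, a, delta)
	}
}

// conflictsWith does one map lookup per path level, independent of job count.
func (x *pathIndex) conflictsWith(p string, isDir bool) bool {
	if x.activeFilePaths[p] > 0 || x.activeDirPaths[p] > 0 {
		return true
	}
	for _, a := range ancestors(p) {
		if x.activeDirPaths[a] > 0 {
			return true
		}
	}
	return isDir && x.descendantCount[p] > 0
}

func main() {
	x := newPathIndex()
	x.add("/buckets/a/file1", false, +1)
	fmt.Println(x.conflictsWith("/buckets/a/file1", false)) // same file
	fmt.Println(x.conflictsWith("/buckets/a", true))        // directory with an active descendant
	fmt.Println(x.conflictsWith("/buckets/b/file2", false)) // unrelated path
	x.add("/buckets/a/file1", false, -1)
	fmt.Println(x.conflictsWith("/buckets/a", true)) // job finished
}
```

Keeping counters rather than booleans lets several jobs share an ancestor without the first completion clearing the entry for the others.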
Andreas Røste	79f4a4579f	feat(k8s): added possibility to specify service.type for multiple ser… (#8372) * feat(k8s): added possibility to specify service.type for multiple services in helm chart * fix(k8s): removed headless (clusterIP: None) from services * fix(k8s): keep master and filer services headless for StatefulSet compatibility Master and filer services must remain headless (clusterIP: None) because their StatefulSets reference them via serviceName for stable pod DNS. Revert the service.type change for these two services and remove their unused service config from values.yaml. S3 and SFTP remain configurable. --------- Co-authored-by: Andreas Røste <andreas2101@gmail.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-03-25 11:30:14 -07:00
Chris Lu	e47054a7e7	mount: improve small file write performance (#8769 ) * mount: defer file creation gRPC to flush time for faster small file writes When creating a file via FUSE Create(), skip the synchronous gRPC CreateEntry call to the filer. Instead, allocate the inode and build the entry locally, deferring the filer create to the Flush/Release path where flushMetadataToFiler already sends a CreateEntry with chunk data. This eliminates one synchronous gRPC round-trip per file during creation. For workloads with many small files (e.g. 30K files), this reduces the per-file overhead from ~2 gRPC calls to ~1. Mknod retains synchronous filer creation since it has no file handle and thus no flush path. * mount: use bounded worker pool for async flush operations Replace unbounded goroutine spawning in writebackCache async flush with a fixed-size worker pool backed by a channel. When many files are closed rapidly (e.g., cp -r of 30K files), the previous approach spawned one goroutine per file, leading to resource contention on gRPC/HTTP connections and high goroutine overhead. The worker pool size matches ConcurrentWriters (default 128), which provides good parallelism while bounding resource usage. Work items are queued into a buffered channel and processed by persistent worker goroutines. * mount: fix deferred create cache visibility and async flush race Three fixes for the deferred create and async flush changes: 1. Insert a local placeholder entry into the metadata cache during deferred file creation so that maybeLoadEntry() can find the file for duplicate-create checks, stat, and readdir. Uses InsertEntry directly (not applyLocalMetadataEvent) to avoid triggering the directory hot-threshold eviction that would wipe the entry. 2. Fix race in ReleaseHandle where asyncFlushWg.Add(1) and the channel send happened after pendingAsyncFlushMu was unlocked. A concurrent WaitForAsyncFlush could observe a zero counter, close the channel, and cause a send-on-closed panic. 
Move Add(1) before the unlock; keep the send after unlock to avoid deadlock with workers that acquire the same mutex during cleanup. 3. Update TestCreateCreatesAndOpensFile to flush the file handle before verifying the CreateEntry gRPC call, since file creation is now deferred to flush time.	2026-03-24 20:31:53 -07:00
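The bounded worker pool and the "count before send" ordering from the commit above can be sketched like this. It is a simplified model under stated assumptions: `flushPool`, `submit`, and `shutdown` are hypothetical names, and the real mount additionally guards the queue with pendingAsyncFlushMu, which is omitted here.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flushPool runs queued jobs on a fixed number of persistent workers.
type flushPool struct {
	work chan func()
	wg   sync.WaitGroup // counts queued-but-unfinished jobs
}

func newFlushPool(workers, queue int) *flushPool {
	p := &flushPool{work: make(chan func(), queue)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range p.work {
				job()
				p.wg.Done()
			}
		}()
	}
	return p
}

// submit counts the job BEFORE sending it, so a concurrent shutdown
// can never observe a zero counter while a send is still pending.
func (p *flushPool) submit(job func()) {
	p.wg.Add(1)
	p.work <- job
}

// shutdown waits for all counted jobs, then stops the workers.
// Callers must not submit after shutdown has started.
func (p *flushPool) shutdown() {
	p.wg.Wait()
	close(p.work)
}

// runFlushes pushes n no-op "flushes" through the pool and returns how many ran.
func runFlushes(n, workers int) int64 {
	p := newFlushPool(workers, 16)
	var flushed int64
	for i := 0; i < n; i++ {
		p.submit(func() { atomic.AddInt64(&flushed, 1) })
	}
	p.shutdown()
	return atomic.LoadInt64(&flushed)
}

func main() {
	fmt.Println(runFlushes(1000, 4))
}
```

With a fixed worker count, closing many files at once queues work instead of spawning one goroutine per file, which is the resource bound the commit is after.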
Chris Lu	28fe92065a	S3: reject part uploads after AbortMultipartUpload (#8768) * S3: reject part uploads after AbortMultipartUpload PutObjectPartHandler did not verify that the multipart upload session still exists before accepting parts. After AbortMultipartUpload deleted the upload directory, the ErrNotFound from getEntry was silently ignored (treated as "may be non-SSE upload"), allowing parts to be stored as orphaned files. Now return ErrNoSuchUpload when the upload directory is not found, matching AWS S3 behavior. Fixes #8766 * S3: check upload existence unconditionally in PutObjectPartHandler Move the getEntry call out of the SSE-type conditional so the upload existence check runs for all part uploads, including SSE-C. Previously the SSE-C path skipped the check entirely, allowing parts to be uploaded after abort when SSE-C headers were present. Also flattens the nested SSE branching by one level now that getEntry is called once upfront. * S3: address PR review feedback for PutObjectPartHandler - Log at error level when getEntry fails with an unexpected error, since we return ErrInternalError to the client - Distinguish base IV decode errors from length validation failures with separate, clearer error messages --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-24 18:11:51 -07:00
Chris Lu	0b3867dca3	filer: add structured error codes to CreateEntryResponse (#8767 ) * filer: add FilerError enum and error_code field to CreateEntryResponse Add a machine-readable error code alongside the existing string error field. This follows the precedent set by PublishMessageResponse in the MQ broker proto. The string field is kept for human readability and backward compatibility. Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE, EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS. * filer: add sentinel errors and error code mapping in filer_pb Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in the filer_pb package so both the filer and consumers can reference them without circular imports. Add FilerErrorToSentinel() to map proto error codes to sentinels, and update CreateEntryWithResponse() to check error_code first, falling back to the string-based path for backward compatibility with old servers. * filer: return wrapped sentinel errors and set proto error codes Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using %w). This preserves errors.Is() traversal on the server side. In the gRPC CreateEntry handler, map sentinel errors to the corresponding FilerError proto codes using errors.Is(), setting both resp.Error (string, for backward compat) and resp.ErrorCode (enum). * S3: use errors.Is() with filer sentinels instead of string matching Replace fragile string-based error matching in filerErrorToS3Error and other S3 API consumers with errors.Is() checks against filer_pb sentinel errors. This works because the updated CreateEntryWithResponse helper reconstructs sentinel errors from the proto FilerError code. Update iceberg stage_create and metadata_files to check resp.ErrorCode instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is() for the already-exists check. 
String matching is retained only for non-filer errors (gRPC transport errors, checksum validation) that don't go through CreateEntryResponse. * filer: remove backward-compat string fallbacks for error codes Clients and servers are always deployed together, so there is no need for backward-compatibility fallback paths that parse resp.Error strings when resp.ErrorCode is unset. Simplify all consumers to rely solely on the structured error code. * iceberg: ensure unknown non-OK error codes are not silently ignored When FilerErrorToSentinel returns nil for an unrecognized error code, return an error including the code and message rather than falling through to return nil. * filer: fix redundant error message and restore error wrapping in helper Use request path instead of resp.Error in the sentinel error format string to avoid duplicating the sentinel message (e.g. "entry already exists: entry already exists"). Restore %w wrapping with errors.New() in the fallback paths so callers can use errors.Is()/errors.As(). * filer: promote file to directory on path conflict instead of erroring S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object) to coexist because S3 has a flat key space. When ensureParentDirectoryEntry finds a parent path that is a file instead of a directory, promote it to a directory by setting ModeDir while preserving the original content and chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry type-change guard. This fixes the S3 compatibility test failures where creating overlapping keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.	2026-03-24 17:08:22 -07:00
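The sentinel-error round trip described in the commit above (wrap with `%w` on the server, map to an enum code, reconstruct the sentinel on the client) can be sketched as below. The sentinel names ErrEntryNameTooLong and ErrParentIsFile appear in the commit message; the `FilerError` constants' Go spellings, the error strings, `ErrEntryAlreadyExists`, and the helper names are assumptions, and only a subset of the codes is shown.

```go
package main

import (
	"errors"
	"fmt"
)

// FilerError mirrors (a subset of) the proto enum; the Go names are assumed.
type FilerError int

const (
	OK FilerError = iota
	EntryNameTooLong
	ParentIsFile
	EntryAlreadyExists
)

// Sentinel errors; the message strings are illustrative.
var (
	ErrEntryNameTooLong   = errors.New("entry name too long")
	ErrParentIsFile       = errors.New("parent is a file")
	ErrEntryAlreadyExists = errors.New("entry already exists")
)

// codeFor is the server side: errors.Is sees through %w wrapping.
// Unmapped errors yield OK here and would travel in the string field only.
func codeFor(err error) FilerError {
	switch {
	case errors.Is(err, ErrEntryNameTooLong):
		return EntryNameTooLong
	case errors.Is(err, ErrParentIsFile):
		return ParentIsFile
	case errors.Is(err, ErrEntryAlreadyExists):
		return EntryAlreadyExists
	}
	return OK
}

// sentinelFor is the client side: rebuild the sentinel from the wire code.
func sentinelFor(code FilerError) error {
	switch code {
	case EntryNameTooLong:
		return ErrEntryNameTooLong
	case ParentIsFile:
		return ErrParentIsFile
	case EntryAlreadyExists:
		return ErrEntryAlreadyExists
	}
	return nil
}

// wrapExists builds a server-side error that names the path, not the message twice.
func wrapExists(path string) error {
	return fmt.Errorf("%s: %w", path, ErrEntryAlreadyExists)
}

func main() {
	serverErr := wrapExists("/buckets/b/key")
	clientErr := sentinelFor(codeFor(serverErr))
	fmt.Println(serverErr)
	fmt.Println(errors.Is(clientErr, ErrEntryAlreadyExists))
}
```

Because the code, not the string, crosses the gRPC boundary, consumers such as the S3 layer can use `errors.Is` regardless of any wrapper prefix added along the way.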
Chris Lu	152884eff2	S3: add s3: prefix to x-amz-* condition keys for AWS compatibility (#8765) AWS S3 policy conditions reference request headers with the s3: namespace prefix (e.g., s3:x-amz-server-side-encryption). The extraction code was storing these headers without the prefix, so bucket policy conditions using the standard AWS key names would never match.	2026-03-24 14:04:42 -07:00
Chris Lu	2877febd73	S3: fix silent PutObject failure and enforce 1024-byte key limit (#8764 ) * S3: add KeyTooLongError error code Add ErrKeyTooLongError (HTTP 400, code "KeyTooLongError") to match the standard AWS S3 error for object keys that exceed length limits. * S3: fix silent PutObject failure when entry name exceeds max_file_name_length putToFiler called client.CreateEntry() directly and discarded the gRPC response. The filer embeds application errors like "entry name too long" in resp.Error (not as gRPC transport errors), so the error was silently swallowed and clients received HTTP 200 with an ETag for objects that were never stored. Switch to the filer_pb.CreateEntry() helper which properly checks resp.Error, and map "entry name too long" to KeyTooLongError (HTTP 400). To avoid fragile string parsing across the gRPC boundary, define shared error message constants in weed/util/constants and use them in both the filer (producing errors) and S3 API (matching errors). Switch filerErrorToS3Error to use strings.Contains/HasSuffix with these constants so matches work regardless of any wrapper prefix. Apply filerErrorToS3Error to the mkdir path for directory markers. Fixes #8759 * S3: enforce 1024-byte maximum object key length AWS S3 limits object keys to 1024 bytes. Add early validation on write paths (PutObject, CopyObject, CreateMultipartUpload) to reject keys exceeding the limit with the standard KeyTooLongError (HTTP 400). The key length check runs before bucket auto-creation to prevent overlong keys from triggering unnecessary side effects. Also use filerErrorToS3Error for CopyObject's mkFile error paths so name-too-long errors from the filer return KeyTooLongError instead of InternalError. Ref #8758 * S3: add handler-level tests for key length validation and error mapping Add tests for filerErrorToS3Error mapping "entry name too long" to KeyTooLongError, including a regression test for the CreateEntry-prefixed "existing ... is a directory" form. 
Add handler-level integration tests that exercise PutObjectHandler, CopyObjectHandler, and NewMultipartUploadHandler via httptest, verifying HTTP 400 and KeyTooLongError XML response for overlong keys and acceptance of keys at the 1024-byte limit.	2026-03-24 13:35:28 -07:00
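The 1024-byte key limit from the commit above is a byte limit, not a character limit, which a short sketch makes concrete. `maxKeyBytes`, `keyTooLong`, and `makeKey` are hypothetical names; the real handlers return KeyTooLongError (HTTP 400) rather than a boolean.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxKeyBytes is the AWS S3 object key limit, measured in UTF-8 bytes.
const maxKeyBytes = 1024

// keyTooLong relies on len(string) counting bytes in Go.
func keyTooLong(key string) bool {
	return len(key) > maxKeyBytes
}

// makeKey builds an ASCII key of exactly n bytes.
func makeKey(n int) string {
	return strings.Repeat("a", n)
}

func main() {
	fmt.Println(keyTooLong(makeKey(1024)), keyTooLong(makeKey(1025)))

	// 600 two-byte runes: under 1024 characters, but 1200 bytes.
	multi := strings.Repeat("\u00e9", 600)
	fmt.Println(utf8.RuneCountInString(multi), len(multi), keyTooLong(multi))
}
```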
Chris Lu	7f3f61ea28	fix: resolve Kafka gateway response deadlocks causing Sarama client hangs (#8762 ) * fix: resolve Kafka gateway response deadlocks causing Sarama client hangs Fix three bugs in the Kafka protocol handler that caused sequential clients (notably Sarama) to hang during E2E tests: 1. Race condition in correlation queue ordering: the correlation ID was added to the response ordering queue AFTER sending the request to the processing channel. A fast processor (e.g. ApiVersions) could finish and send its response before the ID was in the queue, causing the response writer to miss it — permanently deadlocking the connection. Now the ID is added BEFORE the channel send, with error response injection on send failure. 2. Silent error response drops: when processRequestSync returned an error, the response writer logged it but never sent anything back to the client. The client would block forever waiting for bytes that never arrived. Now sends a Kafka UNKNOWN_SERVER_ERROR response. 3. Produce V0/V1 missing timeout_ms parsing: the handler skipped the 4-byte timeout field, reading it as topicsCount instead. This caused incorrect parsing of the entire produce request for V0/V1 clients. * fix: API-versioned error responses, unsupported-version queue fix, V0V1 header alignment 1. errors.go — BuildAPIErrorResponse: emits a minimal-but-valid error body whose layout matches the schema the client expects for each API key and version (throttle_time position, array fields, etc.). The old 2-byte generic body corrupted the protocol stream for APIs whose response begins with throttle_time_ms or an array. 2. handler.go — unsupported-version path: the correlationID was never added to correlationQueue before sending to responseChan, so the response writer could never match it and the client hung. Now appends the ID under correlationQueueMu before the send. 3. produce.go — handleProduceV0V1: requestBody is already post-header (HandleConn strips client_id). 
The handler was erroneously parsing acks bytes as a client_id length, misaligning all subsequent field reads. Removed the client_id parsing; offset now starts at 0 with acks(2) + timeout_ms(4) + topicsCount(4), matching handleProduceV2Plus. * fix: free pooled message buffer per-iteration instead of deferring The read loop allocated messageBuf via mem.Allocate and deferred mem.Free. Since the defer only runs when HandleConn returns, pool buffers accumulated for the entire connection lifetime — one per request. Worse, the deferred frees ran in LIFO order before wg.Wait(), so processing goroutines could read from already-freed pool buffers. Now: read into a pooled buffer, immediately copy to Go-managed memory, and return the pool buffer. messageBuf is a regular slice safe for async goroutine access with no defer accumulation. * fix: cancel context before wg.Wait and on worker response-send timeout Two related issues: 1. Cleanup defer ordering deadlock: defers run LIFO — the cleanup defer (close channels, wg.Wait) ran before the cancel() defer. The response writer is in the WaitGroup and exits only on ctx.Done() or responseChan close, but both signals came after wg.Wait(). Deadlock on every normal connection close (EOF, read error, queue-full). Fix: call cancel() at the start of the cleanup defer, before wg.Wait(). 2. Worker 5s response-send timeout: when the timeout fired, the response was silently dropped but the correlationID remained in the ordered queue. The response writer could never advance past it, stalling all subsequent responses permanently. Fix: call cancel() to tear down the connection — if we cannot deliver a response in 5s the connection is irrecoverable. * chore: remove empty no-op ListOffsets conditional The `if apiKey == 2 {}` block had no body — leftover debug code. ListOffsets routing is handled by isDataPlaneAPI (returns false, sending it to the control channel). No behavior change.	2026-03-24 13:17:25 -07:00
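The defer-ordering deadlock described in the commit above comes down to calling `cancel()` before `wg.Wait()`. The following is a minimal sketch of the fixed shape, assuming a response writer that exits only on `ctx.Done()`; `serve` and `finishesWithin` are hypothetical names and the channel handling of the real handler is left out.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// serve mimics a connection handler whose response writer exits only on ctx.Done().
func serve() {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // stand-in for the response writer
		defer wg.Done()
		<-ctx.Done()
	}()

	// One cleanup defer: cancel first, then wait. If `defer cancel()` were
	// registered earlier as its own defer, LIFO order would run wg.Wait()
	// before cancel() and this function would never return.
	defer func() {
		cancel()
		wg.Wait()
	}()
}

// finishesWithin reports whether serve returns before the timeout (milliseconds).
func finishesWithin(ms int) bool {
	done := make(chan struct{})
	go func() {
		serve()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Duration(ms) * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println(finishesWithin(2000))
}
```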
Chris Lu	6c35a3724a	weed/mount: simplify metadata flush retry returns (#8763)	2026-03-24 12:24:56 -07:00
Chris Lu	cca1555cc7	mount: implement create for rsync temp files (#8749) * mount: implement create for rsync temp files * mount: move access implementation out of unsupported * mount: tighten access checks * mount: log access group lookup failures * mount: reset dirty pages on truncate * mount: tighten create and root access handling * mount: handle existing creates before quota checks * mount: restrict access fallback when group lookup fails When lookupSupplementaryGroupIDs returns an error, the previous code fell through to checking only the "other" permission bits, which could overgrant access. Require both group and other permission classes to satisfy the mask so access is never broader than intended. * mount: guard against nil entry in Create existing-file path maybeLoadEntry can return OK with a nil entry or nil Attributes in edge cases. Check before dereferencing to prevent a panic. * mount: reopen existing file on create race without O_EXCL When createRegularFile returns EEXIST because another process won the race, and O_EXCL is not set, reload the winner's entry and open it instead of propagating the error to the caller. * mount: check parent directory permission in createRegularFile Verify the caller has write+search (W_OK|X_OK) permission on the parent directory before creating a file. This applies to both Create and Mknod. Update test fixture mount mode to 0o777 so the existing tests pass with the new check. * mount: enforce file permission bits in AcquireHandle Map the open flags (O_RDONLY/O_WRONLY/O_RDWR) to an access mask and call hasAccess before handing out a file handle. This makes AcquireHandle the single source of truth for mode-based access control across Open, Create-existing, and Create-new paths. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-24 11:43:41 -07:00
Chris Lu	805625d06e	Add FUSE integration tests for POSIX file locking (#8752 ) * Add FUSE integration tests for POSIX file locking Test flock() and fcntl() advisory locks over the FUSE mount: - Exclusive and shared flock with conflict detection - flock upgrade (shared to exclusive) and release on close - fcntl F_SETLK write lock conflicts and shared read locks - fcntl F_GETLK conflict reporting on overlapping byte ranges - Non-overlapping byte-range locks held independently - F_SETLKW blocking until conflicting lock is released - Lock release on file descriptor close - Concurrent lock contention with multiple workers * Fix review feedback in POSIX lock integration tests - Assert specific EAGAIN error on fcntl lock conflicts instead of generic Error - Use O_APPEND in concurrent contention test so workers append rather than overwrite - Verify exact line count (numWorkers * writesPerWorker) after concurrent test - Check unlock error in F_SETLKW blocking test goroutine * Refactor fcntl tests to use subprocesses for inter-process semantics POSIX fcntl locks use the process's files_struct as lock owner, so all fds in the same process share the same owner and never conflict. This caused the fcntl tests to silently pass without exercising lock conflicts. Changes: - Add TestFcntlLockHelper subprocess entry point with hold/try/getlk actions - Add lockHolder with channel-based coordination (no scanner race) - Rewrite all fcntl tests to run contenders in separate subprocesses - Fix F_UNLCK int16 cast in GetLk assertion for type-safe comparison - Fix concurrent test: use non-blocking flock with retry to avoid exhausting go-fuse server reader goroutines (blocking FUSE SETLKW can starve unlock request processing, causing deadlock) flock tests remain same-process since flock uses per-struct-file owners. 
* Fix misleading comment and error handling in lock test subprocess - Fix comment: tryLockInSubprocess tests a subprocess, not the test process - Distinguish EAGAIN/EACCES from unexpected errors in subprocess try mode so real failures aren't silently masked as lock conflicts * Fix CI race in FcntlReleaseOnClose and increase flock retry budget - FcntlReleaseOnClose: retry lock acquisition after subprocess exits since the FUSE server may not process Release immediately - ConcurrentLockContention: increase retry limit from 500 to 3000 (5s → 30s budget) to handle CI load * separating flock and fcntl in the in-memory lock table and cleaning them up through the right release path: PID for POSIX locks, lock owner for flock * ReleasePosixOwner * weed/mount: flush before releasing posix close owner * weed/mount: keep woken lock waiters from losing inode state * test/fuse: make blocking fcntl helper state explicit * test/fuse: assert flock contention never overlaps * test/fuse: stabilize concurrent lock contention check * test/fuse: make concurrent contention writes deterministic * weed/mount: retry synchronous metadata flushes	2026-03-24 11:43:25 -07:00
Lars Lehtonen	9cc26d09e8	chore:(weed/worker/tasks/erasure_coding): Prune Unused and Untested Functions (#8761 ) * chore(weed/worker/tasks/erasure_coding): prune unused findVolumeReplicas() * chore(weed/worker/tasks/erasure_coding): prune unused isDiskSuitableForEC() * chore(weed/worker/tasks/erasure_coding): prune unused selectBestECDestinations() * chore(weed/worker/tasks/erasure_coding): prune unused candidatesToDiskInfos()	2026-03-24 10:10:28 -07:00
Chris Lu	3d872e86f8	Implement POSIX file locking for FUSE mount (#8750 ) * Add POSIX byte-range lock table for FUSE mount Implement PosixLockTable with per-inode range lock tracking supporting: - Shared (F_RDLCK) and exclusive (F_WRLCK) byte-range locks - Conflict detection across different lock owners - Lock coalescing for adjacent/overlapping same-owner same-type locks - Lock splitting on partial-range unlock - Blocking waiter support for SetLkw with cancellation - Owner-based cleanup for Release * Wire POSIX lock handlers into FUSE mount Implement GetLk, SetLk, SetLkw on WFS delegating to PosixLockTable. Add posixLocks field to WFS and initialize in constructor. Clean up locks on Release via ReleaseOwner using ReleaseIn.LockOwner. Remove ENOSYS stubs from weedfs_unsupported.go. * Enable POSIX and flock lock capabilities in FUSE mount Set EnableLocks: true in mount options to advertise CAP_POSIX_LOCKS and CAP_FLOCK_LOCKS during FUSE INIT. * Avoid thundering herd in lock waiter wake-up Replace broadcast-all wakeWaiters with selective wakeEligibleWaiters that checks each waiter's requested lock against remaining held locks. Only waiters whose request no longer conflicts are woken; others stay queued. Store the requested lockRange in each lockWaiter to enable this. * Fix uint64 overflow in adjacency check for lock coalescing Guard h.End+1 and lk.End+1 with < ^uint64(0) checks so that End == math.MaxUint64 (EOF) does not wrap to 0 and falsely merge non-adjacent locks. * Add test for non-adjacent ranges with gap not being coalesced	2026-03-23 22:18:51 -07:00
Chris Lu	2844c70ecf	fix tests	2026-03-23 19:36:14 -07:00
Chris Lu	e5f72077ee	fix: resolve CORS cache race condition causing stale 404 responses (#8748 ) The metadata subscription handler (updateBucketConfigCacheFromEntry) was making a separate RPC call via loadCORSFromBucketContent to load CORS configuration. This created a race window where a slow CreateBucket subscription event could re-cache stale data after PutBucketCors had already cleared the cache, causing subsequent GetBucketCors to return 404 NoSuchCORSConfiguration. Parse CORS directly from the subscription entry's Content field instead of making a separate RPC. Also fix getBucketConfig to parse CORS from the already-fetched entry, eliminating a redundant RPC call. Fix TestCORSCaching to use require.NoError to prevent nil pointer dereference panics when GetBucketCors fails.	2026-03-23 19:33:20 -07:00
Chris Lu	c31e6b4684	Use filer-side copy for mounted whole-file copy_file_range (#8747 ) * Optimize mounted whole-file copy_file_range * Address mounted copy review feedback * Harden mounted copy fast path --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-23 18:35:15 -07:00
Chris Lu	6bf654c25c	fix: keep metadata subscriptions progressing (#8730 ) (#8746 ) * fix: keep metadata subscriptions progressing (#8730) * test: cancel slow metadata writers with parent context * filer: ignore missing persisted log chunks	2026-03-23 15:26:54 -07:00
Chris Lu	d5ee35c8df	Fix S3 delete for non-empty directory markers (#8740 ) * Fix S3 delete for non-empty directory markers * Address review feedback on directory marker deletes * Stabilize FUSE concurrent directory operations	2026-03-23 13:35:16 -07:00
dependabot[bot]	b3b7033fe1	build(deps): bump github.com/klauspost/compress from 1.18.4 to 1.18.5 (#8739 ) Bumps [github.com/klauspost/compress](https://github.com/klauspost/compress) from 1.18.4 to 1.18.5. - [Release notes](https://github.com/klauspost/compress/releases) - [Commits](https://github.com/klauspost/compress/compare/v1.18.4...v1.18.5) --- updated-dependencies: - dependency-name: github.com/klauspost/compress dependency-version: 1.18.5 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 11:59:53 -07:00
Mmx233	ecadeddcbe	fix: extend ignore404Error to match 404 Not Found string from S3 sink… (#8741 ) * fix: extend ignore404Error to match 404 Not Found string from S3 sink errors * test: add unit tests for isIgnorable404 error matching * improve: pre-compute ignorable 404 string and simplify isIgnorable404 * test: replace init() with TestMain for global HTTP client setup	2026-03-23 11:59:34 -07:00
dependabot[bot]	156e1a6e64	build(deps): bump gocloud.dev/pubsub/rabbitpubsub from 0.44.0 to 0.45.0 (#8737 ) Bumps [gocloud.dev/pubsub/rabbitpubsub](https://github.com/google/go-cloud) from 0.44.0 to 0.45.0. - [Release notes](https://github.com/google/go-cloud/releases) - [Commits](https://github.com/google/go-cloud/compare/v0.44.0...v0.45.0) --- updated-dependencies: - dependency-name: gocloud.dev/pubsub/rabbitpubsub dependency-version: 0.45.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 11:55:16 -07:00
Copilot	2b6271469b	Merge branch 'master' of https://github.com/seaweedfs/seaweedfs	2026-03-23 11:22:45 -07:00
Copilot	963ec4c6e6	remove claude from github ci	2026-03-23 11:22:19 -07:00
dependabot[bot]	fb442a57d7	build(deps): bump actions/checkout from 4 to 6 (#8738 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Commits](https://github.com/actions/checkout/compare/v4...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 10:50:22 -07:00
dependabot[bot]	a080fbb495	build(deps): bump github.com/go-ldap/ldap/v3 from 3.4.12 to 3.4.13 (#8736 ) Bumps [github.com/go-ldap/ldap/v3](https://github.com/go-ldap/ldap) from 3.4.12 to 3.4.13. - [Release notes](https://github.com/go-ldap/ldap/releases) - [Commits](https://github.com/go-ldap/ldap/compare/v3.4.12...v3.4.13) --- updated-dependencies: - dependency-name: github.com/go-ldap/ldap/v3 dependency-version: 3.4.13 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 10:50:01 -07:00
dependabot[bot]	d6479b6d4e	build(deps): bump github.com/klauspost/reedsolomon from 1.13.0 to 1.13.3 (#8735 ) Bumps [github.com/klauspost/reedsolomon](https://github.com/klauspost/reedsolomon) from 1.13.0 to 1.13.3. - [Release notes](https://github.com/klauspost/reedsolomon/releases) - [Commits](https://github.com/klauspost/reedsolomon/compare/v1.13.0...v1.13.3) --- updated-dependencies: - dependency-name: github.com/klauspost/reedsolomon dependency-version: 1.13.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 10:49:49 -07:00
dependabot[bot]	17800f63a7	build(deps): bump golang.org/x/crypto from 0.48.0 to 0.49.0 (#8734 ) Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.48.0 to 0.49.0. - [Commits](https://github.com/golang/crypto/compare/v0.48.0...v0.49.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-version: 0.49.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 10:49:34 -07:00
dependabot[bot]	3a765df2ff	build(deps): bump dorny/test-reporter from 2 to 3 (#8733 ) Bumps [dorny/test-reporter](https://github.com/dorny/test-reporter) from 2 to 3. - [Release notes](https://github.com/dorny/test-reporter/releases) - [Changelog](https://github.com/dorny/test-reporter/blob/main/CHANGELOG.md) - [Commits](https://github.com/dorny/test-reporter/compare/v2...v3) --- updated-dependencies: - dependency-name: dorny/test-reporter dependency-version: '3' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 09:16:38 -07:00
dependabot[bot]	bb151d8e57	build(deps): bump actions/setup-node from 4 to 6 (#8732 ) Bumps [actions/setup-node](https://github.com/actions/setup-node) from 4 to 6. - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](https://github.com/actions/setup-node/compare/v4...v6) --- updated-dependencies: - dependency-name: actions/setup-node dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-23 09:16:21 -07:00
Chris Lu	9434d3733d	mount: async flush on close() when writebackCache is enabled (#8727 ) * mount: async flush on close() when writebackCache is enabled When -writebackCache is enabled, defer data upload and metadata flush from Flush() (triggered by close()) to a background goroutine in Release(). This allows processes like rsync that write many small files to proceed to the next file immediately instead of blocking on two network round-trips (volume upload + filer metadata) per file. Fixes #8718 * mount: add retry with backoff for async metadata flush The metadata flush in completeAsyncFlush now retries up to 3 times with exponential backoff (1s, 2s, 4s) on transient gRPC errors. Since the chunk data is already safely on volume servers at this point, only the filer metadata reference needs persisting — retrying is both safe and effective. Data flush (FlushData) is not retried externally because UploadWithRetry already handles transient HTTP/gRPC errors internally; if it still fails, the chunk memory has been freed. 
* test: add integration tests for writebackCache async flush Add comprehensive FUSE integration tests for the writebackCache async flush feature (issue #8718): - Basic operations: write/read, sequential files, large files, empty files, overwrites - Fsync correctness: fsync forces synchronous flush even in writeback mode, immediate read-after-fsync - Concurrent small files: multi-worker parallel writes (rsync-like workload), multi-directory, rapid create/close - Data integrity: append after close, partial writes, file size correctness, binary data preservation - Performance comparison: writeback vs synchronous flush throughput - Stress test: 16 workers x 100 files with content verification - Mixed concurrent operations: reads, writes, creates running together Also fix pre-existing test infrastructure issues: - Rename framework.go to framework_test.go (fixes Go package conflict) - Fix undefined totalSize variable in concurrent_operations_test.go * ci: update fuse-integration workflow to run full test suite The workflow previously only ran placeholder tests (simple_test.go, working_demo_test.go) in a temp directory due to a Go module conflict. Now that framework.go is renamed to framework_test.go, the full test suite compiles and runs correctly from test/fuse_integration/. Changes: - Run go test directly in test/fuse_integration/ (no temp dir copy) - Install weed binary to /usr/local/bin for test framework discovery - Configure /etc/fuse.conf with user_allow_other for FUSE mounts - Install fuse3 for modern FUSE support - Stream test output to log file for artifact upload * mount: fix three P1 races in async flush P1-1: Reopen overwrites data still flushing in background ReleaseByHandle removes the old handle from fhMap before the deferred flush finishes. A reopen of the same inode during that window would build from stale filer metadata, overwriting the async flush. Fix: Track in-flight async flushes per inode via pendingAsyncFlush map. 
AcquireHandle now calls waitForPendingAsyncFlush(inode) to block until any pending flush completes before reading filer metadata. P1-2: Deferred flush races rename and unlink after close completeAsyncFlush captured the path once at entry, but rename or unlink after close() could cause metadata to be written under the wrong name or recreate a deleted file. Fix: Re-resolve path from inode via GetPath right before metadata flush. GetPath returns the current path (reflecting renames) or ENOENT (if unlinked), in which case we skip the metadata flush. P1-3: SIGINT/SIGTERM bypasses the async-flush drain grace.OnInterrupt runs hooks then calls os.Exit(0), so WaitForAsyncFlush after server.Serve() never executes on signal. Fix: Add WaitForAsyncFlush (with 10s timeout) to the WFS interrupt handler, before cache cleanup. The timeout prevents hanging on Ctrl-C when the filer is unreachable. * mount: fix P1 races — draining handle stays in fhMap P1-1: Reopen TOCTOU The gap between ReleaseByHandle removing from fhMap and submitAsyncFlush registering in pendingAsyncFlush allowed a concurrent AcquireHandle to slip through with stale metadata. Fix: Hold pendingAsyncFlushMu across both the counter decrement (ReleaseByHandle) and the pending registration. The handle is registered as pending before the lock is released, so waitForPendingAsyncFlush always sees it. P1-2: Rename/unlink can't find draining handle ReleaseByHandle deleted from fhMap immediately. Rename's FindFileHandle(inode) at line 251 could not find the handle to update entry.Name. Unlink could not coordinate either. Fix: When asyncFlushPending is true, ReleaseByHandle/ReleaseByInode leave the handle in fhMap (counter=0 but maps intact). The handle stays visible to FindFileHandle so rename can update entry.Name. completeAsyncFlush re-resolves the path from the inode (GetPath) right before metadata flush for correctness after rename/unlink. After drain, RemoveFileHandle cleans up the maps. 
Double-return prevention: ReleaseByHandle/ReleaseByInode return nil if counter is already <= 0, so Forget after Release doesn't start a second drain goroutine. P1-3: SIGINT deletes swap files under running goroutines After the 10s timeout, os.RemoveAll deleted the write cache dir (containing swap files) while FlushData goroutines were still reading from them. Fix: Increase timeout to 30s. If timeout expires, skip write cache dir removal so in-flight goroutines can finish reading swap files. The OS (or next mount) cleans them up. Read cache is always removed. * mount: never skip metadata flush when Forget drops inode mapping Forget removes the inode→path mapping when the kernel's lookup count reaches zero, but this does NOT mean the file was unlinked — it only means the kernel evicted its cache entry. completeAsyncFlush was treating GetPath failure as "file unlinked" and skipping the metadata flush, which orphaned the just-uploaded chunks for live files. Fix: Save dir and name at doFlush defer time. In completeAsyncFlush, try GetPath first to pick up renames; if the mapping is gone, fall back to the saved dir/name. Always attempt the metadata flush — the filer is the authority on whether the file exists, not the local inode cache. * mount: distinguish Forget from Unlink in async flush path fallback The saved-path fallback (from the previous fix) always flushed metadata when GetPath failed, which recreated files that were explicitly unlinked after close(). The same stale fallback could recreate the pre-rename path if Forget dropped the inode mapping after a rename. Root cause: GetPath failure has two meanings: 1. Forget — kernel evicted the cache entry (file still exists) 2. Unlink — file was explicitly deleted (should not recreate) Fix (three coordinated changes): Unlink (weedfs_file_mkrm.go): Before RemovePath, look up the inode and find any draining handle via FindFileHandle. Set fh.isDeleted = true so the async flush knows the file was explicitly removed. 
Rename (weedfs_rename.go): When renaming a file with a draining handle, update asyncFlushDir/asyncFlushName to the post-rename location. This keeps the saved-path fallback current so Forget after rename doesn't flush to the old (pre-rename) path. completeAsyncFlush (weedfs_async_flush.go): Check fh.isDeleted first — if true, skip metadata flush (file was unlinked, chunks become orphans for volume.fsck). Otherwise, try GetPath for the current path (renames); fall back to saved path if Forget dropped the mapping (file is live, just evicted from kernel cache). * test/ci: address PR review nitpicks concurrent_operations_test.go: - Restore precise totalSize assertion instead of info.Size() > 0 writeback_cache_test.go: - Check rand.Read errors in all 3 locations (lines 310, 512, 757) - Check os.MkdirAll error in stress test (line 752) - Remove dead verifyErrors variable (line 332) - Replace both time.Sleep(5s) with polling via waitForFileContent to avoid flaky tests under CI load (lines 638, 700) fuse-integration.yml: - Add set -o pipefail so go test failures propagate through tee * ci: fix fuse3/fuse package conflict on ubuntu-22.04 runner fuse3 is pre-installed on ubuntu-22.04 runners and conflicts with the legacy fuse package. Only install libfuse3-dev for the headers. * mount/page_writer: remove debug println statements Remove leftover debug println("read new data1/2") from ReadDataAt in MemChunk and SwapFileChunk. * test: fix findWeedBinary matching source directory instead of binary findWeedBinary() matched ../../weed (the source directory) via os.Stat before checking PATH, then tried to exec a directory which fails with "permission denied" on the CI runner. Fix: Check PATH first (reliable in CI where the binary is installed to /usr/local/bin). For relative paths, verify the candidate is a regular file (!info.IsDir()). Add ../../weed/weed as a candidate for in-tree builds. 
* test: fix framework — dynamic ports, output capture, data dirs The integration test framework was failing in CI because: 1. All tests used hardcoded ports (19333/18080/18888), so sequential tests could conflict when prior processes hadn't fully released their ports yet. 2. Data subdirectories (data/master, data/volume) were not created before starting processes. 3. Master was started with -peers=none which is not a valid address. 4. Process stdout/stderr was not captured, making failures opaque ("service not ready within timeout" with no diagnostics). 5. The unmount fallback used 'umount' instead of 'fusermount -u'. 6. The mount used -cacheSizeMB (nonexistent) instead of -cacheCapacityMB and was missing -allowOthers=false for unprivileged CI runners. Fixes: - Dynamic port allocation via freePort() (net.Listen ":0") - Explicit gRPC ports via -port.grpc to avoid default port conflicts - Create data/master and data/volume directories in Setup() - Remove invalid -peers=none and -raftBootstrap flags - Capture process output to logDir/.log via startProcess() helper - dumpLog() prints tail of log file on service startup failure - Use fusermount3/fusermount -u for unmount - Fix mount flag names (-cacheCapacityMB, -allowOthers=false) test: remove explicit -port.grpc flags from test framework SeaweedFS convention: gRPC port = HTTP port + 10000. Volume and filer discover the master gRPC port by this convention. Setting explicit -port.grpc on master/volume/filer broke inter-service communication because the volume server computed master gRPC as HTTP+10000 but the actual gRPC was on a different port. Remove all -port.grpc flags and let the default convention work. Dynamic HTTP ports already ensure uniqueness; the derived gRPC ports (HTTP+10000) will also be unique. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-22 15:24:08 -07:00
Chris Lu	d6a872c4b9	Preserve explicit directory markers with octet-stream MIME (#8726 ) * Preserve octet-stream MIME on explicit directory markers * Run empty directory marker regression in CI * Run S3 Spark workflow for filer changes	2026-03-21 19:31:56 -07:00
Anton	7f0cf72574	admin/plugin: delete job_detail files when jobs are pruned from memory (#8722 ) * admin/plugin: delete job_detail files when jobs are pruned from memory pruneTrackedJobsLocked evicts the oldest terminal jobs from the in-memory tracker when the total exceeds maxTrackedJobsTotal (1000). However the dedicated per-job detail files in jobs/job_details/ were never removed, causing them to accumulate indefinitely on disk. Add ConfigStore.DeleteJobDetail and call it from pruneTrackedJobsLocked so that the file is cleaned up together with the in-memory entry. Deletion errors are logged at verbosity level 2 and do not abort the prune. * admin/plugin: add test for DeleteJobDetail --------- Co-authored-by: Anton Ustyugov <anton@devops> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-03-21 14:23:32 -07:00
Anton	90277ceed5	admin/plugin: migrate inline job details asynchronously to avoid slow startup (#8721 ) loadPersistedMonitorState performed a backward-compatibility migration that wrote every job with inline rich detail fields to a dedicated per-job detail file synchronously during startup. On deployments with many historical jobs (e.g. 1000+) stored on distributed block storage (e.g. Longhorn), each individual file write requires an fsync round-trip, making startup disproportionately slow and causing readiness/liveness probe failures. The in-memory state is populated correctly before the goroutine is started because stripTrackedJobDetailFields is still called in-place; only the disk writes are deferred. A completion log message at V(1) is emitted once the background migration finishes. Co-authored-by: Anton Ustyugov <anton@devops>	2026-03-21 14:18:42 -07:00
Anton	ae170f1fbb	admin: fix manual job run to use scheduler dispatch with capacity management and retry (#8720 ) RunPluginJobTypeAPI previously executed proposals with a naive sequential loop calling ExecutePluginJob per proposal. This had two bugs: 1. Double-lock: RunPluginJobTypeAPI held pluginLock while calling ExecutePluginJob, which tried to re-acquire the same lock for every job in the loop. 2. No capacity management: proposals were fired directly at workers without reserveScheduledExecutor, so every job beyond the worker concurrency limit received an immediate at_capacity error with no retry or backoff. Fix: add Plugin.DispatchProposals which reuses dispatchScheduledProposals - the same code path the scheduler loop uses - with executor reservation, configurable concurrency, and per-job retry with backoff. RunPluginJobTypeAPI now calls DispatchPluginProposals (a thin AdminServer wrapper) after holding pluginLock once. Co-authored-by: Anton Ustyugov <anton@devops>	2026-03-21 14:09:31 -07:00
Anton	8e7b15a995	wdclient/exclusive_locks: replace println with glog in ExclusiveLocker (#8723 ) RequestLock used a bare println to report transient lock acquisition failures ('lock: already locked by ...'), which writes directly to stdout instead of going through the structured logging pipeline. This causes log noise at the wrong level and cannot be filtered with -v or redirected like glog output. Changes: - println("lock:", ...) -> glog.V(2).Infof for per-retry acquisition errors (transient, high-frequency during startup when another instance still holds) - Add glog.V(1).Infof when the lock is successfully acquired - Add glog.V(2).Infof for successful renewals (replaces commented-out println) - Errorf -> Warningf for renewal failures (the goroutine exits cleanly, it is not a fatal error; the caller will re-acquire via RequestLock) Co-authored-by: Anton Ustyugov <anton@devops>	2026-03-21 13:36:26 -07:00
dependabot[bot]	cc781f57dc	build(deps): bump google.golang.org/grpc from 1.77.0 to 1.79.3 in /seaweedfs-rdma-sidecar (#8716 )	2026-03-21 05:12:17 -07:00


13239 Commits