Commit Graph

13317 Commits

Author SHA1 Message Date
Chris Lu
0884acd70c ci: add S3 mutation regression coverage (#8804)
* test/s3: stabilize distributed lock regression

* ci: add S3 mutation regression workflow

* test: fix delete regression readiness probe

* test: address mutation regression review comments
2026-03-28 20:17:20 -07:00
Chris Lu
297cdef1a3 s3api: accept all supported lifecycle rule types (#8813)
* s3api: accept NoncurrentVersionExpiration, AbortIncompleteMultipartUpload, Expiration.Date

Update PutBucketLifecycleConfigurationHandler to accept all newly-supported
lifecycle rule types. Only Transition and NoncurrentVersionTransition are
still rejected (require storage class tier infrastructure).

Changes:
- Remove ErrNotImplemented for Expiration.Date (handled by worker at scan time)
- Only reject rules with Transition.set or NoncurrentVersionTransition.set
- Extract prefix from Filter.And when present
- Add comment explaining that non-Expiration.Days rules are evaluated by
  the lifecycle worker from stored lifecycle XML, not via filer.conf TTL

The lifecycle XML is already stored verbatim in bucket metadata, so new
rule types are preserved on Get even without explicit handler support.
Filer.conf TTL entries are only created for Expiration.Days (fast path).

* s3api: skip TTL fast path for rules with tag or size filters

Rules with tag or size constraints (Filter.Tag, Filter.And with tags
or size bounds, Filter.ObjectSizeGreaterThan/LessThan) must not be
lowered to filer.conf TTL entries, because TTL applies unconditionally
to all objects under the prefix. These rules are evaluated at scan
time by the lifecycle worker which checks each object's tags and size.

Only simple Expiration.Days rules with prefix-only filters use the
TTL fast path (RocksDB compaction filter).

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 19:39:21 -07:00
Chris Lu
55318fe5ec lifecycle worker: add integration tests with in-memory filer (#8818) 2026-03-28 15:19:58 -07:00
Chris Lu
782ab84f95 lifecycle worker: drive MPU abort from lifecycle rules (#8812)
* lifecycle worker: drive MPU abort from lifecycle rules

Update the multipart upload abort phase to read
AbortIncompleteMultipartUpload.DaysAfterInitiation from the parsed
lifecycle rules. Falls back to the worker config abort_mpu_days when
no lifecycle XML rule specifies the value.

This means per-bucket MPU abort thresholds are now respected when
set via PutBucketLifecycleConfiguration, instead of using a single
global worker config value for all buckets.

* lifecycle worker: only use config AbortMPUDays when no lifecycle XML exists

When a bucket has lifecycle XML (useRuleEval=true) but no
AbortIncompleteMultipartUpload rule, mpuAbortDays should be 0
(no abort), not the worker config default. The config fallback
should only apply to buckets without lifecycle XML.

* lifecycle worker: only skip .uploads at bucket root

* lifecycle worker: use per-upload rule evaluation for MPU abort

Replace the single bucket-wide mpuAbortDays with per-upload evaluation
using s3lifecycle.EvaluateMPUAbort, which respects each rule's prefix
filter and DaysAfterInitiation threshold.

Previously the code took the first enabled abort rule's days value
and applied it to all uploads, ignoring prefix scoping and multiple
rules with different thresholds.

Config fallback (abort_mpu_days) now only applies when lifecycle XML
is truly absent (xmlPresent=false), not when XML exists but has no
abort rules.

Also fix EvaluateMPUAbort to use expectedExpiryTime for midnight-UTC
semantics matching other lifecycle cutoffs.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 13:50:33 -07:00
Chris Lu
f52a3c87ce lifecycle worker: fix ExpiredObjectDeleteMarker to match AWS semantics (#8811)
* lifecycle worker: add NoncurrentVersionExpiration support

Add version-aware scanning to the rule-based execution path. When the
walker encounters a .versions directory, processVersionsDirectory():
- Lists all version entries (v_<versionId>)
- Sorts by version timestamp (newest first)
- Walks non-current versions with ShouldExpireNoncurrentVersion()
  which handles both NoncurrentDays and NewerNoncurrentVersions
- Extracts successor time from version IDs (both old/new format)
- Skips delete markers in noncurrent version counting
- Falls back to entry Mtime when version ID timestamp is unavailable

Helper functions:
- sortVersionsByTimestamp: insertion sort by version ID timestamp
- getEntryVersionTimestamp: extracts timestamp with Mtime fallback

* lifecycle worker: address review feedback for noncurrent versions

- Use sentinel errLimitReached in versions directory handler
- Set NoncurrentIndex on ObjectInfo for proper NewerNoncurrentVersions
  evaluation

* lifecycle worker: fail closed on XML parse error, guard zero Mtime

- Fail closed when lifecycle XML exists but fails to parse, instead
  of falling back to TTL which could apply broader rules
- Guard Mtime > 0 before using time.Unix(mtime, 0) to avoid mapping
  unset Mtime to 1970, which would misorder versions and cause
  premature expiration

* lifecycle worker: count delete markers toward NoncurrentIndex

Noncurrent delete markers should count toward the
NewerNoncurrentVersions retention threshold so data versions
get the correct position index. Previously, skipping delete
markers without incrementing the index could retain too many
versions after delete/recreate cycles.

* lifecycle worker: fix version ordering, error propagation, and fail-closed scope

1. Use full version ID comparison (CompareVersionIds) for sorting
   .versions entries, not just decoded timestamps. Two versions with
   the same timestamp prefix but different random suffixes were
   previously misordered, potentially treating the newest version as
   noncurrent and deleting it.

2. Propagate .versions listing failures to the caller instead of
   swallowing them with (nil, 0). Transient filer errors on a
   .versions directory now surface in the job result.

3. Narrow the fail-closed path to only malformed lifecycle XML
   (errMalformedLifecycleXML). Transient filer LookupEntry errors
   now fall back to TTL with a warning, matching the original intent
   of "fail closed on bad config, not on network blips."

* lifecycle worker: only skip .uploads at bucket root

* lifecycle worker: sort.Slice, mixed-format test, XML presence tracking

- Replace manual insertion sort with sort.Slice in sortVersionsByVersionId
- Add TestCompareVersionIdsMixedFormats covering old/new format ordering
- Distinguish "no lifecycle XML" (nil) from "XML present but no effective
  rules" (non-nil empty slice) so buckets with all-disabled rules don't
  incorrectly fall back to filer.conf TTL expiration

* lifecycle worker: guard nil Attributes, use TrimSuffix in test

- Guard entry.Attributes != nil before accessing GetFileSize() and
  Mtime in both listExpiredObjectsByRules and processVersionsDirectory
- Use strings.TrimPrefix/TrimSuffix in TestVersionsDirectoryNaming
  to match the production code pattern

* lifecycle worker: skip TTL scan when XML present, fix test assertions

- When lifecycle XML is present but has no effective rules, skip
  object scanning entirely instead of falling back to TTL path
- Test sort output against concrete expected names instead of
  re-using the same comparator as the sort itself

* lifecycle worker: fix ExpiredObjectDeleteMarker to match AWS semantics

Rewrite cleanupDeleteMarkers() to only remove delete markers that are
the sole remaining version of an object. Previously, delete markers
were removed unconditionally which could resurface older versions in
versioned buckets.

New algorithm:
1. Walk bucket tree looking for .versions directories
2. Check ExtLatestVersionIsDeleteMarker from directory metadata
3. Count versions in the .versions directory
4. Only remove if count == 1 (delete marker is sole version)
5. Require an ExpiredObjectDeleteMarker=true rule (when lifecycle
   XML rules are present)
6. Remove the empty .versions directory after cleanup

This phase runs after NoncurrentVersionExpiration so version counts
are accurate.

* lifecycle worker: respect prefix filter in ExpiredObjectDeleteMarker rules

Previously hasDeleteMarkerRule was a bucket-wide boolean that ignored
rule prefixes. A prefix-scoped rule like "logs/" would incorrectly
clean up delete markers in all paths.

Add matchesDeleteMarkerRule() that checks if a matching enabled
ExpiredObjectDeleteMarker rule exists for the specific object key,
respecting the rule's prefix filter. Falls back to legacy behavior
(allow cleanup) when no lifecycle XML rules are provided.

* lifecycle worker: only skip .uploads at bucket root

Check dir == bucketPath before skipping directories named .uploads.
Previously a user-created directory like data/.uploads/ at any depth
would be incorrectly skipped during lifecycle scanning.

* lifecycle worker: fix delete marker cleanup with XML-present empty rules

1. matchesDeleteMarkerRule now uses nil check (not len==0) for legacy
   fallback. A non-nil empty slice means lifecycle XML was present but
   had no ExpiredObjectDeleteMarker rules, so cleanup is blocked.
   Previously, an empty slice triggered the legacy true path.

2. Use per-directory removedHere flag instead of cumulative cleaned
   counter when deciding to remove .versions directories. Previously,
   after the first successful cleanup anywhere in the bucket, every
   subsequent .versions directory would be removed even if its own
   delete marker was not actually deleted.

* lifecycle worker: use full filter matching for delete marker rules

matchesDeleteMarkerRule now uses s3lifecycle.MatchesFilter (exported)
instead of prefix-only matching. This ensures tag and size filters
on ExpiredObjectDeleteMarker rules are respected, preventing broader
deletions than the configured policy intends.

Add TestMatchesDeleteMarkerRule covering: nil rules (legacy), empty
rules (XML present), prefix match/mismatch, disabled rules, rules
without the flag, and tag-filtered rules against tagless markers.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 13:26:57 -07:00
Lars Lehtonen
b01a74c6bb Prune Unused Functions from weed/s3api (#8815)
* weed/s3api: prune calculatePartOffset()

* weed/s3api: prune clearCachedListMetadata()

* weed/s3api: prune S3ApiServer.isObjectRetentionActive()

* weed/s3api: prune S3ApiServer.ensureDirectoryAllEmpty()

* weed/s3api: prune s3ApiServer.getEncryptionTypeString()

* weed/s3api: prune newStreamError()

* weed/s3api: prune S3ApiServer.rotateSSECKey()

weed/s3api: prune S3ApiServer.rotateSSEKMSKey()

weed/s3api: prune S3ApiServer.rotateSSEKMSMetadataOnly()

weed/s3api: prune S3ApiServer.rotateSSECChunks()

weed/s3api: prune S3ApiServer.rotateSSEKMSChunks()

weed/s3api: prune S3ApiServer.rotateSSECChunk()

weed/s3api: prune S3ApiServer.rotateSSEKMSChunk()

* weed/s3api: prune addCounterToIV()

* weed/s3api: prune minInt()

* weed/s3api: prune isMethodActionMismatch()

* weed/s3api: prune hasSpecificQueryParameters()

* weed/s3api: prune handlePutToFilerError()

weed/s3api: prune handlePutToFilerInternalError()

weed/s3api: prune logErrorAndReturn()

weed/s3api: prune logInternalError

weed/s3api: prune handleSSEError()

weed/s3api: prune handleSSEInternalError()

* weed/s3api: prune encryptionConfigToProto()

* weed/s3api: prune S3ApiServer.touch()
2026-03-28 13:24:11 -07:00
Chris Lu
f6ec9941cb lifecycle worker: NoncurrentVersionExpiration support (#8810)
* lifecycle worker: add NoncurrentVersionExpiration support

Add version-aware scanning to the rule-based execution path. When the
walker encounters a .versions directory, processVersionsDirectory():
- Lists all version entries (v_<versionId>)
- Sorts by version timestamp (newest first)
- Walks non-current versions with ShouldExpireNoncurrentVersion()
  which handles both NoncurrentDays and NewerNoncurrentVersions
- Extracts successor time from version IDs (both old/new format)
- Skips delete markers in noncurrent version counting
- Falls back to entry Mtime when version ID timestamp is unavailable

Helper functions:
- sortVersionsByTimestamp: insertion sort by version ID timestamp
- getEntryVersionTimestamp: extracts timestamp with Mtime fallback

* lifecycle worker: address review feedback for noncurrent versions

- Use sentinel errLimitReached in versions directory handler
- Set NoncurrentIndex on ObjectInfo for proper NewerNoncurrentVersions
  evaluation

* lifecycle worker: fail closed on XML parse error, guard zero Mtime

- Fail closed when lifecycle XML exists but fails to parse, instead
  of falling back to TTL which could apply broader rules
- Guard Mtime > 0 before using time.Unix(mtime, 0) to avoid mapping
  unset Mtime to 1970, which would misorder versions and cause
  premature expiration

* lifecycle worker: count delete markers toward NoncurrentIndex

Noncurrent delete markers should count toward the
NewerNoncurrentVersions retention threshold so data versions
get the correct position index. Previously, skipping delete
markers without incrementing the index could retain too many
versions after delete/recreate cycles.

* lifecycle worker: fix version ordering, error propagation, and fail-closed scope

1. Use full version ID comparison (CompareVersionIds) for sorting
   .versions entries, not just decoded timestamps. Two versions with
   the same timestamp prefix but different random suffixes were
   previously misordered, potentially treating the newest version as
   noncurrent and deleting it.

2. Propagate .versions listing failures to the caller instead of
   swallowing them with (nil, 0). Transient filer errors on a
   .versions directory now surface in the job result.

3. Narrow the fail-closed path to only malformed lifecycle XML
   (errMalformedLifecycleXML). Transient filer LookupEntry errors
   now fall back to TTL with a warning, matching the original intent
   of "fail closed on bad config, not on network blips."

* lifecycle worker: only skip .uploads at bucket root

* lifecycle worker: sort.Slice, mixed-format test, XML presence tracking

- Replace manual insertion sort with sort.Slice in sortVersionsByVersionId
- Add TestCompareVersionIdsMixedFormats covering old/new format ordering
- Distinguish "no lifecycle XML" (nil) from "XML present but no effective
  rules" (non-nil empty slice) so buckets with all-disabled rules don't
  incorrectly fall back to filer.conf TTL expiration

* lifecycle worker: guard nil Attributes, use TrimSuffix in test

- Guard entry.Attributes != nil before accessing GetFileSize() and
  Mtime in both listExpiredObjectsByRules and processVersionsDirectory
- Use strings.TrimPrefix/TrimSuffix in TestVersionsDirectoryNaming
  to match the production code pattern

* lifecycle worker: skip TTL scan when XML present, fix test assertions

- When lifecycle XML is present but has no effective rules, skip
  object scanning entirely instead of falling back to TTL path
- Test sort output against concrete expected names instead of
  re-using the same comparator as the sort itself

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 12:58:21 -07:00
Chris Lu
9c3bc138a0 lifecycle worker: scan-time rule evaluation for object expiration (#8809)
* s3api: extend lifecycle XML types with NoncurrentVersionExpiration, AbortIncompleteMultipartUpload

Add missing S3 lifecycle rule types to the XML data model:
- NoncurrentVersionExpiration with NoncurrentDays and NewerNoncurrentVersions
- NoncurrentVersionTransition with NoncurrentDays and StorageClass
- AbortIncompleteMultipartUpload with DaysAfterInitiation
- Filter.ObjectSizeGreaterThan and ObjectSizeLessThan
- And.ObjectSizeGreaterThan and ObjectSizeLessThan
- Filter.UnmarshalXML to properly parse Tag, And, and size filter elements

Each new type follows the existing set-field pattern for conditional
XML marshaling. No behavior changes - these types are not yet wired
into handlers or the lifecycle worker.

* s3lifecycle: add lifecycle rule evaluator package

New package weed/s3api/s3lifecycle/ provides a pure-function lifecycle
rule evaluation engine. The evaluator accepts flattened Rule structs and
ObjectInfo metadata, and returns the appropriate Action.

Components:
- evaluator.go: Evaluate() for per-object actions with S3 priority
  ordering (delete marker > noncurrent version > current expiration),
  ShouldExpireNoncurrentVersion() with NewerNoncurrentVersions support,
  EvaluateMPUAbort() for multipart upload rules
- filter.go: prefix, tag, and size-based filter matching
- tags.go: ExtractTags() extracts S3 tags from filer Extended metadata,
  HasTagRules() for scan-time optimization
- version_time.go: GetVersionTimestamp() extracts timestamps from
  SeaweedFS version IDs (both old and new format)

Comprehensive test coverage: 54 tests covering all action types,
filter combinations, edge cases, and version ID formats.

* s3api: add UnmarshalXML for Expiration, Transition, ExpireDeleteMarker

Add UnmarshalXML methods that set the internal 'set' flag during XML
parsing. Previously these flags were only set programmatically, causing
XML round-trip to drop elements. This ensures lifecycle configurations
stored as XML survive unmarshal/marshal cycles correctly.

Add comprehensive XML round-trip tests for all lifecycle rule types
including NoncurrentVersionExpiration, AbortIncompleteMultipartUpload,
Filter with Tag/And/size constraints, and a complete Terraform-style
lifecycle configuration.

* s3lifecycle: address review feedback

- Fix version_time.go overflow: guard timestampPart > MaxInt64 before
  the inversion subtraction to prevent uint64 wrap
- Make all expiry checks inclusive (!now.Before instead of now.After)
  so actions trigger at the exact scheduled instant
- Add NoncurrentIndex to ObjectInfo so Evaluate() can properly handle
  NewerNoncurrentVersions via ShouldExpireNoncurrentVersion()
- Add test for high-bit overflow version ID

* s3lifecycle: guard ShouldExpireNoncurrentVersion against zero SuccessorModTime

Add early return when obj.IsLatest or obj.SuccessorModTime.IsZero()
to prevent premature expiration of versions with uninitialized
successor timestamps (zero value would compute to epoch, always expired).

* lifecycle worker: detect buckets with lifecycle XML, not just filer.conf TTLs

Update the detection phase to check for stored lifecycle XML in bucket
metadata (key: s3-bucket-lifecycle-configuration-xml) in addition to
filer.conf TTL entries. A bucket is proposed for lifecycle processing if
it has lifecycle XML OR filer.conf TTLs (backward compatible).

New proposal parameters:
- has_lifecycle_xml: whether the bucket has stored lifecycle XML
- versioning_status: the bucket's versioning state (Enabled/Suspended/"")

These parameters will be used by the execution phase (subsequent PR)
to determine which evaluation path to use.

* lifecycle worker: update detection function comment to reflect XML support

* lifecycle worker: add lifecycle XML parsing and rule conversion

Add rules.go with:
- parseLifecycleXML() converts stored lifecycle XML to evaluator-friendly
  s3lifecycle.Rule structs, handling Filter.Prefix, Filter.Tag, Filter.And,
  size constraints, NoncurrentVersionExpiration, AbortIncompleteMultipartUpload,
  Expiration.Date, and ExpiredObjectDeleteMarker
- loadLifecycleRulesFromBucket() reads lifecycle XML from bucket metadata
- parseExpirationDate() supports RFC3339 and ISO 8601 date-only formats

Comprehensive tests for all XML variants, filter types, and date formats.

* lifecycle worker: add scan-time rule evaluation for object expiration

Update executeLifecycleForBucket to try lifecycle XML evaluation first,
falling back to TTL-only evaluation when no lifecycle XML exists.

New listExpiredObjectsByRules() function:
- Walks the bucket directory tree
- Builds s3lifecycle.ObjectInfo from each filer entry
- Calls s3lifecycle.Evaluate() to check lifecycle rules
- Skips objects already handled by TTL fast path (TtlSec set)
- Extracts tags only when rules use tag-based filters (optimization)
- Skips .uploads and .versions directories (handled by other phases)

Supports Expiration.Days, Expiration.Date, Filter.Prefix, Filter.Tag,
Filter.And, and Filter.ObjectSize* in the scan-time evaluation path.
Existing TTL-based path remains for backward compatibility.

* lifecycle worker: address review feedback

- Use sentinel error (errLimitReached) instead of string matching
  for scan limit detection
- Fix loadLifecycleRulesFromBucket path: use bucketsPath directly
  as directory for LookupEntry instead of path.Dir which produced
  the wrong parent

* lifecycle worker: fix And filter detection for size-only constraints

The And branch condition only triggered when Prefix or Tags were present,
missing the case where And contains only ObjectSizeGreaterThan or
ObjectSizeLessThan without a prefix or tags.

* lifecycle worker: address review feedback round 3

- rules.go: pass through Filter-level size constraints when Tag is
  present without And (Tag+size combination was dropping sizes)
- execution.go: add doc comment to listExpiredObjectsByRules noting
  that it handles non-versioned objects only; versioned objects are
  handled by processVersionsDirectory
- rules_test.go: add bounds checks before indexing rules[0]

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 11:39:50 -07:00
Chris Lu
98f545c7fa lifecycle worker: detect buckets via lifecycle XML metadata (#8808)
* s3api: extend lifecycle XML types with NoncurrentVersionExpiration, AbortIncompleteMultipartUpload

Add missing S3 lifecycle rule types to the XML data model:
- NoncurrentVersionExpiration with NoncurrentDays and NewerNoncurrentVersions
- NoncurrentVersionTransition with NoncurrentDays and StorageClass
- AbortIncompleteMultipartUpload with DaysAfterInitiation
- Filter.ObjectSizeGreaterThan and ObjectSizeLessThan
- And.ObjectSizeGreaterThan and ObjectSizeLessThan
- Filter.UnmarshalXML to properly parse Tag, And, and size filter elements

Each new type follows the existing set-field pattern for conditional
XML marshaling. No behavior changes - these types are not yet wired
into handlers or the lifecycle worker.

* s3lifecycle: add lifecycle rule evaluator package

New package weed/s3api/s3lifecycle/ provides a pure-function lifecycle
rule evaluation engine. The evaluator accepts flattened Rule structs and
ObjectInfo metadata, and returns the appropriate Action.

Components:
- evaluator.go: Evaluate() for per-object actions with S3 priority
  ordering (delete marker > noncurrent version > current expiration),
  ShouldExpireNoncurrentVersion() with NewerNoncurrentVersions support,
  EvaluateMPUAbort() for multipart upload rules
- filter.go: prefix, tag, and size-based filter matching
- tags.go: ExtractTags() extracts S3 tags from filer Extended metadata,
  HasTagRules() for scan-time optimization
- version_time.go: GetVersionTimestamp() extracts timestamps from
  SeaweedFS version IDs (both old and new format)

Comprehensive test coverage: 54 tests covering all action types,
filter combinations, edge cases, and version ID formats.

* s3api: add UnmarshalXML for Expiration, Transition, ExpireDeleteMarker

Add UnmarshalXML methods that set the internal 'set' flag during XML
parsing. Previously these flags were only set programmatically, causing
XML round-trip to drop elements. This ensures lifecycle configurations
stored as XML survive unmarshal/marshal cycles correctly.

Add comprehensive XML round-trip tests for all lifecycle rule types
including NoncurrentVersionExpiration, AbortIncompleteMultipartUpload,
Filter with Tag/And/size constraints, and a complete Terraform-style
lifecycle configuration.

* s3lifecycle: address review feedback

- Fix version_time.go overflow: guard timestampPart > MaxInt64 before
  the inversion subtraction to prevent uint64 wrap
- Make all expiry checks inclusive (!now.Before instead of now.After)
  so actions trigger at the exact scheduled instant
- Add NoncurrentIndex to ObjectInfo so Evaluate() can properly handle
  NewerNoncurrentVersions via ShouldExpireNoncurrentVersion()
- Add test for high-bit overflow version ID

* s3lifecycle: guard ShouldExpireNoncurrentVersion against zero SuccessorModTime

Add early return when obj.IsLatest or obj.SuccessorModTime.IsZero()
to prevent premature expiration of versions with uninitialized
successor timestamps (zero value would compute to epoch, always expired).

* lifecycle worker: detect buckets with lifecycle XML, not just filer.conf TTLs

Update the detection phase to check for stored lifecycle XML in bucket
metadata (key: s3-bucket-lifecycle-configuration-xml) in addition to
filer.conf TTL entries. A bucket is proposed for lifecycle processing if
it has lifecycle XML OR filer.conf TTLs (backward compatible).

New proposal parameters:
- has_lifecycle_xml: whether the bucket has stored lifecycle XML
- versioning_status: the bucket's versioning state (Enabled/Suspended/"")

These parameters will be used by the execution phase (subsequent PR)
to determine which evaluation path to use.

* lifecycle worker: update detection function comment to reflect XML support

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 11:16:58 -07:00
Chris Lu
54dd4f091d s3lifecycle: add lifecycle rule evaluator package and extend XML types (#8807)
* s3api: extend lifecycle XML types with NoncurrentVersionExpiration, AbortIncompleteMultipartUpload

Add missing S3 lifecycle rule types to the XML data model:
- NoncurrentVersionExpiration with NoncurrentDays and NewerNoncurrentVersions
- NoncurrentVersionTransition with NoncurrentDays and StorageClass
- AbortIncompleteMultipartUpload with DaysAfterInitiation
- Filter.ObjectSizeGreaterThan and ObjectSizeLessThan
- And.ObjectSizeGreaterThan and ObjectSizeLessThan
- Filter.UnmarshalXML to properly parse Tag, And, and size filter elements

Each new type follows the existing set-field pattern for conditional
XML marshaling. No behavior changes - these types are not yet wired
into handlers or the lifecycle worker.

* s3lifecycle: add lifecycle rule evaluator package

New package weed/s3api/s3lifecycle/ provides a pure-function lifecycle
rule evaluation engine. The evaluator accepts flattened Rule structs and
ObjectInfo metadata, and returns the appropriate Action.

Components:
- evaluator.go: Evaluate() for per-object actions with S3 priority
  ordering (delete marker > noncurrent version > current expiration),
  ShouldExpireNoncurrentVersion() with NewerNoncurrentVersions support,
  EvaluateMPUAbort() for multipart upload rules
- filter.go: prefix, tag, and size-based filter matching
- tags.go: ExtractTags() extracts S3 tags from filer Extended metadata,
  HasTagRules() for scan-time optimization
- version_time.go: GetVersionTimestamp() extracts timestamps from
  SeaweedFS version IDs (both old and new format)

Comprehensive test coverage: 54 tests covering all action types,
filter combinations, edge cases, and version ID formats.

* s3api: add UnmarshalXML for Expiration, Transition, ExpireDeleteMarker

Add UnmarshalXML methods that set the internal 'set' flag during XML
parsing. Previously these flags were only set programmatically, causing
XML round-trip to drop elements. This ensures lifecycle configurations
stored as XML survive unmarshal/marshal cycles correctly.

Add comprehensive XML round-trip tests for all lifecycle rule types
including NoncurrentVersionExpiration, AbortIncompleteMultipartUpload,
Filter with Tag/And/size constraints, and a complete Terraform-style
lifecycle configuration.

* s3lifecycle: address review feedback

- Fix version_time.go overflow: guard timestampPart > MaxInt64 before
  the inversion subtraction to prevent uint64 wrap
- Make all expiry checks inclusive (!now.Before instead of now.After)
  so actions trigger at the exact scheduled instant
- Add NoncurrentIndex to ObjectInfo so Evaluate() can properly handle
  NewerNoncurrentVersions via ShouldExpireNoncurrentVersion()
- Add test for high-bit overflow version ID

* s3lifecycle: guard ShouldExpireNoncurrentVersion against zero SuccessorModTime

Add early return when obj.IsLatest or obj.SuccessorModTime.IsZero()
to prevent premature expiration of versions with uninitialized
successor timestamps (zero value would compute to epoch, always expired).

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-28 11:10:31 -07:00
Chris Lu
7d5cbfd547 s3: support s3:x-amz-server-side-encryption policy condition (#8806)
* s3: support s3:x-amz-server-side-encryption policy condition (#7680)

- Normalize x-amz-server-side-encryption header values to canonical form
  (aes256 → AES256, aws:kms mixed-case → aws:kms) so StringEquals
  conditions work regardless of client capitalisation
- Exempt UploadPart and UploadPartCopy from SSE Null conditions: these
  actions inherit SSE from the initial CreateMultipartUpload request and
  do not re-send the header, so Deny/Null("true") should not block them
- Add sse_condition_test.go covering StringEquals, Null, case-insensitive
  normalisation, and multipart continuation action exemption

* s3: address review comments on SSE condition support

- Replace "inherited" sentinel in injectSSEForMultipart with "AES256" so
  that StringEquals/Null conditions evaluate against a meaningful value;
  add TODO noting that KMS multipart uploads need the actual algorithm
  looked up from the upload state
- Rewrite TestSSECaseInsensitiveNormalization to drive normalisation
  through EvaluatePolicyForRequest with a real *http.Request so regressions
  in the production code path are caught; split into AES256 and aws:kms
  variants to cover both normalisation branches

* s3: plumb real inherited SSE from multipart upload state into policy eval

Instead of injecting a static "AES256" sentinel for UploadPart/UploadPartCopy,
look up the actual SSE algorithm from the stored CreateMultipartUpload entry
and pass it through the evaluation chain.

Changes:
- PolicyEvaluationArgs gains InheritedSSEAlgorithm string; set by the
  BucketPolicyEngine wrapper for multipart continuation actions
- injectSSEForMultipart(conditions, inheritedSSE) now accepts the real
  algorithm; empty string means no SSE → Null("true") fires correctly
- IsMultipartContinuationAction exported so the s3api wrapper can use it
- BucketPolicyEngine gets a MultipartSSELookup callback (set by S3ApiServer)
  that fetches the upload entry and reads SeaweedFSSSEKMSKeyID /
  SeaweedFSSSES3Encryption to determine the algorithm
- S3ApiServer.getMultipartSSEAlgorithm implements the lookup via getEntry
- Tests updated: three multipart cases (AES256, aws:kms, no-SSE-must-deny)
  plus UploadPartCopy coverage
2026-03-27 23:15:01 -07:00
Chris Lu
e3f052cd84 s3api: preserve lifecycle config responses for Terraform (#8805)
* s3api: preserve lifecycle configs for terraform

* s3api: bound lifecycle config request bodies

* s3api: make bucket config updates copy-on-write

* s3api: tighten string slice cloning
2026-03-27 22:50:02 -07:00
Chris Lu
0adb78bc6b s3api: make conditional mutations atomic and AWS-compatible (#8802)
* s3api: serialize conditional write finalization

* s3api: add conditional delete mutation checks

* s3api: enforce destination conditions for copy

* s3api: revalidate multipart completion under lock

* s3api: rollback failed put finalization hooks

* s3api: report delete-marker version deletions

* s3api: fix copy destination versioning edge cases

* s3api: make versioned multipart completion idempotent

* test/s3: cover conditional mutation regressions

* s3api: rollback failed copy version finalization

* s3api: resolve suspended delete conditions via latest entry

* s3api: remove copy test null-version injection

* s3api: reject out-of-order multipart completions

* s3api: preserve multipart replay version metadata

* s3api: surface copy destination existence errors

* s3api: simplify delete condition target resolution

* test/s3: make conditional delete assertions order independent

* test/s3: add distributed lock gateway integration

* s3api: fail closed multipart versioned completion

* s3api: harden copy metadata and overwrite paths

* s3api: create delete markers for suspended deletes

* s3api: allow duplicate multipart completion parts
2026-03-27 19:22:26 -07:00
Chris Lu
bf2a2d2538 test: preserve branch when recovering bare git repo (#8803)
* test: preserve branch when recovering bare git repo

* Replaced the standalone ensureMountClone + gitRun in Phase 5 with a new resetToCommitWithRecovery function that mirrors   the existing pullFromCommitWithRecovery pattern
2026-03-27 17:13:00 -07:00
Chris Lu
f256002d0b fix ec.balance failing to rebalance when all nodes share all volumes (#8796)
* fix ec.balance failing to rebalance when all nodes share all volumes (#8793)

Two bugs in doBalanceEcRack prevented rebalancing:

1. Sorting by freeEcSlot instead of actual shard count caused incorrect
   empty/full node selection when nodes have different total capacities.

2. The volume-level check skipped any volume already present on the
   target node. When every node has a shard of every volume (common
   with many EC volumes across N nodes with N shards each), no moves
   were possible.

Fix: sort by actual shard count, and use a two-pass approach - first
prefer moving shards of volumes not on the target (best diversity),
then fall back to moving specific shard IDs not yet on the target.

* add test simulating real cluster topology from issue #8793

Uses the actual node addresses and mixed max capacities (80 vs 33)
from the reporter's 14-node cluster to verify ec.balance correctly
rebalances with heterogeneous node sizes.

* fix pass comments to match 0-indexed loop variable
2026-03-27 11:14:10 -07:00
Chris Lu
c2c58419b8 filer.sync: send log file chunk fids to clients for direct volume server reads (#8792)
* filer.sync: send log file chunk fids to clients for direct volume server reads

Instead of the server reading persisted log files from volume servers, parsing
entries, and streaming them over gRPC (serial bottleneck), clients that opt in
via client_supports_metadata_chunks receive log file chunk references (fids)
and read directly from volume servers in parallel.

New proto messages:
- LogFileChunkRef: chunk fids + timestamp + filer ID for one log file
- SubscribeMetadataRequest.client_supports_metadata_chunks: client opt-in
- SubscribeMetadataResponse.log_file_refs: server sends refs during backlog

Server changes:
- CollectLogFileRefs: lists log files and returns chunk refs without any
  volume server I/O (metadata-only operation)
- SubscribeMetadata/SubscribeLocalMetadata: when client opts in, sends refs
  during persisted log phase, then falls back to normal streaming for
  in-memory events

Client changes:
- ReadLogFileRefs: reads log files from volume servers, parses entries,
  filters by path prefix, invokes processEventFn
- MetadataFollowOption.LogFileReaderFn: factory for chunk readers,
  enables metadata chunks when non-nil
- Both filer_pb_tail.go and meta_aggregator.go recv loops accumulate
  refs then process them at the disk→memory transition

Backward compatible: old clients don't set the flag, get existing behavior.

Ref: #8771

* filer.sync: merge entries across filers in timestamp order on client side

ReadLogFileRefs now groups refs by filer ID and merges entries from
multiple filers using a min-heap priority queue — the same algorithm
the server uses in OrderedLogVisitor + LogEntryItemPriorityQueue.

This ensures events are processed in correct timestamp order even when
log files from different filers have interleaved timestamps. Single-filer
case takes the fast path (no heap allocation).

* filer.sync: integration tests for direct-read metadata chunks

Three test categories:

1. Merge correctness (TestReadLogFileRefsMergeOrder):
   Verifies entries from 3 filers are delivered in strict timestamp order,
   matching the server-side OrderedLogVisitor guarantee.

2. Path filtering (TestReadLogFileRefsPathFilter):
   Verifies client-side path prefix filtering works correctly.

3. Throughput comparison (TestDirectReadVsServerSideThroughput):
   3 filers × 7 files × 300 events = 6300 events, 2ms per file read:

   server-side:  6300 events  218ms   28,873 events/sec
   direct-read:  6300 events   51ms  123,566 events/sec  (4.3x)
   parallel:     6300 events   17ms  378,628 events/sec  (13.1x)

   Direct-read eliminates gRPC send overhead per event (4.3x).
   Parallel per-filer reading eliminates serial file I/O (13.1x).

* filer.sync: parallel per-filer reads with prefetching in ReadLogFileRefs

ReadLogFileRefs now has two levels of I/O overlap:

1. Cross-filer parallelism: one goroutine per filer reads its files
   concurrently. Entries feed into per-filer channels, merged by the
   main goroutine via min-heap (same ordering guarantee as the server's
   OrderedLogVisitor).

2. Within-filer prefetching: while the current file's entries are being
   consumed by the merge heap, the next file is already being read from
   the volume server in a background goroutine.

Single-filer fast path avoids the heap and channels.

Test results (3 filers × 7 files × 300 events, 2ms per file read):

  server-side sequential:  6300 events  212ms   29,760 events/sec
  parallel + prefetch:     6300 events   36ms  177,443 events/sec
  Speedup: 6.0x

* filer.sync: address all review comments on metadata chunks PR

Critical fixes:
- sendLogFileRefs: bypass pipelinedSender, send directly on gRPC stream.
  Ref messages have TsNs=0 and were being incorrectly batched into the
  Events field by the adaptive batching logic, corrupting ref delivery.
- readLogFileEntries: use io.ReadFull instead of reader.Read to prevent
  partial reads from corrupting size values or protobuf data.
- Error handling: only skip chunk-not-found errors (matching server-side
  isChunkNotFoundError). Other I/O or decode failures are propagated so
  the follower can retry.

High-priority fixes:
- CollectLogFileRefs: remove incorrect +24h padding from stopTime. The
  extra day caused unnecessary log file refs to be collected.
- Path filtering: ReadLogFileRefs now accepts PathFilter struct with
  PathPrefix, AdditionalPathPrefixes, and DirectoriesToWatch. Uses
  util.Join for path construction (avoids "//foo" on root). Excludes
  /.system/log/ internal entries. Matches server-side
  eachEventNotificationFn filtering logic.

Medium-priority fixes:
- CollectLogFileRefs: accept context.Context, propagate to
  ListDirectoryEntries calls for cancellation support.
- NewChunkStreamReaderFromLookup: accept context.Context, propagate to
  doNewChunkStreamReader.

Test fixes:
- Check error returns from ReadLogFileRefs in all test call sites.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-27 11:01:29 -07:00
Chris Lu
e52a94a3a7 sftpd: use global TLS-aware HTTP client for filer uploads (#8795)
* sftpd: use global TLS-aware HTTP client for filer uploads (#8794)

putFile() hardcoded http:// and used http.DefaultClient, which broke
file uploads when the filer has HTTPS/TLS enabled. Switch to the global
HTTP client which reads [https.client] from security.toml and
automatically normalizes the URL scheme.

* sftpd: propagate NormalizeUrl error instead of swallowing it
2026-03-27 10:29:49 -07:00
Lars Lehtonen
41aac90a9c chore(feed/worker): prune unused registerWorker() (#8799) 2026-03-27 07:36:55 -07:00
Chris Lu
d34cf0d046 adjust default timing 2026-03-27 01:29:05 -07:00
Chris Lu
d97660d0cd filer.sync: pipelined subscription with adaptive batching for faster catch-up (#8791)
* filer.sync: pipelined subscription with adaptive batching for faster catch-up

The SubscribeMetadata pipeline was fully serial: reading a log entry from a
volume server, unmarshaling, filtering, and calling stream.Send() all happened
one-at-a-time. stream.Send() blocked the entire pipeline until the client
acknowledged each event, limiting throughput to ~80 events/sec regardless of
the -concurrency setting.

Three server-side optimizations that stack:

1. Pipelined sender: decouple stream.Send() from the read loop via a buffered
   channel (1024 messages). A dedicated goroutine handles gRPC delivery while
   the reader continues processing the next events.

2. Adaptive batching: when event timestamps are >2min behind wall clock
   (backlog catch-up), drain multiple events from the channel and pack them
   into a single stream.Send() using a new `repeated events` field on
   SubscribeMetadataResponse. When events are recent (real-time), send
   one-by-one for low latency. Old clients ignore the new field (backward
   compatible).

3. Persisted log readahead: run the OrderedLogVisitor in a background
   goroutine so volume server I/O for the next log file overlaps with event
   processing and gRPC delivery.

4. Event-driven aggregated subscription: replace time.Sleep(1127ms) polling
   in SubscribeMetadata with notification-driven wake-up using the
   MetaLogBuffer subscriber mechanism, reducing real-time latency from
   ~1127ms to sub-millisecond.

Combined, these create a 3-stage pipeline:
  [Volume I/O → readahead buffer] → [Filter → send buffer] → [gRPC Send]

Test results (simulated backlog with 50µs gRPC latency per Send):
  direct (old):        2100 events  2100 sends  168ms   12,512 events/sec
  pipelined+batched:   2100 events    14 sends   40ms   52,856 events/sec
  Speedup: 4.2x single-stream throughput

Ref: #8771

* filer.sync: require client opt-in for batch event delivery

Add ClientSupportsBatching field to SubscribeMetadataRequest. The server
only packs events into the Events batch field when the client explicitly
sets this flag to true. Old clients (Java SDK, third-party) that don't
set the flag get one-event-per-Send, preserving backward compatibility.

All Go callers (FollowMetadata, MetaAggregator) set the flag to true
since their recv loops already unpack batched events.

* filer.sync: clear batch Events field after Send to release references

Prevents the envelope message from holding references to the rest of the
batch after gRPC serialization, allowing the GC to collect them sooner.

* filer.sync: fix Send deadlock, add error propagation test, event-driven local subscribe

- pipelinedSender.Send: add case <-s.done to unblock when sender goroutine
  exits (fixes deadlock when errCh was already consumed by a prior Send).
- pipelinedSender.reportErr: remove for-range drain on sendCh that could
  block indefinitely. Send() now detects exit via s.done instead.
- SubscribeLocalMetadata: replace remaining time.Sleep(1127ms) in the
  gap-detected-no-memory-data path with event-driven listenersCond.Wait(),
  consistent with the rest of the subscription paths.
- Add TestPipelinedSenderErrorPropagation: verifies error surfaces via
  Send and Close when the underlying stream fails.
- Replace goto with labeled break in test simulatePipeline.

* filer.sync: check error returns in test code

- direct_send: check slowStream.Send error return
- pipelined_batched_send: check sender.Close error return
- simulatePipeline: return error from sender.Close, propagate to callers

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-26 23:55:42 -07:00
Chris Lu
8c8d21d7e2 Update plugin_lane_templ.go 2026-03-26 23:11:10 -07:00
Chris Lu
2604ec7deb Remove min_interval_seconds from plugin workers; vacuum default to 17m (#8790)
remove min_interval_seconds from plugin workers and default vacuum interval to 17m

The worker-level min_interval_seconds was redundant with the admin-side
DetectionIntervalSeconds, complicating scheduling logic. Remove it from
vacuum, volume_balance, erasure_coding, and ec_balance handlers.

Also change the vacuum default DetectionIntervalSeconds from 2 hours to
17 minutes to match the previous default behavior.
2026-03-26 23:04:36 -07:00
Chris Lu
f98d63fcd0 Merge branch 'master' of https://github.com/seaweedfs/seaweedfs 2026-03-26 19:41:02 -07:00
Chris Lu
db9ea7c87c fix(fuse-test): recover from FUSE directory loss in git pull test (#8789)
* fix(fuse-test): harden ensureMountClone to verify .git/HEAD

On FUSE, the kernel dcache can retain a stale directory entry after
heavy git operations. Checking only os.Stat(mountClone) may find
the top-level directory but the clone internals (.git/HEAD) are gone.

Now ensureMountClone verifies .git/HEAD exists, cleans up stale
remnants with os.RemoveAll before re-cloning, and adds a brief
settle delay for the FUSE metadata cache.

* fix(fuse-test): add pull recovery loop for Phase 6 git operations

After git reset --hard on a FUSE mount, the kernel dcache can
permanently lose the clone directory entry. The existing retry logic
polls for 60+ seconds but the directory never recovers.

Add pullFromCommitWithRecovery which wraps the Phase 6 pull in a
recovery loop: if the clone directory vanishes, it removes the stale
clone, re-creates it from the bare repo, resets to the target commit,
and retries the pull (up to 3 attempts).

Also adds tryGitCommand, a non-fatal git runner that returns
(output, error) instead of calling require.NoError, enabling the
recovery loop to handle failures gracefully without aborting the test.

Fixes flaky TestGitOperations/CloneAndPull on CI.

* fix(fuse-test): make recovery loop fully non-fatal

Address review feedback:

- Extract tryEnsureMountClone (returns error) so a transient FUSE
  failure during re-clone doesn't abort the test — the recovery loop
  can retry instead.
- Check waitForDirEventually return after git reset --hard and return
  a specific error if the directory doesn't recover.

* style(fuse-test): simplify ensureMountClone error check

* fix(fuse-test): recover bare repo when FUSE mount loses it

CI showed that not just the working clone but also the bare repo on
the FUSE mount can vanish after heavy git operations. The recovery
loop now re-creates the bare repo from the local clone (which lives
on local disk and is always available) before attempting to re-clone.

Adds tryEnsureBareRepo: checks HEAD exists in the bare repo, and if
not, re-inits and force-pushes from the local clone.
2026-03-26 19:33:56 -07:00
Chris Lu
cc2f790c73 feat: add per-lane scheduler status API and lane worker UI pages
- GET /api/plugin/lanes returns all lanes with status and job types
- GET /api/plugin/workers?lane=X filters workers by lane
- GET /api/plugin/scheduler-states?lane=X filters job types by lane
- GET /api/plugin/scheduler-status?lane=X returns lane-scoped status
- GET /plugin/lanes/{lane}/workers renders per-lane worker page
- SchedulerJobTypeState now includes a "lane" field

The lane worker pages show scheduler status, job type configuration,
and connected workers scoped to a single lane, with links back to
the main plugin overview.
2026-03-26 19:33:42 -07:00
Chris Lu
e3e015e108 feat: introduce scheduler lanes for independent per-workload scheduling
Split the single plugin scheduler loop into independent per-lane
goroutines so that volume management, iceberg compaction, and lifecycle
operations never block each other.

Each lane has its own:
- Goroutine (laneSchedulerLoop)
- Wake channel for immediate scheduling
- Admin lock scope (e.g. "plugin scheduler:default")
- Configurable idle sleep duration
- Loop state tracking

Three lanes are defined:
- default: vacuum, volume_balance, ec_balance, erasure_coding, admin_script
- iceberg: iceberg_maintenance
- lifecycle: s3_lifecycle (new, handler coming in a later commit)

Job types are mapped to lanes via a hardcoded map with LaneDefault as
the fallback. The SchedulerJobTypeState and SchedulerStatus types now
include a Lane field for API consumers.
2026-03-26 19:32:18 -07:00
Chris Lu
d95df76bca feat: separate scheduler lanes for iceberg, lifecycle, and volume management (#8787)
* feat: introduce scheduler lanes for independent per-workload scheduling

Split the single plugin scheduler loop into independent per-lane
goroutines so that volume management, iceberg compaction, and lifecycle
operations never block each other.

Each lane has its own:
- Goroutine (laneSchedulerLoop)
- Wake channel for immediate scheduling
- Admin lock scope (e.g. "plugin scheduler:default")
- Configurable idle sleep duration
- Loop state tracking

Three lanes are defined:
- default: vacuum, volume_balance, ec_balance, erasure_coding, admin_script
- iceberg: iceberg_maintenance
- lifecycle: s3_lifecycle (new, handler coming in a later commit)

Job types are mapped to lanes via a hardcoded map with LaneDefault as
the fallback. The SchedulerJobTypeState and SchedulerStatus types now
include a Lane field for API consumers.

* feat: per-lane execution reservation pools for resource isolation

Each scheduler lane now maintains its own execution reservation map
so that a busy volume lane cannot consume execution slots needed by
iceberg or lifecycle lanes. The per-lane pool is used by default when
dispatching jobs through the lane scheduler; the global pool remains
as a fallback for the public DispatchProposals API.

* feat: add per-lane scheduler status API and lane worker UI pages

- GET /api/plugin/lanes returns all lanes with status and job types
- GET /api/plugin/workers?lane=X filters workers by lane
- GET /api/plugin/scheduler-states?lane=X filters job types by lane
- GET /api/plugin/scheduler-status?lane=X returns lane-scoped status
- GET /plugin/lanes/{lane}/workers renders per-lane worker page
- SchedulerJobTypeState now includes a "lane" field

The lane worker pages show scheduler status, job type configuration,
and connected workers scoped to a single lane, with links back to
the main plugin overview.

* feat: add s3_lifecycle worker handler for object store lifecycle management

Implements a full plugin worker handler for S3 lifecycle management,
assigned to the new "lifecycle" scheduler lane.

Detection phase:
- Reads filer.conf to find buckets with TTL lifecycle rules
- Creates one job proposal per bucket with active lifecycle rules
- Supports bucket_filter wildcard pattern from admin config

Execution phase:
- Walks the bucket directory tree breadth-first
- Identifies expired objects by checking TtlSec + Crtime < now
- Deletes expired objects in configurable batches
- Reports progress with scanned/expired/error counts
- Supports dry_run mode for safe testing

Configurable via admin UI:
- batch_size: entries per filer listing page (default 1000)
- max_deletes_per_bucket: safety cap per run (default 10000)
- dry_run: detect without deleting
- delete_marker_cleanup: clean expired delete markers
- abort_mpu_days: abort stale multipart uploads

The handler integrates with the existing PutBucketLifecycle flow which
sets TtlSec on entries via filer.conf path rules.

* feat: add per-lane submenu items under Workers sidebar menu

Replace the single "Workers" sidebar link with a collapsible submenu
containing three lane entries:
- Default (volume management + admin scripts) -> /plugin
- Iceberg (table compaction) -> /plugin/lanes/iceberg/workers
- Lifecycle (S3 object expiration) -> /plugin/lanes/lifecycle/workers

The submenu auto-expands when on any /plugin page and highlights the
active lane. Icons match each lane's job type descriptor (server,
snowflake, hourglass).

* feat: scope plugin pages to their scheduler lane

The plugin overview, configuration, detection, queue, and execution
pages now filter workers, job types, scheduler states, and scheduler
status to only show data for their lane.

- Plugin() templ function accepts a lane parameter (default: "default")
- JavaScript appends ?lane= to /api/plugin/workers, /job-types,
  /scheduler-states, and /scheduler-status API calls
- GET /api/plugin/job-types now supports ?lane= filtering
- When ?job= is provided (e.g. ?job=iceberg_maintenance), the lane is
  auto-derived from the job type so the page scopes correctly

This ensures /plugin shows only default-lane workers and
/plugin/configuration?job=iceberg_maintenance scopes to the iceberg lane.

* fix: remove "Lane" from lane worker page titles and capitalize properly

"lifecycle Lane Workers" -> "Lifecycle Workers"
"iceberg Lane Workers" -> "Iceberg Workers"

* refactor: promote lane items to top-level sidebar menu entries

Move Default, Iceberg, and Lifecycle from a collapsible submenu to
direct top-level items under the WORKERS heading. Removes the
intermediate "Workers" parent link and collapse toggle.

* admin: unify plugin lane routes and handlers

* admin: filter plugin jobs and activities by lane

* admin: reuse plugin UI for worker lane pages

* fix: use ServerAddress.ToGrpcAddress() for filer connections in lifecycle handler

ClusterContext addresses use ServerAddress format (host:port.grpcPort).
Convert to the actual gRPC address via ToGrpcAddress() before dialing,
and add a Ping verification after connecting.

Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"

* fix: resolve ServerAddress gRPC port in iceberg and lifecycle filer connections

ClusterContext addresses use ServerAddress format (host:httpPort.grpcPort).
Both the iceberg and lifecycle handlers now detect the compound format
and extract the gRPC port via ToGrpcAddress() before dialing. Plain
host:port addresses (e.g. from tests) are passed through unchanged.

Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"

* align url

* Potential fix for code scanning alert no. 335: Incorrect conversion between integer types

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* fix: address PR review findings across scheduler lanes and lifecycle handler

- Fix variable shadowing: rename loop var `w` to `worker` in
  GetPluginWorkersAPI to avoid shadowing the http.ResponseWriter param
- Fix stale GetSchedulerStatus: aggregate loop states across all lanes
  instead of reading never-updated legacy schedulerLoopState
- Scope InProcessJobs to lane in GetLaneSchedulerStatus
- Fix AbortMPUDays=0 treated as unset: change <= 0 to < 0 so 0 disables
- Propagate listing errors in lifecycle bucket walk instead of swallowing
- Implement DeleteMarkerCleanup: scan for S3 delete marker entries and
  remove them
- Implement AbortMPUDays: scan .uploads directory and remove stale
  multipart uploads older than the configured threshold
- Fix success determination: mark job failed when result.errors > 0
  even if no fatal error occurred
- Add regression test for jobTypeLaneMap to catch drift from handler
  registrations

* fix: guard against nil result in lifecycle completion and trim filer addresses

- Guard result dereference in completion summary: use local vars
  defaulting to 0 when result is nil to prevent panic
- Append trimmed filer addresses instead of originals so whitespace
  is not passed to the gRPC dialer

* fix: propagate ctx cancellation from deleteExpiredObjects and add config logging

- deleteExpiredObjects now returns a third error value when the context
  is canceled mid-batch; the caller stops processing further batches
  and returns the cancellation error to the job completion handler
- readBoolConfig and readInt64Config now log unexpected ConfigValue
  types at V(1) for debugging, consistent with readStringConfig

* fix: propagate errors in lifecycle cleanup helpers and use correct delete marker key

- cleanupDeleteMarkers: return error on ctx cancellation and SeaweedList
  failures instead of silently continuing
- abortIncompleteMPUs: log SeaweedList errors instead of discarding
- isDeleteMarker: use ExtDeleteMarkerKey ("Seaweed-X-Amz-Delete-Marker")
  instead of ExtLatestVersionIsDeleteMarker which is for the parent entry
- batchSize cap: use math.MaxInt instead of math.MaxInt32

* fix: propagate ctx cancellation from abortIncompleteMPUs and log unrecognized bool strings

- abortIncompleteMPUs now returns (aborted, errors, ctxErr) matching
  cleanupDeleteMarkers; caller stops on cancellation or listing failure
- readBoolConfig logs unrecognized string values before falling back

* fix: shared per-bucket budget across lifecycle phases and allow cleanup without expired objects

- Thread a shared remaining counter through TTL deletion, delete marker
  cleanup, and MPU abort so the total operations per bucket never exceed
  MaxDeletesPerBucket
- Remove early return when no TTL-expired objects found so delete marker
  cleanup and MPU abort still run
- Add NOTE on cleanupDeleteMarkers about version-safety limitation

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-03-26 19:28:13 -07:00
Chris Lu
933fd5b474 feat: add SeaweedFS_upload_error_total Prometheus metric (PR #8788)
Add upload error counter labeled by HTTP status code, matching Go
commit 5fa5507. Code "0" indicates a transport error (no HTTP
response received). Incremented on replication failures in
do_replicated_request.
2026-03-26 17:28:57 -07:00
Chris Lu
ba624f1f34 Rust volume server implementation with CI (#8539)
* Match Go gRPC client transport defaults

* Honor Go HTTP idle timeout

* Honor maintenanceMBps during volume copy

* Honor images.fix.orientation on uploads

* Honor cpuprofile when pprof is disabled

* Match Go memory status payloads

* Propagate request IDs across gRPC calls

* Format pending Rust source updates

* Match Go stats endpoint payloads

* Serve Go volume server UI assets

* Enforce Go HTTP whitelist guards

* Align Rust metrics admin-port test with Go behavior

* Format pending Rust server updates

* Honor access.ui without per-request JWT checks

* Honor keepLocalDatFile in tier upload shortcut

* Honor Go remote volume write mode

* Load tier backends from master config

* Check master config before loading volumes

* Remove vif files on volume destroy

* Delete remote tier data on volume destroy

* Honor vif version defaults and overrides

* Reject mismatched vif bytes offsets

* Load remote-only tiered volumes

* Report Go tail offsets in sync status

* Stream remote dat in incremental copy

* Honor collection vif for EC shard config

* Persist EC expireAtSec in vif metadata

* Stream remote volume reads through HTTP

* Serve HTTP ranges from backend source

* Match Go ReadAllNeedles scan order

* Match Go CopyFile zero-stop metadata

* Delete EC volumes with collection cleanup

* Drop deleted collection metrics

* Match Go tombstone ReadNeedleMeta

* Match Go TTL parsing: all-digit default to minutes, two-pass fit algorithm

* Match Go needle ID/cookie formatting and name size computation

* Match Go image ext checks: webp resize only, no crop; empty healthz body

* Match Go Prometheus metric names and add missing handler counter constants

* Match Go ReplicaPlacement short string parsing with zero-padding

* Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS

* Add walk_ecx_stats for accurate EC volume file counts and size

* Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision

* Match Go needle map: unconditional delete counter, fix redb idx walk offset

* Add CompactMapSegment overflow panic guard matching Go

* Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback

* Match Go 304 Not Modified: return bare status with no headers

* Match Go JWT error message: use "wrong jwt" instead of detailed error

* Match Go read handler bare 400, delete error prefix, download throttle timeout

* Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix

* Match Go heartbeat: keep is_heartbeating on error, add EC shard identification

* Match Go needle ReadBytes V2: tolerate EOF on truncated body

* Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard

* Match Go DeleteCollection: propagate destroy errors

* Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code

* Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding

* Match Go Guard isWriteActive: combine whitelist and signing key check

* Match Go DeleteCollectionMetrics: use partial label matching

* Match Go heartbeat: send state-only delta on volume state changes

* Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking

* Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs

* Add read_ec_shard_needle for full needle reconstruction from local EC shards

* Make heartbeat master config helpers pub for VolumeCopy preallocation

* Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording

* Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error

* Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling

* Add rebuild_ecx_file for EC index reconstruction from data shards

* Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation

* Add EC volume read and delete support in HTTP handlers

* Add per-shard EC mount/unmount, location predicate search, idx directory for EC

* Add CheckVolumeDataIntegrity on volume load matching Go

* Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type

* Match Go heartbeat metric accounting

* Match Go duplicate UUID heartbeat retries

* Delete expired EC volumes during heartbeat

* Match Go volume heartbeat pruning

* Honor master preallocate in volume max

* Report remote storage info in heartbeats

* Emit EC heartbeat deltas on shard changes

* Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space

* Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns

* Match Go VolumeUnmount: idempotent success when volume not found

* Match Go TTL Display: return empty string when unit is Empty

Go checks `t.Unit == Empty` separately and returns "" for TTLs
with nonzero count but Empty unit. Rust only checked is_empty()
(count==0 && unit==0), so count>0 with unit=0 would format as
"5 " instead of "".

* Match Go error behavior for truncated needle data in read_body_v2

Go's readNeedleDataVersion2 returns "index out of range %d" errors
(indices 1-7) when needle body or metadata fields are truncated.
Rust was silently tolerating truncation and returning Ok. Now returns
NeedleError::IndexOutOfRange with the matching index for each field.

* Match Go download throttle: return JSON error instead of plain text

* Match Go crop params: default x1/y1 to 0 when not provided

* Match Go ScrubEcVolume: accumulate total_files from EC shards

* Match Go ScrubVolume: count total_files even on scrub error

* Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif

* Match Go VolumeTailSender: send needle_header on every chunk

* Match Go read_super_block: apply replication override from .vif

* Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption

* Match Go WriteNeedleBlob: dedup check before writing during replication

* handlers: use meta-only reads for HEAD

* handlers: align range parsing and responses with Go

* handlers: align upload parsing with Go

* deps: enable webp support

* Make 5bytes the default feature for idx entry compatibility

* Match Go TTL: preserve original unit when count fits in byte

* Fix EC locate_needle: use get_actual_size for full needle size

* Fix raw body POST: only parse multipart when Content-Type contains form-data

* Match Go ReceiveFile: return protocol errors in response body, not gRPC status

* add docs

* Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating

* Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing

* Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection

* Match Go privateStoreHandler: use proper JSON error for unsupported methods

* Match Go Destroy: add only_empty parameter to reject non-empty volume deletion

* Fix compilation: set_read_only_persist and set_writable return ()

These methods fire-and-forget save_vif internally, so gRPC callers
should not try to chain .map_err() on the unit return type.

* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif

* Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth

The gRPC VolumeDelete handler had a pre-check for only_empty but then
passed false to store.delete_volume(), bypassing the store-level check.
Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same
for defense in depth against TOCTOU races (though the store write lock
makes this unlikely).

* Match Go ProcessRangeRequest: return full content for empty/oversized ranges

Go returns nil from ProcessRangeRequest when ranges are empty or total
range size exceeds content length, causing the caller to serve the full
content as a normal 200 response. Rust was returning an empty 200 body.

* Match Go Query: quote JSON keys in output records

Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}.
Rust was producing invalid JSON with unquoted keys like {name:"Alice"}.

* Match Go VolumeCopy: reject when no suitable disk location exists

Go returns ErrVolumeNoSpaceLeft when no location matches the disk type
and has sufficient space. Rust had an unsafe fallback that silently
picked the first location regardless of type or available space.

* Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete

Go checks v.noWriteOrDelete before proceeding with needle deletion,
returning "volume is read only" if true. Rust was skipping this check.

* Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors

Two fixes: (1) Go prefers HardDriveType disk location for EC volumes,
falling back to first location. Returns "no storage location available"
when no locations exist. (2) Write failures are now response-level
errors (in response body) instead of gRPC status errors, matching Go.

* Match Go CopyFile: sync EC volume journal to disk before copying

Go calls ecVolume.Sync() before copying EC volume files to ensure the
.ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and
call it in the CopyFile EC branch.

* Match Go readSuperBlock: propagate replication parse errors

Go returns an error when parsing the replication string from the .vif
file fails. Rust was silently ignoring the parse failure and using the
super block's replication as-is.

* Match Go TTL expiry: remove append_at_ns > 0 guard

Go computes TTL expiry from AppendAtNs without guarding against zero.
When append_at_ns is 0, the expiry is epoch + TTL which is in the past,
correctly returning NotFound. Rust's extra guard skipped the check,
incorrectly returning success for such needles.

* Match Go delete_collection: skip volumes with compaction in progress

Go checks !v.isCompactionInProgress.Load() before destroying a volume
during collection deletion, skipping compacting volumes. Also changed
destroy errors to log instead of aborting the entire collection delete.

* Match Go MarkReadonly/MarkWritable: always notify master even on local error

Go always notifies the master regardless of whether the local
set_read_only_persist or set_writable step fails. The Rust code was
using `?` which short-circuited on error, skipping the final master
notification. Save the result and defer the `?` until after the
notify call.

* Match Go PostHandler: return 500 for all write errors

Go returns 500 (InternalServerError) for all write failures. Rust was
returning 404 for volume-not-found and 403 for read-only volumes.

* Match Go makeupDiff: validate .cpd compaction revision is old + 1

Go reads the new .cpd file's super block and verifies the compaction
revision is exactly old + 1. Rust only validated the old revision.

* Match Go VolumeStatus: check data backend before returning status

Go checks v.DataBackend != nil before building the status response,
returning an error if missing. Rust was silently returning size 0.

* Match Go PostHandler: always include mime field in upload response JSON

Go always serializes the mime field even when empty ("mime":""). Rust was
omitting it when empty due to Option<String> with skip_serializing_if.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go subtracts EC shard equivalents when computing available volume slots.
Rust was only comparing volume count, potentially over-counting free
slots on locations with many EC shards.

* Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods

Go records the method as INVALID in metrics for unsupported HTTP methods.
Rust was using the actual method name.

* Match Go volume: add commit_compact guard and scrub data size validation

Two fixes: (1) commit_compact now checks/sets is_compacting flag to
prevent concurrent commits, matching Go's CompareAndSwap guard.
(2) scrub now validates total needle sizes against .dat file size.

* Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check

Three fixes: (1) VolumeTailSender now propagates binary search errors
instead of silently falling back to start. (2) VolumeEcShardsInfo
returns entries for all shard slots including unmounted. (3)
VolumeEcShardRead checks .ecx index for deletions instead of .ecj.

* Match Go metrics: add BuildInfo gauge and connection tracking functions

Go exposes a BuildInfo Prometheus metric with version labels, and tracks
open connections via stats.ConnectionOpen/Close. Added both to Rust.

* Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid()

Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles
with size==0 (live but anomalous) can still be deleted. The Rust code
was using is_valid() which returns false for size==0, preventing
deletion of such needles.

* Match Go fitTtlCount: always normalize TTL to coarsest unit

Go's fitTtlCount always converts to seconds first, then finds the
coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an
early return for count<=255 that skipped normalization, producing
different binary encodings for the same duration.

* Match Go BuildInfo metric: correct name and add missing labels

Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build,
Name=info) with labels [version, commit, sizelimit, goos, goarch].
Rust had SeaweedFS_volumeServer_buildInfo with only [version].

* Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag

- UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty)
- UploadResult.contentMd5: only include when request provided Content-MD5 header
- Content-MD5 response header: only set when request provided it
- DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType)
  to match Go's protobuf JSON marshaling
- Chunk manifest: preserve needle ETag in expanded response headers

* Match Go volume: fix version(), integrity check, scrub, and commit_compact

- version(): use self.version() instead of self.super_block.version in
  read_all_needles, check_volume_data_integrity, scan_raw_needles_from
  to respect volumeInfo.version override
- check_volume_data_integrity: initialize healthy_index_size to idx_size
  (matching Go) and continue on EOF instead of returning error
- scrub(): count deleted needles in total_read since they still occupy
  space in the .dat file (matches Go's totalRead += actualSize for deleted)
- commit_compact: clean up .cpd/.cpx files on makeup_diff failure
  (matches Go's error path cleanup)

* Match Go write queue: add 4MB batch byte limit

Go's startWorker breaks the batch at either 128 requests or 4MB of
accumulated write data. Rust only had the 128-request limit, allowing
large writes to accumulate unbounded latency.

* Add TTL normalization tests for Go parity verification

Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even
when count fits in a byte, matching Go's fitTtlCount behavior.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go's free volume count subtracts both regular volumes and EC volumes
from max_volume_count. Rust was only counting regular volumes, which
could over-report available slots when EC shards are mounted.

* Match Go EC volume: mark deletions in .ecx and replay .ecj at startup

Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index
in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition
to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj
entries into .ecx on startup, then removes the .ecj file.

Rust was only appending to .ecj without marking .ecx, which meant
deleted EC needles remained readable via .ecx binary search. This
fix:
- Opens .ecx in read/write mode (was read-only)
- Adds mark_needle_deleted_in_ecx: binary search + in-place write
- Calls it from journal_delete before appending to .ecj
- Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup

* Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14

Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by
default, but custom EC configs via .vif can have more shards (up to
MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files
are checked regardless of EC configuration.

* Match Go EC locate: subtract 1 from shard size and use datFileSize override

Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to
LocateData (shards are padded, -1 avoids overcounting large block
rows). When datFileSize is known, Go uses datFileSize/DataShards
instead. Rust was passing the raw shard file size without adjustment.

* Fix TTL parsing and DiskStatus field names to match Go exactly

TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d,
not 1w) and errors on count > 255. The previous normalization change
was incorrect — Go only normalizes internally via fitTtlCount, not
during string parsing.

DiskStatus: Go uses encoding/json on protobuf structs, which reads
the json struct tags (snake_case: percent_free, percent_used,
disk_type), not the protobuf JSON names (camelCase). Revert to
snake_case to match Go's actual output.

* Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first

Match Go's volume_grpc_client_to_master.go behavior:
1. Only trigger leader redirect when the leader address differs from the
   current master (prevents unnecessary reconnect loops when master confirms
   its own address).
2. Process duplicated_uuids before leader redirect check, matching Go's
   ordering where duplicate UUID detection takes priority.

* Remove SetState version check to match Go behavior

Go's SetState unconditionally applies the state without any version
mismatch check. The Rust version had an extra optimistic concurrency
check that would reject valid requests from Go clients that don't
track versions.

* Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL

Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes
to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w,
24h->1d). The Rust version was preserving the original unit, producing
different binary encodings on disk and in heartbeat messages.

* Always return Content-MD5 header and JSON field on successful writes

Go always sets Content-MD5 in the response regardless of whether the
request included it. The Rust version was conditionally including it
only when the request provided Content-MD5.

* Include name and size in UploadResult JSON even when empty/zero

Go's encoding/json always includes empty strings and zero values in
the upload response. The Rust version was using skip_serializing_if
to omit them, causing JSON structure differences.

* Include deleted needles in scan_raw_needles_from to match Go

Go's ScanVolumeFileFrom visits ALL needles including deleted ones.
Skipping deleted entries during incremental copy would cause tombstones
to not be propagated, making deleted files reappear on the receiving side.

* Match Go NeedleMap.Delete: always write tombstone to idx file

Go's NeedleMap.Delete unconditionally writes a tombstone entry to the
idx file and updates metrics, even if the needle doesn't exist or is
already deleted. This is important for replication where every delete
operation must produce an idx write. The Rust version was skipping the
tombstone write for non-existent or already-deleted needles.

* Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest

* Title-case Seaweed-* pair keys to match Go HTTP header canonicalization

* Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType

* Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw()

* Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount

* Include disk-space-low in Volume.is_read_only() matching Go

* Log error on CIDR parse failure in whitelist matching Go's glog.Errorf

* Log cookie mismatch in gRPC Query matching Go's V(0).Infof

* Fix is_expired volume_size comparison to use < matching Go

Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than),
but Rust used `<=`. This meant Rust would fail to expire a volume that
is exactly SUPER_BLOCK_SIZE bytes.

* Apply Go's JWT expiry defaults: 10s write, 60s read

Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and
v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust
defaulted to 0 for both, which meant tokens would never expire when
security.toml has a signing key but omits expires_after_seconds.

* Stop [grpc.volume].ca from overriding [grpc].ca matching Go

Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e.
the [grpc] section. The [grpc.volume] section only provides cert and
key. Rust was also reading ca from [grpc.volume] which would silently
override the [grpc].ca value when both were present.

* Fix free_volume_count to use EC shard count matching Go

Was counting EC volumes instead of EC shards, which underestimates EC
space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1.
Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount.

* Include preallocate in compaction space check matching Go

Go uses max(preallocate, estimatedCompactSize) for the free space check.
Rust was only using the estimated volume size, which could start a
compaction that fails mid-way if preallocate exceeds the volume size.

* Check gzip magic bytes before setting Content-Encoding matching Go

Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent
(data starts with 0x1f 0x8b) before setting Content-Encoding: gzip.
Rust only checked Accept-Encoding, which could incorrectly declare
gzip encoding for non-gzip compressed data.

* Only set upload response name when needle HasName matching Go

Go checks reqNeedle.HasName() before setting ret.Name. Rust always set
the name from the filename variable, which could return the fid portion
of the path as the name for raw PUT requests without a filename.

* Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation

Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount
is 0, treating it as unlimited. Rust was computing effective_free as
<= 0 for max==0, rejecting the location. This could fail volume
creation during early startup before the first heartbeat adjusts max.

* Read lastAppendAtNs from deleted V3 entries in integrity check

Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries
(verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity).
Rust was skipping deleted entries, which could result in a stale
last_append_at_ns if the last index entry is a deletion.

* Return empty body for empty/oversized range requests matching Go

Go's ProcessRangeRequest returns nil (empty body, 200 OK) when
parsed ranges are empty or combined range size exceeds total content
size. The Rust buffered path incorrectly returned the full file data
for both cases. The streaming path already handled this correctly.

* Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL

Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex()
(ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls
vs.store.ScrubEcVolume(). Rust was ignoring the mode and always
running verify_ec_shards. Now INDEX mode checks ecx index integrity
(sorted overlap detection + file size validation) without shard I/O,
while LOCAL/FULL modes run the existing shard verification.

* Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount

Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit
that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week}
which displays as "1w". Both Go and Rust normalize identically.

* Add version mismatch check to SetState matching Go's State.Update

Go's State.Update compares the incoming version with the stored
version and returns "version mismatch" error if they differ. This
provides optimistic concurrency control. The Rust implementation
was accepting any version unconditionally.

* Use unquoted keys in Query JSON output matching Go's json.ToJson

Go's json.ToJson produces records with unquoted keys like
{score:12} not {"score":12}. This is a custom format used
internally by SeaweedFS for query results.

* Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w

Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount
which converts 7 days to 1 week.

* Include ETag header in 304 Not Modified responses matching Go behavior

Go sets ETag on the response writer (via SetEtag) before the
If-Modified-Since and If-None-Match conditional checks, so both
304 response paths include the ETag header. The Rust implementation
was only adding ETag to 200 responses.

* Remove needle-name fallback in chunk manifest filename resolution

Go's tryHandleChunkedFile only falls back from URL filename to
manifest name. Rust had an extra fallback to needle.name that
Go does not perform, which could produce different
Content-Disposition filenames for chunk manifests.

* Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5

Go's jwt.ParseWithClaims validates the nbf claim when present,
rejecting tokens whose nbf is in the future. The Rust jsonwebtoken
crate defaults validate_nbf to false, so tokens with future nbf
were incorrectly accepted.

* Set isHeartbeating to true at startup matching Go's VolumeServer init

Go unconditionally sets isHeartbeating: true in the VolumeServer
struct literal. Rust was starting with false when masters are
configured, causing /healthz to return 503 until the first
heartbeat succeeds.

* Call store.close() on shutdown matching Go's Shutdown()

Go's Shutdown() calls vs.store.Close() which closes all volumes
and flushes file handles. The Rust server was relying on process
exit for cleanup, which could leave data unflushed.

* Include server ID in maintenance mode error matching Go's format

Go returns "volume server %s is in maintenance mode" with the
store ID. Rust was returning a generic "maintenance mode" message.

* Fix DiskType test: use HardDrive variant matching Go's HddType=""

Go maps both "" and "hdd" to HardDriveType (empty string). The
Rust enum variant is HardDrive, not Hdd. The test referenced a
nonexistent Hdd variant causing compilation failure.

* Do not include ETag in 304 responses matching Go's GetOrHeadHandler

Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match
304 return paths, so Go's 304 responses do not include the ETag header.
The Rust code was incorrectly including ETag in both 304 response paths.

* Return 400 on malformed query strings in PostHandler matching Go's ParseForm

Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when
the query string is malformed. Rust was silently falling back to empty
query params via unwrap_or_default().

* Load EC volume version from .vif matching Go's NewEcVolume

Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif
file. Rust was always using Version::current() (V3), which would produce
wrong needle actual size calculations for volumes created with V1 or V2.

* Sync .ecx file before close matching Go's EcVolume.Close

Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion
marks are flushed to disk. Without this, deletion marks written via
MarkNeedleDeleted could be lost on crash.

* Validate SuperBlock extra data size matching Go's Bytes() guard

Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent
corrupt super block headers. Rust was silently truncating via u16 cast,
which would write an incorrect extra_size field.

* Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98

Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable
to unauthenticated remote DoS via panic in QUIC transport parameter
parsing.

* Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server

Go's r.FormValue() cannot read multipart text fields after
r.MultipartReader() consumes the body, so ts/ttl sent as multipart
form fields only work with the Rust volume server. Skip this test
when VOLUME_SERVER_IMPL != "rust" to fix CI failure.

* Flush .ecx in EC volume sync_to_disk matching Go's Sync()

Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index
to disk. The Rust version only flushed .ecj, leaving in-place deletion
marks in .ecx unpersisted until close(). This could cause data
inconsistency if the server crashes after marking a needle deleted in
.ecx but before close().

* Remove .vif file in EC volume destroy matching Go's Destroy()

Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust
version only removed .ecx and .ecj, leaving orphaned .vif files on
disk after EC volume destruction (e.g., after TTL expiry).

* Fix is_expired to use <= for SuperBlockSize check matching Go

Go checks contentSize <= SuperBlockSize to detect empty volumes (no
needles). Rust used < which would incorrectly allow a volume with
exactly SuperBlockSize bytes (header only, no data) to proceed to
the TTL expiry check and potentially be marked as expired.

* Fix read_append_at_ns to read timestamps from tombstone entries

Go reads the full needle body for all entries including tombstones
(deleted needles with size=0) to extract the actual AppendAtNs
timestamp. The Rust version returned 0 early for size <= 0 entries,
which would cause the binary search in incremental copy to produce
incorrect results for positions containing deleted needles.

Now uses get_actual_size to compute the on-disk size (which handles
tombstones correctly) and only returns 0 when the actual size is 0.

* Add X-Request-Id response header matching Go's requestIDMiddleware

Go sets both X-Request-Id and x-amz-request-id response headers.
The Rust server only set x-amz-request-id, missing X-Request-Id.

* Add skip_serializing_if for UploadResult name and size fields

Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty",
omitting these fields from JSON when they are zero values (empty
string / 0). The Rust struct always serialized them, producing
"name":"" and "size":0 where Go would omit them.

* Support JSONP/pretty-print for write success responses

Go's writeJsonQuiet checks for callback (JSONP) and pretty query
parameters on all JSON responses including write success. The Rust
write success path used axum::Json directly, bypassing JSONP and
pretty-print support. Now uses json_result_with_query to match Go.

* Include actual limit in file size limit error message

Go returns "file over the limited %d bytes" with the actual limit
value included. Rust returned a generic "file size limit exceeded"
without the limit value, making it harder to debug.

* Extract extension from 2-segment URL paths for image operations

Go's parseURLPath extracts the file extension from all URL formats
including 2-segment paths like /vid,fid.jpg. The Rust version only
handled 3-segment paths (/vid/fid/filename.ext), so extensions in
2-segment paths were lost. This caused image resize/crop operations
requested via query params to be silently skipped for those paths.

* Add size_hint to TrackedBody so throttled downloads get Content-Length

TrackedBody (used for download throttling) did not implement
size_hint(), causing HTTP/1.1 to fall back to chunked transfer
encoding instead of setting Content-Length. Go always sets
Content-Length explicitly for non-range responses.

* Add Last-Modified, pairs, and S3 headers to chunk manifest responses

Go sets Last-Modified, needle pairs, and S3 pass-through headers on
the response writer BEFORE calling tryHandleChunkedFile. Since the
Rust chunk manifest handler created fresh response headers and
returned early, these headers were missing from chunk manifest
responses. Now passes last_modified_str into the chunk manifest
handler and applies pairs and S3 pass-through query params
(response-cache-control, response-content-encoding, etc.) to the
chunk manifest response headers.

* Fix multipart fallback to use first part data when no filename

Go reads the first part's data unconditionally, then looks for a
part with a filename. If none found, Go uses the first part's data
(with empty filename). Rust only captured parts with filenames, so
when no part had a filename it fell back to the raw multipart body
bytes (including boundary delimiters), producing corrupt needle data.

* Set HasName and HasMime flags for empty values matching Go

Go's CreateNeedleFromRequest sets HasName and HasMime flags even when
the filename or MIME type is empty (len < 256 is true for len 0).
Rust skipped empty values, causing the on-disk needle format to
differ: Go-written needles include extra bytes for the empty name/mime
size fields, changing the serialized needle size in the idx entry.
This ensures binary format compatibility between Go and Rust servers.

* Add is_stopping guard to vacuum_volume_commit matching Go

Go's CommitCompactVolume (store_vacuum.go L53-54) checks
s.isStopping before committing compaction to prevent file
swaps during shutdown. The Rust handler was missing this
check, which could allow compaction commits while the
server is stopping.

* Remove disk_type from required status fields since Go omits it

Go's default DiskType is "" (HardDriveType), and protobuf's omitempty
tag causes empty strings to be dropped from JSON output.

* test: honor rust env in dual volume harness

* grpc: notify master after volume lifecycle changes

* http: proxy to replicas before download-limit timeout

* test: pass readMode to rust volume harnesses

* fix store free-location predicate selection

* fix volume copy disk placement and heartbeat notification

* fix chunk manifest delete replication

* fix write replication to survive client disconnects

* fix download limit proxy and wait flow

* fix crop gating for streamed reads

* fix upload limit wait counter behavior

* fix chunk manifest image transforms

* fix has_resize_ops to check width/height > 0 instead of is_some()

Go's shouldResizeImages condition is `width > 0 || height > 0`, so
`?width=0` correctly evaluates to false. Rust was using `is_some()`
which made `?width=0` evaluate to true, unnecessarily disabling
streaming reads for those requests.

* fix Content-MD5 to only compute and return when provided by client

Go only computes the MD5 of uncompressed data when a Content-MD5
header or multipart field is provided. Rust was always computing and
returning it. Also fix the mismatch error message to include size,
matching Go's format.

* fix save_vif to compute ExpireAtSec from TTL

Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds
when the volume has a TTL. The save_vif path (used by set_read_only
and set_writable) was missing this computation, causing .vif files
to be written without the correct expiration timestamp for TTL volumes.

* fix set_writable to not modify no_write_can_delete

Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists.
Rust was additionally setting no_write_can_delete=has_remote_file,
which could incorrectly change the write mode for remote-file volumes
when the master explicitly asks to make the volume writable.

* fix write_needle_blob_and_index to error on too-small V3 blob

Go returns an error when the needle blob is too small for timestamp
patching. Rust was silently skipping the patch and writing the blob
with a stale/zero timestamp, which could cause data integrity issues
during incremental replication that relies on AppendAtNs ordering.

* fix VolumeEcShardsToVolume to validate dataShards range

Go validates that dataShards is > 0 and <= MaxShardCount before
proceeding with EC-to-volume reconstruction. Without this check,
a zero or excessively large data_shards value could cause confusing
downstream failures.

* fix destroy to use VolumeError::NotEmpty instead of generic Io error

The dedicated NotEmpty variant exists in the enum but was not being
used. This makes error matching consistent with Go's ErrVolumeNotEmpty.

* fix SetState to persist state to disk with rollback on failure

Go's State.Update saves VolumeServerState to a state.pb file after
each SetState call, and rolls back the in-memory state if persistence
fails. Rust was only updating in-memory atomics, so maintenance mode
would be lost on server restart. Now saves protobuf-encoded state.pb
and loads it on startup.

* fix VolumeTierMoveDatToRemote to close local dat backend after upload

Go calls v.LoadRemoteFile() after saving volume info, which closes
the local DataBackend before transitioning to remote storage. Without
this, the volume holds a stale file handle to the deleted local .dat
file, causing reads to fail until server restart.

* fix VolumeTierMoveDatFromRemote to close remote dat backend after download

Go calls v.DataBackend.Close() and sets DataBackend=nil after removing
the remote file reference. Without this, the stale remote backend
state lingers and reads may not discover the newly downloaded local
.dat file until server restart.

* fix redirect to use internal url instead of public_url

Go's proxyReqToTargetServer builds the redirect Location header from
loc.Url (the internal URL), not publicUrl. Using public_url could
cause redirect failures when internal and external URLs differ.

* fix redirect test and add state_file_path to integration test

Update redirect unit test to expect internal url (matching the
previous fix). Add missing state_file_path field to the integration
test VolumeServerState constructor.

* fix FetchAndWriteNeedle to await all writes before checking errors

Go uses a WaitGroup to await all writes (local + replicas) before
checking errors. Rust was short-circuiting on local write failure,
which could leave replica writes in-flight without waiting for
completion.

* fix shutdown to send deregister heartbeat before pre_stop delay

Go's StopHeartbeat() closes stopChan immediately on interrupt, causing
the heartbeat goroutine to send the deregister heartbeat right away,
before the preStopSeconds delay. Rust was only setting is_stopping=true
without waking the heartbeat loop, so the deregister was delayed until
after the pre_stop sleep. Now we call volume_state_notify.notify_one()
to wake the heartbeat immediately.

* fix heartbeat response ordering to check duplicate UUIDs first

Go processes heartbeat responses in this order: DuplicatedUuids first,
then volume options (prealloc/size limit), then leader redirect. Rust
was applying volume options before checking for duplicate UUIDs, which
meant volume option changes would take effect even when the response
contained a duplicate UUID error that should cause an immediate return.

* the test thread was blocked

* fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories

Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which
fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode
bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check
logic error).

* fix: match Go Content-MD5 mismatch error message format

Go uses "Content-MD5 did not match md5 of file data expected [X]
received [Y] size Z" while Rust had a shorter format. Match the
exact Go error string so clients see identical messages.

* fix: match Go Bearer token length check (> 7, not >= 7)

Go requires len(bearer) > 7 ensuring at least one char after
"Bearer ". Rust used >= 7 which would accept an empty token.

* fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4

aws-sdk-s3's default "rustls" feature enables tls-rustls in
aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21
→ rustls-webpki 0.101.7, moderate CRL advisory). Replace with
explicit default-https-client which uses only rustls 0.23 /
rustls-webpki 0.103.9.

* fix: use uploaded filename for auto-compression extension detection

Go extracts the file extension from pu.FileName (the uploaded
filename) for auto-compression decisions. Rust was using the URL
path, which typically has no extension for SeaweedFS file IDs.

* fix: add CRC legacy Value() backward-compat check on needle read

Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) !=
crc.Value(). The Value() path is a deprecated transform for compat
with seaweed versions prior to commit 056c480eb. Rust had the
legacy_value() method but wasn't using it in validation.

* fix: remove /stats/* endpoints to match Go (commented out since L130)

Go's volume_server.go has the /stats/counter, /stats/memory, and
/stats/disk endpoints commented out (lines 130-134). Remove them
from the Rust router along with the now-unused whitelist_guard
middleware.

* fix: filter application/octet-stream MIME for chunk manifests

Go's tryHandleChunkedFile (L334) filters out application/octet-stream
from chunk manifest MIME types, falling back to extension-based
detection. Rust was returning the stored MIME as-is for manifests.

* fix: VolumeMarkWritable returns error before notifying master

Go returns early at L200 if MarkVolumeWritable fails, before
reaching the master notification at L206. Rust was notifying master
even on failure, creating inconsistent state where master thinks
the volume is writable but local marking failed.

* fix: check volume existence before maintenance in MarkReadonly/Writable

Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255)
look up the volume first, then call makeVolumeReadonly/Writable which
checks maintenance. Rust was checking maintenance first, returning
"maintenance mode" instead of "not found" for missing volumes.

* feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360)

Add the mark_broken_volumes_readonly flag from PR #8360:
- Sync proto field (tag 3) to local volume_server.proto
- After scrubbing, if flag is set, call makeVolumeReadonly on each
  broken volume (notify master, mark local readonly, notify again)
- Collect errors via joined error semantics matching Go's errors.Join
- Factor out make_volume_readonly helper reused by both
  VolumeMarkReadonly and ScrubVolume

Also refactors VolumeMarkReadonly to use the shared helper.

* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4)

CRL Distribution Point matching logic fix for moderate severity
advisory about CRLs not considered authoritative.

* test: update integration tests for removed /stats/* endpoints

Replace tests that expected /stats/* routes to return 200/401 with
tests confirming they now fall through to the store handler (400),
matching Go's commented-out stats endpoints.

* docs: fix misleading comment about default offset feature

The comment said "4-byte offsets unless explicitly built with 5-byte
support" but the default feature enables 5bytes. This is intentional
for production parity with Go -tags 5BytesOffset builds. Fix the
comment to match reality.
2026-03-26 17:24:35 -07:00
Chris Lu
5fa5507234 Add Prometheus metric to count upload errors (#8788)
Add Prometheus metric to count upload errors (#8775)

Add SeaweedFS_upload_error_total counter labeled by HTTP status code,
so operators can alert on write/replication failures. Code "0" indicates
a transport error (no HTTP response received).

Also add an "Upload Errors" panel to the Grafana dashboard.
2026-03-26 16:58:05 -07:00
Chris Lu
17028fbf59 fix: serialize SSE-KMS metadata when bucket default encryption applies KMS (#8780)
* fix: serialize SSE-KMS metadata when bucket default encryption applies KMS

When a bucket has default SSE-KMS encryption enabled and a file is uploaded
without explicit SSE headers, the encryption was applied correctly but the
SSE-KMS metadata (x-seaweedfs-sse-kms-key) was not serialized. This caused
downloads to fail with "empty SSE-KMS metadata" because the entry's Extended
map stored an empty byte slice.

The existing code already handled this for SSE-S3 bucket defaults
(SerializeSSES3Metadata) but was missing the equivalent call to
SerializeSSEKMSMetadata for the KMS path.

Fixes seaweedfs/seaweedfs#8776

* ci: add KMS integration tests to GitHub Actions

Add a kms-tests.yml workflow that runs on changes to KMS/SSE code with
two jobs:

1. KMS provider tests: starts OpenBao via Docker, runs Go integration
   tests in test/kms/ against a real KMS backend
2. S3 KMS e2e tests: starts OpenBao + weed mini built from source, runs
   test_s3_kms.sh which covers bucket-default SSE-KMS upload/download
   (the exact scenario from #8776)

Supporting changes:
- test/kms/Makefile: add CI targets (test-provider-ci, test-s3-kms-ci)
  that manage OpenBao via plain Docker and run weed from source
- test/kms/s3-config-openbao-template.json: S3 config template with
  OpenBao KMS provider for weed mini

* refactor: combine SSE-S3 and SSE-KMS metadata serialization into else-if

SSE-S3 and SSE-KMS bucket default encryption are mutually exclusive, so
use a single if/else-if block instead of two independent if blocks.

* Update .github/workflows/kms-tests.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix(ci): start weed mini from data dir to avoid Docker filer.toml

weed mini reads filer.toml from the current working directory first.
When running from test/kms/, it picked up the Docker-targeted filer.toml
which has dir="/data/filerdb" (a path that doesn't exist in CI), causing
a fatal crash at filer store initialization.

Fix by cd-ing to the data directory before starting weed mini.
Also improve log visibility on failure.

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2026-03-26 14:07:01 -07:00
Chris Lu
3a3fff1399 Fix TUS chunked upload and resume failures (#8783) (#8786)
* Fix TUS chunked upload and resume failures caused by request context cancellation (#8783)

The filer's TCP connections use a 10-second inactivity timeout (net_timeout.go).
After the TUS PATCH request body is fully consumed, internal operations (assigning
file IDs via gRPC to the master, uploading data to volume servers, completing uploads)
do not generate any activity on the client connection, so the inactivity timer fires
and Go's HTTP server cancels the request context. This caused HTTP 500 errors on
PATCH requests where body reading + internal processing exceeded the timeout.

Fix by using context.WithoutCancel in TUS create and patch handlers, matching the
existing pattern used by assignNewFileInfo. This ensures internal operations complete
regardless of client connection state.

Fixes seaweedfs/seaweedfs#8783

* Add comment to tusCreateHandler explaining context.WithoutCancel rationale

* Run TUS integration tests on all PRs, not just TUS file changes

The previous path filter meant these tests only ran when TUS-specific files
changed. This allowed regressions from changes to shared infrastructure
(net_timeout.go, upload paths, gRPC) to go undetected — which is exactly
how the context cancellation bug in #8783 was missed. Matches the pattern
used by s3-go-tests.yml.
2026-03-26 14:06:21 -07:00
dependabot[bot]
77e4b92432 build(deps): bump io.netty:netty-codec-http2 from 4.1.129.Final to 4.1.132.Final in /test/java/spark (#8785)
build(deps): bump io.netty:netty-codec-http2 in /test/java/spark

Bumps [io.netty:netty-codec-http2](https://github.com/netty/netty) from 4.1.129.Final to 4.1.132.Final.
- [Release notes](https://github.com/netty/netty/releases)
- [Commits](https://github.com/netty/netty/compare/netty-4.1.129.Final...netty-4.1.132.Final)

---
updated-dependencies:
- dependency-name: io.netty:netty-codec-http2
  dependency-version: 4.1.132.Final
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 13:00:18 -07:00
dependabot[bot]
8558c586a0 build(deps): bump io.netty:netty-codec-http from 4.1.129.Final to 4.1.132.Final in /test/java/spark (#8784)
build(deps): bump io.netty:netty-codec-http in /test/java/spark

Bumps [io.netty:netty-codec-http](https://github.com/netty/netty) from 4.1.129.Final to 4.1.132.Final.
- [Release notes](https://github.com/netty/netty/releases)
- [Commits](https://github.com/netty/netty/compare/netty-4.1.129.Final...netty-4.1.132.Final)

---
updated-dependencies:
- dependency-name: io.netty:netty-codec-http
  dependency-version: 4.1.132.Final
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-26 13:00:03 -07:00
Lars Lehtonen
e8888765a1 fix(weed/filer/store_test): fix dropped errors (#8782) 2026-03-26 12:07:48 -07:00
Chris Lu
92c2fc0d52 Add insecure_skip_verify option for HTTPS client in security.toml (#8781)
* Add -insecureSkipVerify flag and config option for filer.sync HTTPS connections

When using filer.sync between clusters with different CAs (e.g., separate
OpenShift clusters), TLS certificate verification fails with "x509:
certificate signed by unknown authority". This adds two ways to skip TLS
certificate verification:

1. CLI flag: `weed filer.sync -insecureSkipVerify ...`
2. Config option: `insecure_skip_verify = true` under [https.client] in
   security.toml

Closes #8778

* Add insecure_skip_verify option for HTTPS client in security.toml

When using filer.sync between clusters with different CAs (e.g., separate
OpenShift clusters), TLS certificate verification fails. Adding
insecure_skip_verify = true under [https.client] in security.toml allows
skipping TLS certificate verification.

The option is read during global HTTP client initialization so it applies
to all HTTPS connections including filer.sync proxy reads and writes.

Closes #8778

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-26 11:42:47 -07:00
Chris Lu
aa12b51cbf test: restore coverage removed in PR #8360 (#8779)
* test: restore maintenance mode coverage in TestVolumeMarkReadonlyWritableErrorPaths

PR #8360 removed the maintenance mode assertions because the refactored
check ordering (volume lookup before maintenance check) caused the
original test to hit "not found" instead of "maintenance mode" — the
test used a non-existent volume ID.

Restore coverage by allocating a real volume, then verifying:
- existing volume in maintenance mode returns "maintenance mode"
- non-existent volume in maintenance mode still returns "not found"
  (validating the new check ordering)

* test: add coverage for ScrubVolume MarkBrokenVolumesReadonly flag

PR #8360 added the mark_broken_volumes_readonly field to ScrubVolumeRequest
but no tests exercised the new logic paths. Add three integration tests:

- HealthyVolume: flag is a no-op when scrub finds no broken volumes
- CorruptVolume: corrupted .idx triggers broken detection; without the
  flag the volume stays writable, with the flag it becomes read-only
- MaintenanceMode: makeVolumeReadonly fails under maintenance and
  ScrubVolume propagates the error via errors.Join

* refactor: extract CorruptIndexFile and EnableMaintenanceMode test helpers

Move duplicated idx corruption and maintenance mode setup into
framework.CorruptIndexFile() and framework.EnableMaintenanceMode()
helpers. Use defer for file close in the corruption helper.
2026-03-26 10:52:37 -07:00
Lisandro Pin
e5cf2d2a19 Give the ScrubVolume() RPC an option to flag found broken volumes as read-only. (#8360)
* Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only.

Also exposes this option in the shell `volume.scrub` command.

* Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`.

417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`,
so calling them for invalid volume IDs will actually yield that error, instead of checking
maintnenance mode first.
2026-03-26 10:20:57 -07:00
Jaehoon Kim
6cf34f2376 Add -filerExcludePathPattern flag and fix nil panic in -filerExcludeFileName (#8756)
* Fix filerExcludeFileName to support directory names and path components

The original implementation only matched excludeFileName against
message.NewEntry.Name, which caused two issues:

1. Nil pointer panic on delete events (NewEntry is nil)
2. Files inside excluded directories were still backed up because
   the parent directory name was not checked

This patch:
- Checks all path components in resp.Directory against the regexp
- Adds nil guard for message.NewEntry before accessing .Name
- Also checks message.OldEntry.Name for rename/delete events

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add -filerExcludePathPattern flag and fix nil panic in filerExcludeFileName

Separate concerns between two exclude mechanisms:
- filerExcludeFileName: matches entry name only (leaf node)
- filerExcludePathPattern (NEW): matches any path component via regexp,
  so files inside matched directories are also excluded

Also fixes nil pointer panic when filerExcludeFileName encounters
delete events where NewEntry is nil.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Refactor exclude logic: per-side exclusion for rename events, reduce duplication

- Extract isEntryExcluded() to compute exclusion per old/new side,
  so rename events crossing an exclude boundary are handled as
  delete + create instead of being entirely skipped
- Extract compileExcludePattern() to deduplicate regexp compilation
- Replace strings.Split with allocation-free pathContainsMatch()
- Check message.NewParentPath (not just resp.Directory) for new side

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Move regexp compilation out of retry loop to fail fast on config errors

compileExcludePattern for -filerExcludeFileName and -filerExcludePathPattern
are configuration-time validations that will never succeed on retry.
Move them to runFilerBackup before the reconnect loop and use glog.Fatalf
on failure, so invalid patterns are caught immediately at startup instead
of being retried every 1.7 seconds indefinitely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add wildcard matching helpers for path and filename exclusion

* Replace regexp exclude patterns with wildcard-based flags, deprecate -filerExcludeFileName

Add -filerExcludeFileNames and -filerExcludePathPatterns flags that accept
comma-separated wildcard patterns (*, ?) using the existing wildcard library.
Mark -filerExcludeFileName as deprecated but keep its regexp behavior.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-26 10:04:06 -07:00
Chris Lu
ccc662b90b shell: add s3.bucket.access command for anonymous access policy (#8774)
* shell: add s3.bucket.access command for anonymous access policy (#7738)

Add a new weed shell command to view or change the anonymous access
policy of an S3 bucket without external tools.

Usage:
  s3.bucket.access -name <bucket> -access read,list
  s3.bucket.access -name <bucket> -access none

Supported permissions: read, write, list. The command writes a standard
bucket policy with Principal "*" and warns if no anonymous IAM identity
exists.

* shell: fix anonymous identity hint in s3.bucket.access warning

The anonymous identity doesn't need IAM actions — the bucket policy
controls what anonymous users can do.

* shell: only warn about anonymous identity when write access is set

Read and list operations use AuthWithPublicRead which evaluates bucket
policies directly without requiring the anonymous identity. Only write
operations go through the normal auth flow that needs it.

* shell: rewrite s3.bucket.access to use IAM actions instead of bucket policies

Replace the bucket policy approach with direct IAM identity actions,
matching the s3.configure pattern. The user is auto-created if it does
not exist.

Usage:
  s3.bucket.access -name <bucket> -user anonymous -access Read,List
  s3.bucket.access -name <bucket> -user anonymous -access none
  s3.bucket.access -name <bucket> -user anonymous

Actions are stored as "Action:bucket" on the identity, same as
s3.configure -actions=Read -buckets=my-bucket.

* shell: return flag parse errors instead of swallowing them

* shell: normalize action names case-insensitively in s3.bucket.access

Accept actions in any case (read, READ, Read) and normalize to canonical
form (Read, Write, List, etc.) before storing. This matches the
case-insensitive handling of "none" and avoids confusing rejections.
2026-03-25 23:09:53 -07:00
Chris Lu
67a551fd62 admin UI: add anonymous user creation checkbox (#8773)
Add an "Anonymous" checkbox next to the username field in the Create User
modal. When checked, the username is set to "anonymous" and the credential
generation checkbox is disabled since anonymous users do not need keys.

The checkbox is only shown when no anonymous user exists yet. The
manage-access-keys button in the users table is hidden for the anonymous
user.
2026-03-25 21:24:10 -07:00
Chris Lu
94bfa2b340 mount: stream all filer mutations over single ordered gRPC stream (#8770)
* filer: add StreamMutateEntry bidi streaming RPC

Add a bidirectional streaming RPC that carries all filer mutation
types (create, update, delete, rename) over a single ordered stream.
This eliminates per-request connection overhead for pipelined
operations and guarantees mutation ordering within a stream.

The server handler delegates each request to the existing unary
handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy
stream adapter for rename operations to reuse StreamRenameEntry logic.

The is_last field signals completion for multi-response operations
(rename sends multiple events per request; create/update/delete
always send exactly one response with is_last=true).

* mount: add streaming mutation multiplexer (streamMutateMux)

Implement a client-side multiplexer that routes all filer mutation RPCs
(create, update, delete, rename) over a single bidirectional gRPC
stream. Multiple goroutines submit requests through a send channel;
a dedicated sendLoop serializes them on the stream; a recvLoop
dispatches responses to waiting callers via per-request channels.

Key features:
- Lazy stream opening on first use
- Automatic reconnection on stream failure
- Permanent fallback to unary RPCs if filer returns Unimplemented
- Monotonic request_id for response correlation
- Multi-response support for rename operations (is_last signaling)

The mux is initialized on WFS and closed during unmount cleanup.
No call sites use it yet — wiring comes in subsequent commits.

* mount: route CreateEntry and UpdateEntry through streaming mux

Wire all CreateEntry call sites to use wfs.streamCreateEntry() which
routes through the StreamMutateEntry stream when available, falling
back to unary RPCs otherwise. Also wire Link's UpdateEntry calls
through wfs.streamUpdateEntry().

Updated call sites:
- flushMetadataToFiler (file flush after write)
- Mkdir (directory creation)
- Symlink (symbolic link creation)
- createRegularFile non-deferred path (Mknod)
- flushFileMetadata (periodic metadata flush)
- Link (hard link: update source + create link + rollback)

* mount: route UpdateEntry and DeleteEntry through streaming mux

Wire remaining mutation call sites through the streaming mux:

- saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry
- Unlink → streamDeleteEntry (replaces RemoveWithResponse)
- Rmdir → streamDeleteEntry (replaces RemoveWithResponse)

All filer mutations except Rename now go through StreamMutateEntry
when the filer supports it, with automatic unary RPC fallback.

* mount: route Rename through streaming mux

Wire Rename to use streamMutate.Rename() when available, with
fallback to the existing StreamRenameEntry unary stream.

The streaming mux sends rename as a StreamRenameEntryRequest oneof
variant. The server processes it through the existing rename logic
and sends multiple StreamRenameEntryResponse events (one per moved
entry), with is_last=true on the final response.

All filer mutations now go through a single ordered stream.

* mount: fix stream mux connection ownership

WithGrpcClient(streamingMode=true) closes the gRPC connection when
the callback returns, destroying the stream. Own the connection
directly via pb.GrpcDial so it stays alive for the stream's lifetime.
Close it explicitly in recvLoop on stream failure and in Close on
shutdown.

* mount: fix rename failure for deferred-create files

Three fixes for rename operations over the streaming mux:

1. lookupEntry: fall back to local metadata store when filer returns
   "not found" for entries in uncached directories. Files created with
   deferFilerCreate=true exist only in the local leveldb store until
   flushed; lookupEntry skipped the local store when the parent
   directory had never been readdir'd, causing rename to fail with
   ENOENT.

2. Rename: wait for pending async flushes and force synchronous flush
   of dirty metadata before sending rename to the filer. Covers the
   writebackCache case where close() defers the flush to a background
   worker that may not complete before rename fires.

3. StreamMutateEntry: propagate rename errors from server to client.
   Add error/errno fields to StreamMutateEntryResponse so the mount
   can map filer errors to correct FUSE status codes instead of
   silently returning OK. Also fix the existing Rename error handler
   which could return fuse.OK on unrecognized errors.

* mount: fix streaming mux error handling, sendLoop lifecycle, and fallback

Address PR review comments:

1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for
   create/update/delete errors, not just rename. Previously update errors
   were silently dropped and create/delete errors were only in nested
   response fields that the client didn't check.

2. Client: check nested error fields in CreateEntry (ErrorCode, Error)
   and DeleteEntry (Error) responses, matching CreateEntryWithResponse
   behavior.

3. Fix sendLoop lifecycle: give each stream generation a stopSend channel.
   recvLoop closes it on error to stop the paired sendLoop. Previously a
   reconnect left the old sendLoop draining sendCh, breaking ordering.

4. Transparent fallback: stream helpers and doRename fall back to unary
   RPCs on transport errors (ErrStreamTransport), including the first
   Unimplemented from ensureStream. Previously the first call failed
   instead of degrading.

5. Filer rotation in openStream: try all filer addresses on dial failure,
   matching WithFilerClient behavior. Stop early on Unimplemented.

6. Pass metadata-bearing context to StreamMutateEntry RPC call so
   sw-client-id header is actually sent.

7. Gate lookupEntry local-cache fallback on open dirty handle or pending
   async flush to avoid resurrecting deleted/renamed entries.

8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil).

9. Use string matching for rename error-to-errno mapping in the mount to
   stay portable across Linux/macOS (numeric errno values differ).

* mount: make failAllPending idempotent with delete-before-close

Change failAllPending to collect pending entries into a local slice
(deleting from the sync.Map first) before closing channels. This
prevents double-close panics if called concurrently. Also remove the
unused err parameter.

* mount: add stream generation tracking and teardownStream

Introduce a generation counter on streamMutateMux that increments each
time a new stream is created. Requests carry the generation they were
enqueued for so sendLoop can reject stale requests after reconnect.

Add teardownStream(gen) which is idempotent (only acts when gen matches
current generation and stream is non-nil). Both sendLoop and recvLoop
call it on error, replacing the inline cleanup in recvLoop. sendLoop now
actively triggers teardown on send errors instead of silently exiting.

ensureStream waits for the prior generation's recvDone before creating a
new stream, ensuring all old pending waiters are failed before reconnect.
recvLoop now takes the stream, generation, and recvDone channel as
parameters to avoid accessing shared fields without the lock.

* mount: harden Close to prevent races with teardownStream

Nil out stream, cancel, and grpcConn under the lock so that any
concurrent teardownStream call from recvLoop/sendLoop becomes a no-op.
Call failAllPending before closing sendCh to unblock waiters promptly.
Guard recvDone with a nil check for the case where Close is called
before any stream was ever opened.

* mount: make errCh receive ctx-aware in doUnary and Rename

Replace the blocking <-sendReq.errCh with a select that also observes
ctx.Done(). If sendLoop exits via stopSend without consuming a buffered
request, the caller now returns ctx.Err() instead of blocking forever.
The buffered errCh (capacity 1) ensures late acknowledgements from
sendLoop don't block the sender.

* mount: fix sendLoop/Close race and recvLoop/teardown pending channel race

Three related fixes:

1. Stop closing sendCh in Close(). Closing the shared producer channel
   races with callers who passed ensureStream() but haven't sent yet,
   causing send-on-closed-channel panics. sendCh is now left open;
   ensureStream checks m.closed to reject new callers.

2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh()
   on exit so buffered requests get an ErrStreamTransport on their errCh
   instead of blocking forever. Close() drains again for any stragglers
   enqueued between sendLoop's drain and the final shutdown.

3. Move failAllPending from teardownStream into recvLoop's defer.
   teardownStream (called from sendLoop on send error) was closing
   pending response channels while recvLoop could be between
   pending.Load and the channel send — a send-on-closed-channel panic.
   recvLoop is now the sole closer of pending channels, eliminating
   the race. Close() waits on recvDone (with cancel() to guarantee
   Recv unblocks) so pending cleanup always completes.

* filer/mount: add debug logging for hardlink lifecycle

Add V(0) logging at every point where a HardLinkId is created, stored,
read, or deleted to trace orphaned hardlink references. Logging covers:

- gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId
- FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId
- handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count
- setHardLink: KvPut with blob size
- maybeReadHardLink: V(1) on read attempt and successful decode
- DeleteHardLink: counter decrement/deletion events
- Mount Link(): when NewHardLinkId is generated and link is created

This helps diagnose how a git pack .rev file ended up with a
HardLinkId during a clone (no hard links should be involved).

* test: add git clone/pull integration test for FUSE mount

Shell script that exercises git operations on a SeaweedFS mount:
1. Creates a bare repo on the mount
2. Clones locally, makes 3 commits, pushes to mount
3. Clones from mount bare repo into an on-mount working dir
4. Verifies clone integrity (files, content, commit hashes)
5. Pushes 2 more commits with renames and deletes
6. Checks out an older revision on the mount clone
7. Returns to branch and pulls with real changes
8. Verifies file content, renames, deletes after pull
9. Checks git log integrity and clean status

27 assertions covering file existence, content, commit hashes,
file counts, renames, deletes, and git status. Run against any
existing mount: bash test-git-on-mount.sh /path/to/mount

* test: add git clone/pull FUSE integration test to CI suite

Add TestGitOperations to the existing fuse_integration test framework.
The test exercises git's full file operation surface on the mount:

  1. Creates a bare repo on the mount (acts as remote)
  2. Clones locally, makes 3 commits (files, bulk data, renames), pushes
  3. Clones from mount bare repo into an on-mount working dir
  4. Verifies clone integrity (content, commit hash, file count)
  5. Pushes 2 more commits with new files, renames, and deletes
  6. Checks out an older revision on the mount clone
  7. Returns to branch and pulls with real fast-forward changes
  8. Verifies post-pull state: content, renames, deletes, file counts
  9. Checks git log integrity (5 commits) and clean status

Runs automatically in the existing fuse-integration.yml CI workflow.

* mount: fix permission check with uid/gid mapping

The permission checks in createRegularFile() and Access() compared
the caller's local uid/gid against the entry's filer-side uid/gid
without applying the uid/gid mapper. With -map.uid 501:0, a directory
created as uid 0 on the filer would not match the local caller uid
501, causing hasAccess() to fall through to "other" permission bits
and reject write access (0755 → other has r-x, no w).

Fix: map entry uid/gid from filer-space to local-space before the
hasAccess() call so both sides are in the same namespace.

This fixes rsync -a failing with "Permission denied" on mkstempat
when using uid/gid mapping.

* mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel

Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to
restore local uid/gid, but `outputPbEntry` writes the kernel response
before the function returns — so the kernel received filer-side
uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent
child operations (mkdir, create) because the caller uid (501) doesn't
match the directory owner (0), and "other" bits (0755 → r-x) lack
write permission.

Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal
before outputPbEntry, so the kernel gets local uid/gid.

Also add nil guards for UidGidMapper in Access and createRegularFile
to prevent panics in tests that don't configure a mapper.

This fixes rsync -a "Permission denied" on mkpathat for nested
directories when using uid/gid mapping.

* mount: fix Link outputting filer-side uid/gid to kernel, add nil guards

Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink:
the kernel received filer-side uid/gid because the defer hadn't run
yet when outputPbEntry wrote the response.

Also add nil guards for UidGidMapper in Access and createRegularFile
so tests without a mapper don't panic.

Audit of all outputPbEntry/outputFilerEntry call sites:
- Mkdir: fixed in prior commit (explicit map before output)
- Symlink: fixed in prior commit (explicit map before output)
- Link: fixed here (explicit map before output)
- Create (existing file): entry from maybeLoadEntry (already mapped)
- Create (deferred): entry has local uid/gid (never mapped to filer)
- Create (non-deferred): createRegularFile defer runs before return
- Mknod: createRegularFile defer runs before return
- Lookup: entry from lookupEntry (already mapped)
- GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped)
- readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped)
- saveEntry: no kernel output
- flushMetadataToFiler: no kernel output
- flushFileMetadata: no kernel output

* test: fix git test for same-filesystem FUSE clone

When both the bare repo and working clone live on the same FUSE mount,
git's local transport uses hardlinks and cross-repo stat calls that
fail on FUSE. Fix:

- Use --no-local on clone to disable local transport optimizations
- Use reset --hard instead of checkout to stay on branch
- Use fetch + reset --hard origin/<branch> instead of git pull to
  avoid local transport stat failures during fetch

* adjust logging

* test: use plain git clone/pull to exercise real FUSE behavior

Remove --no-local and fetch+reset workarounds. The test should use
the same git commands users run (clone, reset --hard, pull) so it
reveals real FUSE issues rather than hiding them.

* test: enable V(1) logging for filer/mount and collect logs on failure

- Run filer and mount with -v=1 so hardlink lifecycle logs (V(0):
  create/delete/insert, V(1): read attempts) are captured
- On test failure, automatically dump last 16KB of all process logs
  (master, volume, filer, mount) to test output
- Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload
- Update CI workflow to upload SeaweedFS process logs alongside test output

* mount: clone entry for filer flush to prevent uid/gid race

flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which
returns the file handle's live proto entry pointer, then mutated it
in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a
concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable)
could observe filer-side uid/gid (e.g., 0:0) on the file handle entry
and return it to the kernel. The kernel caches these attributes, so
subsequent opens by the local user (uid 501) fail with EACCES.

Fix: proto.Clone the entry before mapping uid/gid for the filer request.
The file handle's live entry is never mutated, so concurrent Lookup
always sees local uid/gid.

This fixes the intermittent "Permission denied" on .git/FETCH_HEAD
after the first git pull on a mount with uid/gid mapping.

* mount: add debug logging for stale lock file investigation

Add V(0) logging to trace the HEAD.lock recreation issue:
- Create: log when O_EXCL fails (file already exists) with uid/gid/mode
- completeAsyncFlush: log resolved path, saved path, dirtyMetadata,
  isDeleted at entry to trace whether async flush fires after rename
- flushMetadataToFiler: log the dir/name/fullpath being flushed

This will show whether the async flush is recreating the lock file
after git renames HEAD.lock → HEAD.

* mount: prevent async flush from recreating renamed .lock files

When git renames HEAD.lock → HEAD, the async flush from the prior
close() can run AFTER the rename and re-insert HEAD.lock into the
meta cache via its CreateEntryRequest response event. The next git
pull then sees HEAD.lock and fails with "File exists".

Fix: add isRenamed flag on FileHandle, set by Rename before waiting
for the pending async flush. The async flush checks this flag and
skips the metadata flush for renamed files (same pattern as isDeleted
for unlinked files). The data pages still flush normally.

The Rename handler flushes deferred metadata synchronously (Case 1)
before setting isRenamed, ensuring the entry exists on the filer for
the rename to proceed. For already-released handles (Case 2), the
entry was created by a prior flush.

* mount: also mark renamed inodes via entry.Attributes.Inode fallback

When GetInode fails (Forget already removed the inode mapping), the
Rename handler couldn't find the pending async flush to set isRenamed.
The async flush then recreated the .lock file on the filer.

Fix: fall back to oldEntry.Attributes.Inode to find the pending async
flush when the inode-to-path mapping is gone. Also extract
MarkInodeRenamed into a method on FileHandleToInode for clarity.

* mount: skip async metadata flush when saved path no longer maps to inode

The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock
because neither GetInode nor oldEntry.Attributes.Inode could find the
inode (Forget already evicted the mapping, and the entry's stored
inode was 0).

Add a direct check in completeAsyncFlush: before flushing metadata,
verify that the saved path still maps to this inode in the inode-to-path
table. If the path was renamed or removed (inode mismatch or not found),
skip the metadata flush to avoid recreating a stale entry.

This catches all rename cases regardless of whether the Rename handler
could set the isRenamed flag.

* mount: wait for pending async flush in Unlink before filer delete

Unlink was deleting the filer entry first, then marking the draining
async-flush handle as deleted. The async flush worker could race between
these two operations and recreate the just-unlinked entry on the filer.
This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to
persist after git pull, breaking subsequent git operations.

Move the isDeleted marking and add waitForPendingAsyncFlush() before
the filer delete so any in-flight flush completes first. Even if the
worker raced past the isDeleted check, the wait ensures it finishes
before the filer delete cleans up any recreated entry.

* mount: reduce async flush and metadata flush log verbosity

Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and
flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for
every file close with writebackCache and are too noisy for normal use.

* filer: reduce hardlink debug log verbosity from V(0) to V(4)

HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and
filer_grpc_server fire on every hardlinked file operation (git pack
files use hardlinks extensively) and produce excessive noise.

* mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check

- weedfs_link.go: hardlink creation logs V(0) → V(4)
- weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1)
- empty_folder_cleaner.go: "not empty" check log V(0) → V(4)

* filer: handle missing hardlink KV as expected, not error

A "kv: not found" on hardlink read is normal when the link blob was
already cleaned up but a stale entry still references it. Log at V(1)
for not-found; keep Error level for actual KV failures.

* test: add waitForDir before git pull in FUSE git operations test

After git reset --hard, the FUSE mount's metadata cache may need a
moment to settle on slow CI. The git pull subprocess (unpack-objects)
could fail to stat the working directory. Poll for up to 5s.

* Update git_operations_test.go

* wait

* test: simplify FUSE test framework to use weed mini

Replace the 4-process setup (master + volume + filer + mount) with
2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies
startup, reduces port allocation, and is faster on CI.

* test: fix mini flag -admin → -admin.ui
2026-03-25 20:06:34 -07:00
Chris Lu
29bdbb3c48 filer.sync: replace O(n) conflict check with O(depth) index lookups (#8772)
* filer.sync: replace O(n) conflict check with O(depth) index lookups

The MetadataProcessor.conflictsWith() scanned all active jobs linearly
for every new event dispatch. At high concurrency (256-1024), this O(n)
scan under the activeJobsLock became a bottleneck that throttled the
event dispatch pipeline, negating the benefit of higher -concurrency
values.

Replace the linear scan with three index maps:
- activeFilePaths: O(1) exact file path lookup
- activeDirPaths: O(1) directory path lookup per ancestor
- descendantCount: O(1) check for active jobs under a directory

Conflict check is now O(depth) where depth is the path depth (typically
3-6 levels), constant regardless of active job count. Benchmark confirms
~81ns per check whether there are 32 or 1024 active jobs.

Also replace the O(n) watermark scan with minActiveTs tracking so
non-oldest job completions are O(1).

Ref: #8771

* filer.sync: replace O(n) watermark rescan with min-heap lazy deletion

Address review feedback:
- Replace minActiveTs O(n) rescan with a tsMinHeap using lazy deletion.
  Each TsNs is pushed once and popped once, giving O(log n) amortized
  watermark tracking regardless of completion order.
- Fix benchmark to consume conflictsWith result via package-level sink
  variable to prevent compiler elision.

The watermark advancement semantics (conservative, sets to completing
job's TsNs) are unchanged from the original code. This is intentionally
safe for idempotent replay on restart.
2026-03-25 15:43:25 -07:00
Andreas Røste
79f4a4579f feat(k8s): added possibility to specify service.type for multiple ser… (#8372)
* feat(k8s): added possibility to specify service.type for multiple services in helm chart

* fix(k8s): removed headless (clusterIP: None) from services

* fix(k8s): keep master and filer services headless for StatefulSet compatibility

Master and filer services must remain headless (clusterIP: None) because
their StatefulSets reference them via serviceName for stable pod DNS.
Revert the service.type change for these two services and remove their
unused service config from values.yaml. S3 and SFTP remain configurable.

---------

Co-authored-by: Andreas Røste <andreas2101@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-25 11:30:14 -07:00
Chris Lu
e47054a7e7 mount: improve small file write performance (#8769)
* mount: defer file creation gRPC to flush time for faster small file writes

When creating a file via FUSE Create(), skip the synchronous gRPC
CreateEntry call to the filer. Instead, allocate the inode and build
the entry locally, deferring the filer create to the Flush/Release path
where flushMetadataToFiler already sends a CreateEntry with chunk data.

This eliminates one synchronous gRPC round-trip per file during creation.
For workloads with many small files (e.g. 30K files), this reduces the
per-file overhead from ~2 gRPC calls to ~1.

Mknod retains synchronous filer creation since it has no file handle
and thus no flush path.

* mount: use bounded worker pool for async flush operations

Replace unbounded goroutine spawning in writebackCache async flush
with a fixed-size worker pool backed by a channel. When many files
are closed rapidly (e.g., cp -r of 30K files), the previous approach
spawned one goroutine per file, leading to resource contention on
gRPC/HTTP connections and high goroutine overhead.

The worker pool size matches ConcurrentWriters (default 128), which
provides good parallelism while bounding resource usage. Work items
are queued into a buffered channel and processed by persistent worker
goroutines.

* mount: fix deferred create cache visibility and async flush race

Three fixes for the deferred create and async flush changes:

1. Insert a local placeholder entry into the metadata cache during
   deferred file creation so that maybeLoadEntry() can find the file
   for duplicate-create checks, stat, and readdir. Uses InsertEntry
   directly (not applyLocalMetadataEvent) to avoid triggering the
   directory hot-threshold eviction that would wipe the entry.

2. Fix race in ReleaseHandle where asyncFlushWg.Add(1) and the
   channel send happened after pendingAsyncFlushMu was unlocked.
   A concurrent WaitForAsyncFlush could observe a zero counter,
   close the channel, and cause a send-on-closed panic. Move Add(1)
   before the unlock; keep the send after unlock to avoid deadlock
   with workers that acquire the same mutex during cleanup.

3. Update TestCreateCreatesAndOpensFile to flush the file handle
   before verifying the CreateEntry gRPC call, since file creation
   is now deferred to flush time.
2026-03-24 20:31:53 -07:00
Chris Lu
28fe92065a S3: reject part uploads after AbortMultipartUpload (#8768)
* S3: reject part uploads after AbortMultipartUpload

PutObjectPartHandler did not verify that the multipart upload session
still exists before accepting parts. After AbortMultipartUpload deleted
the upload directory, the ErrNotFound from getEntry was silently ignored
(treated as "may be non-SSE upload"), allowing parts to be stored as
orphaned files.

Now return ErrNoSuchUpload when the upload directory is not found,
matching AWS S3 behavior.

Fixes #8766

* S3: check upload existence unconditionally in PutObjectPartHandler

Move the getEntry call out of the SSE-type conditional so the upload
existence check runs for all part uploads, including SSE-C. Previously
the SSE-C path skipped the check entirely, allowing parts to be uploaded
after abort when SSE-C headers were present.

Also flattens the nested SSE branching by one level now that getEntry
is called once upfront.

* S3: address PR review feedback for PutObjectPartHandler

- Log at error level when getEntry fails with an unexpected error,
  since we return ErrInternalError to the client
- Distinguish base IV decode errors from length validation failures
  with separate, clearer error messages

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-24 18:11:51 -07:00
Chris Lu
0b3867dca3 filer: add structured error codes to CreateEntryResponse (#8767)
* filer: add FilerError enum and error_code field to CreateEntryResponse

Add a machine-readable error code alongside the existing string error
field. This follows the precedent set by PublishMessageResponse in the
MQ broker proto. The string field is kept for human readability and
backward compatibility.

Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE,
EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS.

* filer: add sentinel errors and error code mapping in filer_pb

Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in
the filer_pb package so both the filer and consumers can reference them
without circular imports.

Add FilerErrorToSentinel() to map proto error codes to sentinels, and
update CreateEntryWithResponse() to check error_code first, falling back
to the string-based path for backward compatibility with old servers.

* filer: return wrapped sentinel errors and set proto error codes

Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and
ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using
%w). This preserves errors.Is() traversal on the server side.

In the gRPC CreateEntry handler, map sentinel errors to the
corresponding FilerError proto codes using errors.Is(), setting both
resp.Error (string, for backward compat) and resp.ErrorCode (enum).

* S3: use errors.Is() with filer sentinels instead of string matching

Replace fragile string-based error matching in filerErrorToS3Error and
other S3 API consumers with errors.Is() checks against filer_pb sentinel
errors. This works because the updated CreateEntryWithResponse helper
reconstructs sentinel errors from the proto FilerError code.

Update iceberg stage_create and metadata_files to check resp.ErrorCode
instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is()
for the already-exists check.

String matching is retained only for non-filer errors (gRPC transport
errors, checksum validation) that don't go through CreateEntryResponse.

* filer: remove backward-compat string fallbacks for error codes

Clients and servers are always deployed together, so there is no need
for backward-compatibility fallback paths that parse resp.Error strings
when resp.ErrorCode is unset. Simplify all consumers to rely solely on
the structured error code.

* iceberg: ensure unknown non-OK error codes are not silently ignored

When FilerErrorToSentinel returns nil for an unrecognized error code,
return an error including the code and message rather than falling
through to return nil.

* filer: fix redundant error message and restore error wrapping in helper

Use request path instead of resp.Error in the sentinel error format
string to avoid duplicating the sentinel message (e.g. "entry already
exists: entry already exists"). Restore %w wrapping with errors.New()
in the fallback paths so callers can use errors.Is()/errors.As().

* filer: promote file to directory on path conflict instead of erroring

S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object)
to coexist because S3 has a flat key space. When ensureParentDirectoryEntry
finds a parent path that is a file instead of a directory, promote it to
a directory by setting ModeDir while preserving the original content and
chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry
type-change guard.

This fixes the S3 compatibility test failures where creating overlapping
keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.
2026-03-24 17:08:22 -07:00
Chris Lu
152884eff2 S3: add s3: prefix to x-amz-* condition keys for AWS compatibility (#8765)
AWS S3 policy conditions reference request headers with the s3: namespace
prefix (e.g., s3:x-amz-server-side-encryption). The extraction code was
storing these headers without the prefix, so bucket policy conditions
using the standard AWS key names would never match.
2026-03-24 14:04:42 -07:00
Chris Lu
2877febd73 S3: fix silent PutObject failure and enforce 1024-byte key limit (#8764)
* S3: add KeyTooLongError error code

Add ErrKeyTooLongError (HTTP 400, code "KeyTooLongError") to match the
standard AWS S3 error for object keys that exceed length limits.

* S3: fix silent PutObject failure when entry name exceeds max_file_name_length

putToFiler called client.CreateEntry() directly and discarded the gRPC
response. The filer embeds application errors like "entry name too long"
in resp.Error (not as gRPC transport errors), so the error was silently
swallowed and clients received HTTP 200 with an ETag for objects that
were never stored.

Switch to the filer_pb.CreateEntry() helper which properly checks
resp.Error, and map "entry name too long" to KeyTooLongError (HTTP 400).

To avoid fragile string parsing across the gRPC boundary, define shared
error message constants in weed/util/constants and use them in both the
filer (producing errors) and S3 API (matching errors). Switch
filerErrorToS3Error to use strings.Contains/HasSuffix with these
constants so matches work regardless of any wrapper prefix. Apply
filerErrorToS3Error to the mkdir path for directory markers.

Fixes #8759

* S3: enforce 1024-byte maximum object key length

AWS S3 limits object keys to 1024 bytes. Add early validation on write
paths (PutObject, CopyObject, CreateMultipartUpload) to reject keys
exceeding the limit with the standard KeyTooLongError (HTTP 400).

The key length check runs before bucket auto-creation to prevent
overlong keys from triggering unnecessary side effects.

Also use filerErrorToS3Error for CopyObject's mkFile error paths so
name-too-long errors from the filer return KeyTooLongError instead of
InternalError.

Ref #8758

* S3: add handler-level tests for key length validation and error mapping

Add tests for filerErrorToS3Error mapping "entry name too long" to
KeyTooLongError, including a regression test for the CreateEntry-prefixed
"existing ... is a directory" form. Add handler-level integration tests
that exercise PutObjectHandler, CopyObjectHandler, and
NewMultipartUploadHandler via httptest, verifying HTTP 400 and
KeyTooLongError XML response for overlong keys and acceptance of keys at
the 1024-byte limit.
2026-03-24 13:35:28 -07:00
Chris Lu
7f3f61ea28 fix: resolve Kafka gateway response deadlocks causing Sarama client hangs (#8762)
* fix: resolve Kafka gateway response deadlocks causing Sarama client hangs

Fix three bugs in the Kafka protocol handler that caused sequential
clients (notably Sarama) to hang during E2E tests:

1. Race condition in correlation queue ordering: the correlation ID was
   added to the response ordering queue AFTER sending the request to
   the processing channel. A fast processor (e.g. ApiVersions) could
   finish and send its response before the ID was in the queue, causing
   the response writer to miss it — permanently deadlocking the
   connection. Now the ID is added BEFORE the channel send, with error
   response injection on send failure.

2. Silent error response drops: when processRequestSync returned an
   error, the response writer logged it but never sent anything back to
   the client. The client would block forever waiting for bytes that
   never arrived. Now sends a Kafka UNKNOWN_SERVER_ERROR response.

3. Produce V0/V1 missing timeout_ms parsing: the handler skipped the
   4-byte timeout field, reading it as topicsCount instead. This caused
   incorrect parsing of the entire produce request for V0/V1 clients.

* fix: API-versioned error responses, unsupported-version queue fix, V0V1 header alignment

1. errors.go — BuildAPIErrorResponse: emits a minimal-but-valid error
   body whose layout matches the schema the client expects for each API
   key and version (throttle_time position, array fields, etc.).  The
   old 2-byte generic body corrupted the protocol stream for APIs whose
   response begins with throttle_time_ms or an array.

2. handler.go — unsupported-version path: the correlationID was never
   added to correlationQueue before sending to responseChan, so the
   response writer could never match it and the client hung.  Now
   appends the ID under correlationQueueMu before the send.

3. produce.go — handleProduceV0V1: requestBody is already post-header
   (HandleConn strips client_id).  The handler was erroneously parsing
   acks bytes as a client_id length, misaligning all subsequent field
   reads.  Removed the client_id parsing; offset now starts at 0 with
   acks(2) + timeout_ms(4) + topicsCount(4), matching handleProduceV2Plus.

* fix: free pooled message buffer per-iteration instead of deferring

The read loop allocated messageBuf via mem.Allocate and deferred
mem.Free.  Since the defer only runs when HandleConn returns, pool
buffers accumulated for the entire connection lifetime — one per
request.  Worse, the deferred frees ran in LIFO order before
wg.Wait(), so processing goroutines could read from already-freed
pool buffers.

Now: read into a pooled buffer, immediately copy to Go-managed
memory, and return the pool buffer.  messageBuf is a regular slice
safe for async goroutine access with no defer accumulation.

* fix: cancel context before wg.Wait and on worker response-send timeout

Two related issues:

1. Cleanup defer ordering deadlock: defers run LIFO — the cleanup defer
   (close channels, wg.Wait) ran before the cancel() defer.  The
   response writer is in the WaitGroup and exits only on ctx.Done() or
   responseChan close, but both signals came after wg.Wait().  Deadlock
   on every normal connection close (EOF, read error, queue-full).
   Fix: call cancel() at the start of the cleanup defer, before
   wg.Wait().

2. Worker 5s response-send timeout: when the timeout fired, the
   response was silently dropped but the correlationID remained in the
   ordered queue.  The response writer could never advance past it,
   stalling all subsequent responses permanently.
   Fix: call cancel() to tear down the connection — if we cannot
   deliver a response in 5s the connection is irrecoverable.

* chore: remove empty no-op ListOffsets conditional

The `if apiKey == 2 {}` block had no body — leftover debug code.
ListOffsets routing is handled by isDataPlaneAPI (returns false,
sending it to the control channel). No behavior change.
2026-03-24 13:17:25 -07:00