Commit Graph

8590 Commits

Author SHA1 Message Date
Anton
90277ceed5 admin/plugin: migrate inline job details asynchronously to avoid slow startup (#8721)
loadPersistedMonitorState performed a backward-compatibility migration that
wrote every job with inline rich detail fields to a dedicated per-job detail
file synchronously during startup. On deployments with many historical jobs
(e.g. 1000+) stored on distributed block storage (e.g. Longhorn), each
individual file write requires an fsync round-trip, making startup
disproportionately slow and causing readiness/liveness probe failures.

The in-memory state is populated correctly before the goroutine is started
because stripTrackedJobDetailFields is still called in-place; only the disk
writes are deferred. A completion log message at V(1) is emitted once the
background migration finishes.

Co-authored-by: Anton Ustyugov <anton@devops>
2026-03-21 14:18:42 -07:00
Anton
ae170f1fbb admin: fix manual job run to use scheduler dispatch with capacity management and retry (#8720)
RunPluginJobTypeAPI previously executed proposals with a naive sequential loop
calling ExecutePluginJob per proposal. This had two bugs:

1. Double-lock: RunPluginJobTypeAPI held pluginLock while calling ExecutePluginJob,
   which tried to re-acquire the same lock for every job in the loop.

2. No capacity management: proposals were fired directly at workers without
   reserveScheduledExecutor, so every job beyond the worker concurrency limit
   received an immediate at_capacity error with no retry or backoff.

Fix: add Plugin.DispatchProposals which reuses dispatchScheduledProposals - the
same code path the scheduler loop uses - with executor reservation, configurable
concurrency, and per-job retry with backoff. RunPluginJobTypeAPI now calls
DispatchPluginProposals (a thin AdminServer wrapper) after holding pluginLock once.

Co-authored-by: Anton Ustyugov <anton@devops>
2026-03-21 14:09:31 -07:00
Anton
8e7b15a995 wdclient/exclusive_locks: replace println with glog in ExclusiveLocker (#8723)
RequestLock used a bare println to report transient lock acquisition failures
('lock: already locked by ...'), which writes directly to stdout instead of
going through the structured logging pipeline. This causes log noise at the
wrong level and cannot be filtered with -v or redirected like glog output.

Changes:
- println("lock:", ...) -> glog.V(2).Infof for per-retry acquisition errors
  (transient, high-frequency during startup when another instance still holds)
- Add glog.V(1).Infof when the lock is successfully acquired
- Add glog.V(2).Infof for successful renewals (replaces commented-out println)
- Errorf -> Warningf for renewal failures (the goroutine exits cleanly, it is
  not a fatal error; the caller will re-acquire via RequestLock)

Co-authored-by: Anton Ustyugov <anton@devops>
2026-03-21 13:36:26 -07:00
Chris Lu
7fbdb9b7b7 feat(shell): add volume.tier.compact command to reclaim cloud storage space (#8715)
* feat(shell): add volume.tier.compact command to reclaim cloud storage space

Adds a new shell command that automates compaction of cloud tier volumes.
When files are deleted from remote-tiered volumes, space is not reclaimed
on the cloud storage. This command orchestrates: download from remote,
compact locally, and re-upload to reclaim deleted space.

Closes #8563

* fix: log cleanup errors in compactVolumeOnServer instead of discarding them

Helps operators diagnose leftover temp files (.cpd/.cpx) if cleanup
fails after a compaction or commit failure.

* fix: return aggregate error from loop and use regex for collection filter

- Track and return error count when one or more volumes fail to compact,
  so callers see partial failures instead of always getting nil.
- Use compileCollectionPattern for -collection in -volumeId mode too, so
  regex patterns work consistently with the flag description. Empty
  pattern (no -collection given) matches all collections.
2026-03-20 23:52:12 -07:00
JARDEL ALVES
1413822424 glog: add JSON structured logging mode (#8708)
* glog: add JSON structured logging mode

Add opt-in JSON output format for glog, enabling integration with
log aggregation systems like ELK, Loki, and Datadog.

- Add --log_json flag to enable JSON output at startup
- Add SetJSONMode()/IsJSONMode() for runtime toggle
- Add JSON branches in println, printDepth, printf, printWithFileLine
- Use manual JSON construction (no encoding/json) for performance
- Add jsonEscapeString() for safe string escaping
- Include 8 unit tests and 1 benchmark

Enabled via --log_json flag. Default behavior unchanged.

* glog: prevent lazy flag init from overriding SetJSONMode

If --log_json=true and SetJSONMode(false) was called at runtime,
a subsequent IsJSONMode() call would re-enable JSON mode via the
sync.Once lazy initialization. Mark jsonFlagOnce as done inside
SetJSONMode so the runtime API always takes precedence.

* glog: fix RuneError check to not misclassify valid U+FFFD

The condition r == utf8.RuneError matches both invalid UTF-8
sequences (size=1) and a valid U+FFFD replacement character
(size=3). Without checking size == 1, a valid U+FFFD input
would be incorrectly escaped and only advance by 1 byte,
corrupting the output.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-19 22:01:09 -07:00
JARDEL ALVES
5f2244d25d glog: add gzip compression for rotated log files (#8709)
* glog: add gzip compression for rotated log files

Add opt-in gzip compression that automatically compresses log files
after rotation, reducing disk usage in long-running deployments.

- Add --log_compress flag to enable compression at startup
- Add SetCompressRotated()/IsCompressRotated() for runtime toggle
- Compress rotated files in background goroutine (non-blocking)
- Use gzip.BestSpeed for minimal CPU overhead
- Fix .gz file cleanup: TrimSuffix approach correctly counts
  compressed files toward MaxFileCount limit
- Include 6 unit tests covering normal, empty, large, and edge cases

Enabled via --log_compress flag. Default behavior unchanged.

* glog: fix compressFile to check gz/dst close errors and use atomic rename

Write to a temp file (.gz.tmp) and rename atomically to prevent
exposing partial archives. Check gz.Close() and dst.Close() errors
to avoid deleting the original log when flush fails (e.g. ENOSPC).
Use defer for robust resource cleanup.

* glog: deduplicate .log/.log.gz pairs in rotation cleanup

During concurrent compression, both foo.log and foo.log.gz can exist
simultaneously. Count them as one entry against MaxFileCount to prevent
premature eviction of rotated logs.

* glog: use portable temp path in TestCompressFile_NonExistent

Replace hardcoded /nonexistent/path with t.TempDir() for portability.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-19 21:35:08 -07:00
Chris Lu
51ec0d2122 fix(remote_gateway): prevent double-versioning when syncing to versioned central bucket (#8710)
* fix(remote_gateway): prevent double-versioning when syncing to versioned central bucket

When a file is uploaded to a versioned bucket on edge, SeaweedFS stores
it internally as {object}.versions/v_{versionId}. The remote_gateway was
syncing this internal path directly to the central S3 endpoint. When
central's bucket also has versioning enabled, this caused central to
apply its own versioning on top, producing corrupt paths like:
  object.versions/v_{edgeId}.versions/v_{centralId}

Fix: rewrite internal .versions/v_{id} paths to the original S3 object
key before uploading to the remote. Skip version file delete/update
events that are internal bookkeeping.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8481#discussioncomment-16209342

* fix(remote_gateway): propagate delete markers to remote as deletions

Delete markers are zero-content version entries (ExtDeleteMarkerKey=true)
created by S3 DELETE on a versioned bucket. Previously they were silently
dropped by the HasData() filter, so deletions on edge never reached
central.

Now: detect delete markers before the HasData check, rewrite the
.versions path to the original S3 key, and issue client.DeleteFile()
on the remote.

* fix(remote_gateway): tighten isVersionedPath to avoid false positives

Address PR review feedback:

- Add isDir parameter to isVersionedPath so it only matches the exact
  internal shapes: directories whose name ends with .versions (isDir=true),
  and files with the v_ prefix inside a .versions parent (isDir=false).
  Previously the function was too broad and could match user-created paths
  like "my.versions/data.txt".

- Update all 4 call sites to pass the entry's IsDirectory field.

- Rename TestVersionedDirectoryNotFilteredByHasData to
  TestVersionsDirectoryFilteredByHasData so the name reflects the
  actual assertion (directories ARE filtered by HasData).

- Expand TestIsVersionedPath with isDir cases and false-positive checks.

* fix(remote_gateway): persist sync marker after delete-marker propagation

The delete-marker branch was calling client.DeleteFile() and returning
without updating the local entry, making event replay re-issue the
remote delete. Now call updateLocalEntry after a successful DeleteFile
to stamp the delete-marker entry with a RemoteEntry, matching the
pattern used by the normal create path.

* refactor(remote_gateway): extract syncDeleteMarker and fix root path edge case

- Extract syncDeleteMarker() shared helper used by both bucketed and
  mounted-dir event processors, replacing the duplicated delete + persist
  local marker logic.

- Fix rewriteVersionedSourcePath for root-level objects: when lastSlash
  is 0 (e.g. "/file.xml.versions"), return "/" as the parent dir instead
  of an empty string.

- The strings.Contains(dir, ".versions/") condition flagged in review was
  already removed in a prior commit that tightened isVersionedPath.

* fix(remote_gateway): skip updateLocalEntry for versioned path rewrites

After rewriting a .versions/v_{id} path to the logical S3 key and
uploading, the code was calling updateLocalEntry on the original v_*
entry, stamping it with a RemoteEntry for the logical key. This is
semantically wrong: the logical object has no filer entry in versioned
buckets, and the internal v_* entry should not carry a RemoteEntry for
a different path.

Skip updateLocalEntry when the path was rewritten from a versioned
source. Replay safety is preserved because S3 PutObject is idempotent.

* fix(remote_gateway): scope versioning checks to /buckets/ namespace

isVersionedPath and rewriteVersionedSourcePath could wrongly match
paths in non-bucket mounts (e.g. /mnt/remote/file.xml.versions).

Add the same /buckets/ prefix guard used by isMultipartUploadDir so
the .versions / v_ logic only applies within the bucket namespace.
2026-03-19 21:18:52 -07:00
Chris Lu
6ccda3e809 fix(s3): allow deleting the anonymous user from admin webui (#8706)
Remove the block that prevented deleting the "anonymous" identity
and stop auto-creating it when absent.  If no anonymous identity
exists (or it is disabled), LookupAnonymous returns not-found and
both auth paths return ErrAccessDenied for anonymous requests.

To enable anonymous access, explicitly create the "anonymous" user.
To revoke it, delete the user like any other identity.

Closes #8694
2026-03-19 18:10:20 -07:00
Chris Lu
08b79a30f6 Fix lock table shared wait condition (#8707)
* Fix lock table shared wait condition (#8696)

* Refactor lock table waiter check

* Add exclusive lock wait helper test
2026-03-19 16:08:24 -07:00
Chris Lu
80f3079d2a fix(s3): include directory markers in ListObjects without delimiter (#8704)
* fix(s3): include directory markers in ListObjects without delimiter (#8698)

Directory key objects (zero-byte objects with keys ending in "/") created
via PutObject were omitted from ListObjects/ListObjectsV2 results when no
delimiter was specified. AWS S3 includes these as regular keys in Contents.

The issue was in doListFilerEntries: when recursing into directories in
non-delimiter mode, directory key objects were only emitted when
prefixEndsOnDelimiter was true. Added an else branch to emit them in the
general recursive case as well.

* remove issue reference from inline comment

* test: add child-under-marker and paginated listing coverage

Extend test 6 to place a child object under the directory marker
and paginate with MaxKeys=1 so the emit-then-recurse truncation
path is exercised.

* fix(test): skip directory markers in Spark temporary artifacts check

The listing check now correctly shows directory markers (keys ending
in "/") after the ListObjects fix. These 0-byte metadata objects are
not data artifacts — filter them from the listing check since the
HeadObject-based check already verifies their cleanup with a timeout.
2026-03-19 15:36:11 -07:00
Chris Lu
15f4a97029 fix: improve raft leader election reliability and failover speed (#8692)
* fix: clear raft vote state file on non-resume startup

The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.

Fixes #8690

* fix: recover TopologyId from snapshot before clearing raft state

When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.

Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.

* fix: stagger multi-master bootstrap delay by peer index

Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.

Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.

* fix: skip unreachable masters during leader reconnection

When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.

Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.

* fix: take snapshot after TopologyId generation for recovery

After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.

* test: add multi-master raft failover integration tests

Integration test framework and 5 test scenarios for 3-node master
clusters:

- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
  TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
  old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
  continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
  when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
  all nodes agree on TopologyId

* fix: address PR review comments

- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them

* fix: address review comments on failover tests

- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
  ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
2026-03-18 23:28:07 -07:00
Chris Lu
c197206897 fix(s3): return ETag header for directory marker PutObject requests (#8688)
* fix(s3): return ETag header for directory marker PutObject requests

The PutObject handler has a special path for keys ending with "/" (directory
markers) that creates the entry via mkdir. This path never computed or set
the ETag response header, unlike the regular PutObject path. AWS S3 always
returns an ETag header, even for empty-body puts.

Compute the MD5 of the content (empty or otherwise), store it in the entry
attributes and extended attributes, and set the ETag response header.

Fixes #8682

* fix: handle io.ReadAll error and chunked encoding for directory markers

Address review feedback:
- Handle error from io.ReadAll instead of silently discarding it
- Change condition from ContentLength > 0 to ContentLength != 0 to
  correctly handle chunked transfer encoding (ContentLength == -1)

* fix hanging tests
2026-03-18 17:26:33 -07:00
Jayshan Raghunandan
1f1eac4f08 feat: improve aio support for admin/volume ingress and fix UI links (#8679)
* feat: improve allInOne mode support for admin/volume ingress and fix master UI links

- Add allInOne support to admin ingress template, matching the pattern
  used by filer and s3 ingress templates (or-based enablement with
  ternary service name selection)
- Add allInOne support to volume ingress template, which previously
  required volume.enabled even when the volume server runs within the
  allInOne pod
- Expose admin ports in allInOne deployment and service when
  allInOne.admin.enabled is set
- Add allInOne.admin config section to values.yaml (enabled by default,
  ports inherit from admin.*)
- Fix legacy master UI templates (master.html, masterNewRaft.html) to
  prefer PublicUrl over internal Url when linking to volume server UI.
  The new admin UI already handles this correctly.

* fix: revert admin allInOne changes and fix PublicUrl in admin dashboard

The admin binary (`weed admin`) is a separate process that cannot run
inside `weed server` (allInOne mode). Revert the admin-related allInOne
helm chart changes that caused 503 errors on admin ingress.

Fix bug in cluster_topology.go where VolumeServer.PublicURL was set to
node.Id (internal pod address) instead of the actual public URL. Add
public_url field to DataNodeInfo proto message so the topology gRPC
response carries the public URL set via -volume.publicUrl flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use HTTP /dir/status to populate PublicUrl in admin dashboard

The gRPC DataNodeInfo proto does not include PublicUrl, so the admin dashboard showed internal pod IPs instead of the configured public URL.
Fetch PublicUrl from the master's /dir/status HTTP endpoint and apply it
in both GetClusterTopology and GetClusterVolumeServers code paths.

Also reverts the unnecessary proto field additions from the previous
commit and cleans up a stray blank line in all-in-one-service.yml.

* fix: apply PublicUrl link fix to masterNewRaft.html

Match the same conditional logic already applied to master.html —
prefer PublicUrl when set and different from Url.

* fix: add HTTP timeout and status check to fetchPublicUrlMap

Use a 5s-timeout client instead of http.DefaultClient to prevent
blocking indefinitely when the master is unresponsive. Also check
the HTTP status code before attempting to parse the response body.

* fix: fall back to node address when PublicUrl is empty

Prevents blank links in the admin dashboard when PublicUrl is not
configured, such as in standalone or mixed-version clusters.

* fix: log io.ReadAll error in fetchPublicUrlMap

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-18 13:20:55 -07:00
JARDEL ALVES
bd3a6b1b33 glog: add --log_rotate_hours flag for time-based log rotation (#8685)
* glog: add --log_rotate_hours flag for time-based log rotation

SeaweedFS previously only rotated log files when they reached MaxSize
(1.8 GB). Long-running deployments with low log volume could accumulate
log files indefinitely with no way to force rotation on a schedule.

This change adds the --log_rotate_hours flag. When set to a non-zero
value, the current log file is rotated once it has been open for the
specified number of hours, regardless of its size.

Implementation details:
- New flag --log_rotate_hours (int, default 0 = disabled) in glog_file.go
- Added createdAt time.Time field to syncBuffer to track file open time
- rotateFile() sets createdAt to the time the new file is opened
- Write() checks elapsed time and triggers rotation when the threshold
  is exceeded, consistent with the existing size-based check

This resolves the long-standing request for time-based rotation and
helps prevent unbounded log accumulation in /tmp on production systems.

Related: #3455, #5763, #8336

* glog: default log_rotate_hours to 168 (7 days)

Enable time-based rotation by default so log files don't accumulate
indefinitely in long-running deployments. Set to 0 to disable.

* glog: simplify rotation logic by combining size and time conditions

Merge the two separate rotation checks into a single block to
eliminate duplicated rotateFile error handling.

* glog: use timeNow() in syncBuffer.Write and add time-based rotation test

Use the existing testable timeNow variable instead of time.Now() in
syncBuffer.Write so that time-based rotation can be tested with a
mocked clock.

Add TestTimeBasedRollover that verifies:
- no rotation occurs before the interval elapses
- rotation triggers after the configured hours

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-18 13:19:14 -07:00
Weihao Jiang
01987bcafd Make weed-fuse compatible with systemd-based mount (#6814)
* Make weed-fuse compatible with systemd-mount series

* fix: add missing type annotation on skipAutofs param in FreeBSD build

The parameter was declared without a type, causing a compile error on FreeBSD.

* fix: guard hasAutofs nil dereference and make FsName conditional on autofs mode

- Check option.hasAutofs for nil before dereferencing to prevent panic
  when RunMount is called without the flag initialized.
- Only set FsName to "fuse" when autofs mode is active; otherwise
  preserve the descriptive server:path name for mount/df output.
- Fix typo: recogize -> recognize.

* fix: consistent error handling for autofs option and log ignored _netdev

- Replace panic with fmt.Fprintf+return false for autofs parse errors,
  matching the pattern used by other fuse option parsers.
- Log when _netdev option is silently stripped to aid debugging.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-18 12:18:40 -07:00
JARDEL ALVES
98f301e30b glog: add --log_max_size_mb and --log_max_files runtime flags (#8684) 2026-03-18 06:59:22 -07:00
Chris Lu
81369b8a83 improve: large file sync throughput for remote.cache and filer.sync (#8676)
* improve large file sync throughput for remote.cache and filer.sync

Three main throughput improvements:

1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file
   instead of always starting at 5MB. A 500MB file now uses ~16MB chunks
   (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk
   overhead (volume assign, gRPC call, needle write) by 3x.

2. Configurable concurrency at every layer:
   - remote.cache chunk concurrency: -chunkConcurrency flag (default 8)
   - remote.cache S3 download concurrency: -downloadConcurrency flag
     (default raised from 1 to 5 per chunk)
   - filer.sync chunk concurrency: -chunkConcurrency flag (default 32)

3. S3 multipart download concurrency raised from 1 to 5: the S3 manager
   downloader was using Concurrency=1, serializing all part downloads
   within each chunk. This alone can 5x per-chunk download speed.

The concurrency values flow through the gRPC request chain:
  shell command → CacheRemoteObjectToLocalClusterRequest →
  FetchAndWriteNeedleRequest → S3 downloader

Zero values in the request mean "use server defaults", maintaining
full backward compatibility with existing callers.

Ref #8481

* fix: use full maxMB for chunk size cap and remove loop guard

Address review feedback:
- Use full maxMB instead of maxMB/2 for maxChunkSize to avoid
  unnecessarily limiting chunk size for very large files.
- Remove chunkSize < maxChunkSize guard from the safety loop so it
  can always grow past maxChunkSize when needed to stay under 1000
  chunks (e.g., extremely large files with small maxMB).

* address review feedback: help text, validation, naming, docs

- Fix help text for -chunkConcurrency and -downloadConcurrency flags
  to say "0 = server default" instead of advertising specific numeric
  defaults that could drift from the server implementation.
- Validate chunkConcurrency and downloadConcurrency are within int32
  range before narrowing, returning a user-facing error if out of range.
- Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions.
- Add doc comment to SetChunkConcurrency noting it must be called
  during initialization before replication goroutines start.
- Replace doubling loop in chunk size safety check with direct
  ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap.

* address Copilot review: clamp concurrency, fix chunk count, clarify proto docs

- Use ceiling division for chunk count check to avoid overcounting
  when file size is an exact multiple of chunk size.
- Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024
  at filer, max 64 at volume server) to prevent excessive goroutines.
- Always use ReadFileWithConcurrency when the client supports it,
  falling back to the implementation's default when value is 0.
- Clarify proto comments that download_concurrency only applies when
  the remote storage client supports it (currently S3).
- Include specific server defaults in help text (e.g., "0 = server
  default 8") so users see the actual values in -h output.

* fix data race on executionErr and use %w for error wrapping

- Protect concurrent writes to executionErr in remote.cache worker
  goroutines with a sync.Mutex to eliminate the data race.
- Use %w instead of %v in volume_grpc_remote.go error formatting
  to preserve the error chain for errors.Is/errors.As callers.
2026-03-17 16:49:56 -07:00
Chris Lu
f4073107cb fix: clean up orphaned needles on remote.cache partial download failure (#8675)
When remote.cache downloads a file in parallel chunks and a gRPC
connection drops mid-transfer, chunks already written to volume servers
were not cleaned up. Since the filer metadata was never updated, these
needles became orphaned — invisible to volume.vacuum and never
referenced by the filer. On subsequent cache cycles the file was still
treated as uncached, creating more orphans each attempt.

Call DeleteUncommittedChunks on the download-error path, matching the
cleanup already present for the metadata-update-failure path.

Fixes #8481
2026-03-17 13:47:54 -07:00
Chris Lu
55e988a7ee iceberg: add sort-aware compaction rewrite (#8666)
* iceberg: add sort-aware compaction rewrite

* iceberg: share filtered row iteration in compaction

* iceberg: rely on table sort order for sort rewrites

* iceberg: harden sort compaction planning

* iceberg: include rewrite strategy in planning config hash

compactionPlanningConfigHash now incorporates RewriteStrategy and
SortMaxInputBytes so cached planning results are invalidated when
sort strategy settings change. Also use the bytesPerMB constant in
compactionNoEligibleMessage.
2026-03-17 00:57:32 -07:00
Chris Lu
e5c0889473 iceberg: add delete file rewrite maintenance (#8664)
* iceberg: add delete file rewrite maintenance

* iceberg: preserve untouched delete files during rewrites

* iceberg: share detection threshold defaults

* iceberg: add partition-scoped maintenance filters (#8665)

* iceberg: add partition-scoped maintenance filters

* iceberg: tighten where-filter partition matching
2026-03-16 21:11:09 -07:00
Chris Lu
a3717cd4b5 fix(admin): show anonymous user in Object Store Users UI (#8671)
The anonymous identity was explicitly filtered out of the user listing,
making it invisible in the admin console. Users could not view or edit
its permissions. Attempting to recreate it failed with "already exists".

Remove the anonymous skip in GetObjectStoreUsers so it appears like any
other identity. Add a guard in DeleteObjectStoreUser to prevent deletion
of the anonymous system identity, which would break unauthenticated S3
access.

Fixes #8466

Co-authored-by: Copilot <copilot@github.com>
2026-03-16 19:54:57 -07:00
Chris Lu
6e45fc0055 iceberg: cache detection planning results (#8667)
* iceberg: cache detection planning results

* iceberg: tighten planning index cache handling

* iceberg: remove dead planning-index metadata fallback

* iceberg: preserve partial planning index caches

* iceberg: scope planning index caching per op

* iceberg: rename copy vars to avoid shadowing builtin
2026-03-16 19:54:01 -07:00
Chris Lu
f71cef2dc8 iceberg: add resource-group proposal controls (#8668)
* iceberg: add resource-group proposal controls

* iceberg: tighten resource group config validation
2026-03-16 16:45:00 -07:00
Chris Lu
7c83460b10 adjust template path 2026-03-16 15:29:12 -07:00
Chris Lu
e8914ac879 feat(admin): add -urlPrefix flag for subdirectory deployment (#8670)
Allow the admin server to run behind a reverse proxy under a
subdirectory by adding a -urlPrefix flag (e.g. -urlPrefix=/seaweedfs).

Closes #8646
2026-03-16 15:26:02 -07:00
Chris Lu
9984ce7dcb fix(s3): omit NotResource:null from bucket policy JSON response (#8658)
* fix(s3): omit NotResource:null from bucket policy JSON response (#8657)

Change NotResource from value type to pointer (*StringOrStringSlice) so
that omitempty properly omits it when unset, matching the existing
Principal field pattern. This prevents IaC tools (Terraform, Ansible)
from detecting false configuration drift.

Add bucket policy round-trip idempotency integration tests.

* simplify JSON comparison in bucket policy idempotency test

Use require.JSONEq directly on the raw JSON strings instead of
round-tripping through unmarshal/marshal, since JSONEq already
handles normalization internally.

* fix bucket policy test cases that locked out the admin user

The Deny+NotResource test cases used Action:"s3:*" which denied the
admin's own GetBucketPolicy call. Scope deny to s3:GetObject only,
and add an Allow+NotResource variant instead.

* fix(s3): also make Resource a pointer to fix empty string in JSON

Apply the same omitempty pointer fix to the Resource field, which was
emitting "Resource":"" when only NotResource was set. Add
NewStringOrStringSlicePtr helper, make Strings() nil-safe, and handle
*StringOrStringSlice in normalizeToStringSliceWithError.

* improve bucket policy integration tests per review feedback

- Replace time.Sleep with waitForClusterReady using ListBuckets
- Use structural hasKey check instead of brittle substring NotContains
- Assert specific NoSuchBucketPolicy error code after delete
- Handle single-statement policies in hasKey helper
2026-03-16 12:58:26 -07:00
Chris Lu
acea36a181 filer: add conditional update preconditions (#8647)
* filer: add conditional update preconditions

* iceberg: tighten metadata CAS preconditions
2026-03-16 12:33:32 -07:00
Chris Lu
6b2b442450 iceberg: detect maintenance work per operation (#8639)
* iceberg: detect maintenance work per operation

* iceberg: ignore delete manifests during detection

* iceberg: clean up detection maintenance planning

* iceberg: tighten detection manifest heuristics

* Potential fix for code scanning alert no. 330: Incorrect conversion between integer types

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* iceberg: tolerate per-operation detection errors

* iceberg: fix fake metadata location versioning

* iceberg: check snapshot expiry before manifest loads

* iceberg: make expire-snapshots switch case explicit

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-03-15 20:21:50 -07:00
Chris Lu
a00eddb525 Iceberg table maintenance Phase 3: multi-spec compaction, delete handling, and metrics (#8643)
* Add multi-partition-spec compaction and delete-aware compaction (Phase 3)

Multi-partition-spec compaction:
- Add SpecID to compactionBin struct and group by spec+partition key
- Remove the len(specIDs) > 1 skip that blocked spec-evolved tables
- Write per-spec manifests in compaction commit using specByID map
- Use per-bin PartitionSpec when calling NewDataFileBuilder

Delete-aware compaction:
- Add ApplyDeletes config (default: true) with readBoolConfig helper
- Implement position delete collection (file_path + pos Parquet columns)
- Implement equality delete collection (field ID to column mapping)
- Update mergeParquetFiles to filter rows via position deletes (binary
  search) and equality deletes (hash set lookup)
- Smart delete manifest carry-forward: drop when all data files compacted
- Fix EXISTING/DELETED entries to include sequence numbers

Tests for multi-spec bins, delete collection, merge filtering, and
end-to-end compaction with position/equality/mixed deletes.

* Add structured metrics and per-bin progress to iceberg maintenance

- Change return type of all four operations from (string, error) to
  (string, map[string]int64, error) with structured metric counts
  (files_merged, snapshots_expired, orphans_removed, duration_ms, etc.)
- Add onProgress callback to compactDataFiles for per-bin progress
- In Execute, pass progress callback that sends JobProgressUpdate with
  per-bin stage messages
- Accumulate per-operation metrics with dot-prefixed keys
  (e.g. compact.files_merged) into OutputValues on completion
- Update testing_api.go wrappers and integration test call sites
- Add tests: TestCompactDataFilesMetrics, TestExpireSnapshotsMetrics,
  TestExecuteCompletionOutputValues

* Address review feedback: group equality deletes by field IDs, use metric constants

- Group equality deletes by distinct equality_ids sets so different
  delete files with different equality columns are handled correctly
- Use length-prefixed type-aware encoding in buildEqualityKey to avoid
  ambiguity between types and collisions from null bytes
- Extract metric key strings into package-level constants

* Fix buildEqualityKey to use length-prefixed type-aware encoding

The previous implementation used plain String() concatenation with null
byte separators, which caused type ambiguity (int 123 vs string "123")
and separator collisions when values contain null bytes. Now each value
is serialized as "kind:length:value" for unambiguous composite keys.

This fix was missed in the prior cherry-pick due to a merge conflict.

* Address nitpick review comments

- Document patchManifestContentToDeletes workaround: explain that
  iceberg-go WriteManifest cannot create delete manifests, and note
  the fail-fast validation on pattern match
- Document makeTestEntries: note that specID field is ignored and
  callers should use makeTestEntriesWithSpec for multi-spec testing

* fmt

* Fix path normalization, manifest threshold, and artifact filename collisions

- Normalize file paths in position delete collection and lookup so that
  absolute S3 URLs and relative paths match correctly
- Fix rewriteManifests threshold check to count only data manifests
  (was including delete manifests in the count and metric)
- Add random suffix to artifact filenames in compactDataFiles and
  rewriteManifests to prevent collisions between concurrent runs
- Sort compaction bins by SpecID then PartitionKey for deterministic
  ordering across specs

* Fix pos delete read, deduplicate column resolution, minor cleanups

- Remove broken Column() guard in position delete reading that silently
  defaulted pos to 0; unconditionally extract Int64() instead
- Deduplicate column resolution in readEqualityDeleteFile by calling
  resolveEqualityColIndices instead of inlining the same logic
- Add warning log in readBoolConfig for unrecognized string values
- Fix CompactDataFiles call site in integration test to capture 3 return
  values

* Advance progress on all bins, deterministic manifest order, assert metrics

- Call onProgress for every bin iteration including skipped/failed bins
  so progress reporting never appears stalled
- Sort spec IDs before iterating specEntriesMap to produce deterministic
  manifest list ordering across runs
- Assert expected metric keys in CompactDataFiles integration test

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-15 19:18:14 -07:00
Chris Lu
e24630251c iceberg: handle filer-backed compaction inputs (#8638)
* iceberg: handle filer-backed compaction inputs

* iceberg: preserve upsert creation times

* iceberg: align compaction test schema

* iceberg: tighten compact output assertion

* iceberg: document compact output match

* iceberg: clear stale chunks in upsert helper

* iceberg: strengthen compaction integration coverage
2026-03-15 17:46:06 -07:00
Chris Lu
0afc675a55 iceberg: validate filer failover targets (#8637)
* iceberg: validate filer failover targets

* iceberg: tighten filer liveness checks

* iceberg: relax filer test readiness deadline
2026-03-15 17:45:55 -07:00
Chris Lu
b868980260 fix(remote): don't send empty StorageClass in S3 uploads (#8645)
When S3StorageClass is empty (the default), aws.String("") was passed
as the StorageClass in PutObject requests. While AWS S3 treats this as
"use default," S3-compatible providers (e.g. SharkTech) reject it with
InvalidStorageClass. Only set StorageClass when a non-empty value is
configured, letting the provider use its default.

Fixes #8644
2026-03-15 13:32:48 -07:00
Chris Lu
5f19e3259f iceberg: keep split bins within target size (#8640) 2026-03-15 12:48:31 -07:00
Chris Lu
d9d6707401 Change iceberg compaction target file size config from bytes to MB (#8636)
Change iceberg target_file_size config from bytes to MB

Rename the config field from target_file_size_bytes to
target_file_size_mb with a default of 256 (MB). The value is
converted to bytes internally. This makes the config more
user-friendly — entering 256 is clearer than 268435456.

Co-authored-by: Copilot <copilot@github.com>
2026-03-15 11:42:06 -07:00
Chris Lu
8cde3d4486 Add data file compaction to iceberg maintenance (Phase 2) (#8503)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add data file compaction to iceberg maintenance handler (Phase 2)

Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata

Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix memory exhaustion in mergeParquetFiles by processing files sequentially

Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.

* Validate bucket/namespace/table names against path traversal

Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.

* Add filer address failover in iceberg maintenance handler

Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.

* Add separate MinManifestsToRewrite config for manifest rewrite threshold

The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.

* Fix risky mtime fallback in orphan removal that could delete new files

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.

* Fix undefined function references in iceberg_maintenance_handler.go

Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.

* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage

The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.

Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.

* Fix MinManifestsToRewrite clamping to match UI minimum of 2

The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.

* Sort entries by size descending in splitOversizedBin for better packing

Entries were processed in insertion order which is non-deterministic
from map iteration. Sorting largest-first before the splitting loop
improves bin packing efficiency by filling bins more evenly.

* Add context cancellation check to drainReader loop

The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.

* Fix splitOversizedBin to always respect targetSize limit

The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.

* Add integration tests for iceberg table maintenance plugin worker

Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.

Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.

* Fix splitOversizedBin dropping files and add source path to drainReader errors

The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.

Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.

* Harden integration test error handling

- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
  continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
  errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
  CreateEntryResponse.Error in addition to the RPC error

* go fmt

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 11:27:42 -07:00
Chris Lu
47799a5b4f fix tests 2026-03-15 09:44:14 -07:00
Chris Lu
1f2014568f fix(mini): use "all" job type for plugin worker (#8634)
The mini command previously hardcoded a list of specific job types
(vacuum, volume_balance, erasure_coding, admin_script). Use the "all"
category instead so that newly registered handlers are automatically
picked up without requiring changes to the mini command.
2026-03-14 21:43:41 -07:00
Chris Lu
a838661b83 feat(plugin): EC shard balance handler for plugin worker (#8629)
* feat(ec_balance): add TaskTypeECBalance constant and protobuf definitions

Add the ec_balance task type constant to both topology and worker type
systems. Define EcBalanceTaskParams, EcShardMoveSpec, and
EcBalanceTaskConfig protobuf messages for EC shard balance operations.

* feat(ec_balance): add configuration for EC shard balance task

Config includes imbalance threshold, min server count, collection
filter, disk type, and preferred tags for tag-aware placement.

* feat(ec_balance): add multi-phase EC shard balance detection algorithm

Implements four detection phases adapted from the ec.balance shell
command:
1. Duplicate shard detection and removal proposals
2. Cross-rack shard distribution balancing
3. Within-rack node-level shard balancing
4. Global shard count equalization across nodes

Detection is side-effect-free: it builds an EC topology view from
ActiveTopology and generates move proposals without executing them.

* feat(ec_balance): add EC shard move task execution

Implements the shard move sequence using the same VolumeEcShardsCopy,
VolumeEcShardsMount, VolumeEcShardsUnmount, and VolumeEcShardsDelete
RPCs as the shell ec.balance command. Supports both regular shard
moves and dedup-phase deletions (unmount+delete without copy).

* feat(ec_balance): add task registration and scheduling

Register EC balance task definition with auto-config update support.
Scheduling respects max concurrent limits and worker capabilities.

* feat(ec_balance): add plugin handler for EC shard balance

Implements the full plugin handler with detection, execution, admin
and worker config forms, proposal building, and decision trace
reporting. Supports collection/DC/disk type filtering, preferred tag
placement, and configurable detection intervals. Auto-registered via
init() with the handler registry.

* test(ec_balance): add tests for detection algorithm and plugin handler

Detection tests cover: duplicate shard detection, cross-rack imbalance,
within-rack imbalance, global rebalancing, topology building, collection
filtering, and edge cases. Handler tests cover: config derivation with
clamping, proposal building, protobuf encode/decode round-trip, fallback
parameter decoding, capability, and config policy round-trip.

* fix(ec_balance): address PR review feedback and fix CI test failure

- Update TestWorkerDefaultJobTypes to expect 6 handlers (was 5)
- Extract threshold constants (ecBalanceMinImbalanceThreshold, etc.)
  to eliminate magic numbers in Descriptor and config derivation
- Remove duplicate ShardIdsToUint32 helper (use erasure_coding package)
- Add bounds checks for int64→int/uint32 conversions to fix CodeQL
  integer conversion warnings

* fix(ec_balance): address code review findings

storage_impact.go:
- Add TaskTypeECBalance case returning shard-level reservation
  (ShardSlots: -1/+1) instead of falling through to default which
  incorrectly reserves a full volume slot on target.

detection.go:
- Use dc:rack composite key to avoid cross-DC rack name collisions.
  Only create rack entries after confirming node has matching disks.
- Add exceedsImbalanceThreshold check to cross-rack, within-rack,
  and global phases so trivial skews below the configured threshold
  are ignored. Dedup phase always runs since duplicates are errors.
- Reserve destination capacity after each planned move (decrement
  destNode.freeSlots, update rackShardCount/nodeShardCount) to
  prevent overbooking the same destination.
- Skip nodes with freeSlots <= 0 when selecting minNode in global
  balance to avoid proposing moves to full nodes.
- Include loop index and source/target node IDs in TaskID to
  guarantee uniqueness across moves with the same volumeID/shardID.

ec_balance_handler.go:
- Fail fast with error when shard_id is absent in fallback parameter
  decoding instead of silently defaulting to shard 0.

ec_balance_task.go:
- Delegate GetProgress() to BaseTask.GetProgress() so progress
  updates from ReportProgressWithStage are visible to callers.
- Add fail-fast guard rejecting multiple sources/targets until
  batch execution is implemented.

Findings verified but not changed (matches existing codebase pattern
in vacuum/balance/erasure_coding handlers):
- register.go globalTaskDef.Config race: same unsynchronized pattern
  in all 4 task packages.
- CreateTask using generated ID: same fmt.Sprintf pattern in all 4
  task packages.

* fix(ec_balance): harden parameter decoding, progress tracking, and validation

ec_balance_handler.go (decodeECBalanceTaskParams):
- Validate execution-critical fields (Sources[0].Node, ShardIds,
  Targets[0].Node, ShardIds) after protobuf deserialization.
- Require source_disk_id and target_disk_id in legacy fallback path
  so Targets[0].DiskId is populated for VolumeEcShardsCopyRequest.
- All error messages reference decodeECBalanceTaskParams and the
  specific missing field (TaskParams, shard_id, Targets[0].DiskId,
  EcBalanceTaskParams) for debuggability.

ec_balance_task.go:
- Track progress in ECBalanceTask.progress field, updated via
  reportProgress() helper called before ReportProgressWithStage(),
  so GetProgress() returns real stage progress instead of stale 0.
- Validate: require exactly 1 source and 1 target (mirrors Execute
  guard), require ShardIds on both, with error messages referencing
  ECBalanceTask.Validate and the specific field.

* fix(ec_balance): fix dedup execution path, stale topology, collection filter, timeout, and dedupeKey

detection.go:
- Dedup moves now set target=source so isDedupPhase() triggers the
  unmount+delete-only execution path instead of attempting a copy.
- Apply moves to in-memory topology between phases via
  applyMovesToTopology() so subsequent phases see updated shard
  placement and don't conflict with already-planned moves.
- detectGlobalImbalance now accepts allowedVids and filters both
  shard counting and shard selection to respect CollectionFilter.

ec_balance_task.go:
- Apply EcBalanceTaskParams.TimeoutSeconds to the context via
  context.WithTimeout so all RPC operations respect the configured
  timeout instead of hanging indefinitely.

ec_balance_handler.go:
- Include source node ID in dedupeKey so dedup deletions from
  different source nodes for the same shard aren't collapsed.
- Clamp minServerCountRaw and minIntervalRaw lower bounds on int64
  before narrowing to int, preventing undefined overflow on 32-bit.

* fix(ec_balance): log warning before cancelling on progress send failure

Log the error, job ID, job type, progress percentage, and stage
before calling execCancel() in the progress callback so failed
progress sends are diagnosable instead of silently cancelling.
2026-03-14 21:34:53 -07:00
Chris Lu
c4d642b8aa fix(ec): gather shards from all disk locations before rebuild (#8633)
* fix(ec): gather shards from all disk locations before rebuild (#8631)

Fix "too few shards given" error during ec.rebuild on multi-disk volume
servers. The root cause has two parts:

1. VolumeEcShardsRebuild only looked at a single disk location for shard
   files. On multi-disk servers, the existing local shards could be on one
   disk while copied shards were placed on another, causing the rebuild to
   see fewer shards than actually available.

2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 &&
   len(vs.store.Locations) > 0) that was always true, making the
   FindFreeLocation fallback dead code. This meant copies always went to
   Locations[0] regardless of where existing shards were.

Changes:
- VolumeEcShardsRebuild now finds the location with the most shards,
  then gathers shard files from other locations via hard links (or
  symlinks for cross-device) before rebuilding. Gathered files are
  cleaned up after rebuild.
- VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0
  (explicitly set). Otherwise, it prefers the location that already has
  the EC volume, falling back to HDD then any free location.
- generateMissingEcFiles now logs shard counts and provides a clear
  error message when not enough shards are found, instead of passing
  through to the opaque reedsolomon "too few shards given" error.

* fix(ec): update test to match skip behavior for unrepairable volumes

The test expected an error for volumes with insufficient shards, but
commit 5acb4578a changed unrepairable volumes to be skipped with a log
message instead of returning an error. Update the test to verify the
skip behavior and log output.

* fix(ec): address PR review comments

- Add comment clarifying DiskId=0 means "not specified" (protobuf default),
  callers must use DiskId >= 1 to target a specific disk.
- Log warnings on cleanup failures for gathered shard links.

* fix(ec): read shard files from other disks directly instead of linking

Replace the hard link / symlink gathering approach with passing
additional search directories into RebuildEcFiles. The rebuild
function now opens shard files directly from whichever disk they
live on, avoiding filesystem link operations and cleanup.

RebuildEcFiles and RebuildEcFilesWithContext gain a variadic
additionalDirs parameter (backward compatible with existing callers).

* fix(ec): clarify DiskId selection semantics in VolumeEcShardsCopy comment

* fix(ec): avoid empty files on failed rebuild; don't skip ecx-only locations

- generateMissingEcFiles: two-pass approach — first discover present/missing
  shards and check reconstructability, only then create output files. This
  avoids leaving behind empty truncated shard files when there are too few
  shards to rebuild.

- VolumeEcShardsRebuild: compute hasEcx before skipping zero-shard locations.
  A location with an .ecx file but no shard files (all shards on other disks)
  is now a valid rebuild candidate instead of being silently skipped.

* fix(ec): select ecx-only location as rebuildLocation when none chosen yet

When rebuildLocation is nil and a location has hasEcx=true but
existingShardCount=0 (all shards on other disks), the condition
0 > 0 was false so it was never promoted to rebuildLocation.
Add rebuildLocation == nil to the predicate so the first location
with an .ecx file is always selected as a candidate.
2026-03-14 20:59:47 -07:00
Chris Lu
5acb4578ab Fix ec.rebuild failing on unrepairable volumes instead of skipping (#8632)
* Fix ec.rebuild failing on unrepairable volumes instead of skipping them

When an EC volume has fewer shards than DataShardsCount, ec.rebuild would
return an error and abort the entire operation. Now it logs a warning and
continues rebuilding the remaining volumes.

Fixes #8630

* Remove duplicate volume ID in unrepairable log message

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-14 16:18:29 -07:00
Chris Lu
2f51a94416 feat(vacuum): add volume state and location filters to vacuum handler (#8625)
* feat(vacuum): add volume state, location, and enhanced collection filters

Align the vacuum handler's admin config with the balance handler by adding:
- volume_state filter (ALL/ACTIVE/FULL) to scope vacuum to writable or
  read-only volumes
- data_center_filter, rack_filter, node_filter to scope vacuum to
  specific infrastructure locations
- Enhanced collection_filter description matching the balance handler's
  ALL_COLLECTIONS/EACH_COLLECTION/regex modes

The new filters reuse filterMetricsByVolumeState() and
filterMetricsByLocation() already defined in the same package.

* use wildcard matchers for DC/rack/node filters

Replace exact-match and CSV set lookups with wildcard matching
from util/wildcard package. Patterns like "dc*", "rack-1?", or
"node-a*" are now supported in all location filter fields for
both balance and vacuum handlers.

* add nil guard in filterMetricsByLocation
2026-03-13 23:41:58 -07:00
Chris Lu
6fc0489dd8 feat(plugin): make page tabs and sub-tabs addressable by URLs (#8626)
* feat(plugin): make page tabs and sub-tabs addressable by URLs

Update the plugin page so that clicking tabs and sub-tabs pushes
browser history via history.pushState(), enabling bookmarkable URLs,
browser back/forward navigation, and shareable links.

URL mapping:
  - /plugin              → Overview tab
  - /plugin/configuration → Configuration sub-tab
  - /plugin/detection     → Job Detection sub-tab
  - /plugin/queue         → Job Queue sub-tab
  - /plugin/execution     → Job Execution sub-tab

Job-type-specific URLs use the ?job= query parameter (e.g.,
/plugin/configuration?job=vacuum) so that a specific job type tab
is pre-selected on page load.

Changes:
- Add initialJob parameter to Plugin() template and handler
- Extract ?job= query param in renderPluginPage handler
- Add buildPluginURL/updateURL helpers in JavaScript
- Push history state on top-tab, sub-tab, and job-type clicks
- Listen for popstate to restore tab state on back/forward
- Replace initial history entry on page load via replaceState

* make popstate handler async with proper error handling

Await loadDescriptorAndConfig so data loading completes before
rendering dependent views. Log errors instead of silently
swallowing them.
2026-03-13 23:02:43 -07:00
Chris Lu
baae672b6f feat: auto-disable master vacuum when plugin worker is active (#8624)
* feat: auto-disable master vacuum when plugin vacuum worker is active

When a vacuum-capable plugin worker connects to the admin server, the
admin server calls DisableVacuum on the master to prevent the automatic
scheduled vacuum from conflicting with the plugin worker's vacuum. When
the worker disconnects, EnableVacuum is called to restore the default
behavior. A safety net in the topology refresh loop re-enables vacuum
if the admin server disconnects without cleanup.

* rename isAdminServerConnected to isAdminServerConnectedFunc

* add 5s timeout to DisableVacuum/EnableVacuum gRPC calls

Prevents the monitor goroutine from blocking indefinitely if the
master is unresponsive.

* track plugin ownership of vacuum disable to avoid overriding operator

- Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum
  is called while admin server is connected (i.e., by plugin monitor)
- Safety net only re-enables vacuum when it was disabled by plugin,
  not when an operator intentionally disabled it via shell command
- EnableVacuum clears the plugin flag

* extract syncVacuumState for testability, add fake toggler tests

Extract the single sync step into syncVacuumState() with a
vacuumToggler interface. Add TestSyncVacuumState with a fake
toggler that verifies disable/enable calls on state transitions.

* use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin

Both fields are written by gRPC handlers and read by the vacuum
goroutine, causing a data race. Use atomic.Bool with Store/Load
for thread-safe access.

* use explicit by_plugin field instead of connection heuristic

Add by_plugin bool to DisableVacuumRequest proto so the caller
declares intent explicitly. The admin server monitor sets it to
true; shell commands leave it false. This prevents an operator's
intentional disable from being auto-reversed by the safety net.

* use setter for admin server callback instead of function parameter

Move isAdminServerConnected from StartRefreshWritableVolumes
parameter to Topology.SetAdminServerConnectedFunc() setter.
Keeps the function signature stable and decouples the topology
layer from the admin server concept.

* suppress repeated log messages on persistent sync failures

Add retrying parameter to syncVacuumState so the initial
state transition is logged at V(0) but subsequent retries
of the same transition are silent until the call succeeds.

* clear plugin ownership flag on manual DisableVacuum

Prevents stale plugin flag from causing incorrect auto-enable
when an operator manually disables vacuum after a plugin had
previously disabled it.

* add by_plugin to EnableVacuumRequest for symmetric ownership tracking

Plugin-driven EnableVacuum now only re-enables if the plugin was
the one that disabled it. If an operator manually disabled vacuum
after the plugin, the plugin's EnableVacuum is a no-op. This
prevents the plugin monitor from overriding operator intent on
worker disconnect.

* use cancellable context for monitorVacuumWorker goroutine

Replace context.Background() with a cancellable context stored
as bgCancel on AdminServer. Shutdown() calls bgCancel() so
monitorVacuumWorker exits cleanly via ctx.Done().

* track operator and plugin vacuum disables independently

Replace single isDisableVacuum flag with two independent flags:
vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller
only flips its own flag. The effective disabled state is the OR
of both. This prevents a plugin connect/disconnect cycle from
overriding an operator's manual disable, and vice versa.

* fix safety net to clear plugin flag, not operator flag

The safety net should call EnableVacuumByPlugin() to clear only
the plugin disable flag when the admin server disconnects. The
previous call to EnableVacuum() incorrectly cleared the operator
flag instead.
2026-03-13 22:49:12 -07:00
Chris Lu
89ccb6d825 use constants 2026-03-13 18:11:08 -07:00
Chris Lu
f48725a31d add more tests 2026-03-13 17:44:44 -07:00
Chris Lu
8056b702ba feat(balance): replica placement validation for volume moves (#8622)
* feat(balance): add replica placement validation for volume moves

When the volume balance detection proposes moving a volume, validate
that the move does not violate the volume's replication policy (e.g.,
ReplicaPlacement=010 requires replicas on different racks). If the
preferred destination violates the policy, fall back to score-based
planning; if that also violates, skip the volume entirely.

- Add ReplicaLocation type and VolumeReplicaMap to ClusterInfo
- Build replica map from all volumes before collection filtering
- Port placement validation logic from command_volume_fix_replication.go
- Thread replica map through collectVolumeMetrics call chain
- Add IsGoodMove check in createBalanceTask before destination use

* address PR review: extract validation closure, add defensive checks

- Extract validateMove closure to eliminate duplicated ReplicaLocation
  construction and IsGoodMove calls
- Add defensive check for empty replica map entries (len(replicas) == 0)
- Add bounds check for int-to-byte cast on ExpectedReplicas (0-255)

* address nitpick: rp test helper accepts *testing.T and fails on error

Prevents silent failures from typos in replica placement codes.

* address review: add composite replica placement tests (011, 110)

Test multi-constraint placement policies where both rack and DC
rules must be satisfied simultaneously.

* address review: use struct keys instead of string concatenation

Replace string-concatenated map keys with typed rackKey/nodeKey
structs to eliminate allocations and avoid ambiguity if IDs
contain spaces.

* address review: simplify bounds check, log fallback error, guard source

- Remove unreachable ExpectedReplicas < 0 branch (outer condition
  already guarantees > 0), fold bounds check into single condition
- Log error from planBalanceDestination in replica validation fallback
- Return false from IsGoodMove when sourceNodeID not found in
  existing replicas (inconsistent cluster state)

* address review: use slices.Contains instead of hand-rolled helpers

Replace isAmongDC and isAmongRack with slices.Contains from the
standard library, reducing boilerplate.
2026-03-13 17:39:25 -07:00
Chris Lu
47ddf05d95 feat(plugin): DC/rack/node filtering for volume balance (#8621)
* feat(plugin): add DC/rack/node filtering for volume balance detection

Add scoping filters so balance detection can be limited to specific data
centers, racks, or nodes. Filters are applied both at the metrics level
(in the handler) and at the topology seeding level (in detection) to
ensure only the targeted infrastructure participates in balancing.

* address PR review: use set lookups, deduplicate test helpers, add target checks

* address review: assert non-empty tasks in filter tests

Prevent vacuous test passes by requiring len(tasks) > 0
before checking source/target exclusions.

* address review: enforce filter scope in fallback, clarify DC filter

- Thread allowedServers into createBalanceTask so the fallback
  planner cannot produce out-of-scope targets when DC/rack/node
  filters are active
- Update data_center_filter description to clarify single-DC usage

* address review: centralize parseCSVSet, fix filter scope leak, iterate all targets

- Extract ParseCSVSet to shared weed/worker/tasks/util package,
  remove duplicates from detection.go and volume_balance_handler.go
- Fix metric accumulation re-introducing filtered-out servers by
  only counting metrics for servers that passed DC/rack/node filters
- Trim DataCenterFilter before matching to handle trailing spaces
- Iterate all task.TypedParams.Targets in filter tests, not just [0]

* remove useless descriptor string test
2026-03-13 17:03:37 -07:00
Chris Lu
00ce1c6eba feat(plugin): enhanced collection filtering for volume balance (#8620)
* feat(plugin): enhanced collection filtering for volume balance

Replace wildcard matching with three collection filter modes:
- ALL_COLLECTIONS (default): treat all volumes as one pool
- EACH_COLLECTION: run detection separately per collection
- Regex pattern: filter volumes by matching collection names

The EACH_COLLECTION mode extracts distinct collections from metrics
and calls Detection() per collection, sharing the maxResults budget
and clusterInfo (with ActiveTopology) across all calls.

* address PR review: fix wildcard→regexp replacement, optimize EACH_COLLECTION

* address nitpick: fail fast on config errors (invalid regex)

Add configError type so invalid collection_filter regex returns
immediately instead of retrying across all masters with the same
bad config. Transient errors still retry.

* address review: constants, unbounded maxResults, wildcard compat

- Define collectionFilterAll/collectionFilterEach constants to
  eliminate magic strings across handler and metrics code
- Fix EACH_COLLECTION budget loop to treat maxResults <= 0 as
  unbounded, matching Detection's existing semantics
- Treat "*" as ALL_COLLECTIONS for backward compat with wildcard

* address review: nil guard in EACH_COLLECTION grouping loop

* remove useless descriptor string test
2026-03-13 17:02:59 -07:00
Lars Lehtonen
577a8459c9 fix(mount): return dropped error (#8623) 2026-03-13 15:57:35 -07:00
Chris Lu
34fe289f32 feat(balance): add volume state filter (ALL/ACTIVE/FULL) (#8619)
* feat(balance): add volume state filter (ALL/ACTIVE/FULL)

Add a volume_state admin config field to the plugin worker volume balance
handler, matching the shell's -volumeBy flag. This allows filtering volumes
by state before balance detection:
- ALL (default): consider all volumes
- ACTIVE: only writable volumes below the size limit (FullnessRatio < 1.01)
- FULL: only read-only volumes above the size limit (FullnessRatio >= 1.01)

The 1.01 threshold mirrors the shell's thresholdVolumeSize constant.

* address PR review: use enum/select widget, switch-based filter, nil safety

- Change volume_state field from string/text to enum/select with
  dropdown options (ALL, ACTIVE, FULL)
- Refactor filterMetricsByVolumeState to use switch with predicate
  function for clearer extensibility
- Add nil-check guard to prevent panic on nil metric elements
- Add TestFilterMetricsByVolumeState_NilElement regression test
2026-03-13 15:56:03 -07:00