94 Commits

Author SHA1 Message Date
Chris Lu
75a6a34528 dlm: resilient distributed locks via consistent hashing + backup replication (#8860)
* dlm: replace modulo hashing with consistent hash ring

Introduce HashRing with virtual nodes (CRC32-based consistent hashing)
to replace the modulo-based hashKeyToServer. When a filer node is
removed, only keys that hashed to that node are remapped to the next
server on the ring, leaving all other mappings stable. This is the
foundation for backup replication — the successor on the ring is
always the natural takeover node.

* dlm: add Generation and IsBackup fields to Lock

Lock now carries IsBackup (whether this node holds the lock as a backup
replica) and Generation (a monotonic fencing token that increments on
each fresh acquisition, stays the same on renewal). Add helper methods:
AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock.

* dlm: add ReplicateLock RPC and generation/is_backup proto fields

Add generation field to LockResponse for fencing tokens.
Add generation and is_backup fields to Lock message.
Add ReplicateLock RPC for primary-to-backup lock replication.
Add ReplicateLockRequest/ReplicateLockResponse messages.

* dlm: add async backup replication to DistributedLockManager

Route lock/unlock via consistent hash ring's GetPrimaryAndBackup().
After a successful lock or unlock on the primary, asynchronously
replicate the operation to the backup server via ReplicateFunc
callback. Single-server deployments skip replication.

* dlm: add ReplicateLock handler and backup-aware topology changes

Add ReplicateLock gRPC handler for primary-to-backup replication.
Revise OnDlmChangeSnapshot to handle three cases on topology change:
- Promote backup locks when this node becomes primary
- Demote primary locks when this node becomes backup
- Transfer locks when this node is neither primary nor backup
Wire up SetupDlmReplication during filer server initialization.

* dlm: expose generation fencing token in lock client

LiveLock now captures the generation from LockResponse and exposes it
via Generation() method. Consumers can use this as a fencing token to
detect stale lock holders.

* dlm: update empty folder cleaner to use consistent hash ring

Replace local modulo-based hashKeyToServer with LockRing.GetPrimary()
which uses the shared consistent hash ring for folder ownership.

* dlm: add unit tests for consistent hash ring

Test basic operations, consistency on server removal (only keys from
removed server move), backup-is-successor property (backup becomes
new primary when primary is removed), and key distribution balance.

* dlm: add integration tests for lock replication failure scenarios

Test cases:
- Primary crash with backup promotion (backup has valid token)
- Backup crash with primary continuing
- Both primary and backup crash (lock lost, re-acquirable)
- Rolling restart across all nodes
- Generation fencing token increments on new acquisition
- Replication failure (primary still works independently)
- Unlock replicates deletion to backup
- Lock survives server addition (topology change)
- Consistent hashing minimal disruption (only removed server's keys move)

* dlm: address PR review findings

1. Causal replication ordering: Add per-lock sequence number (Seq) that
   increments on every mutation. Backup rejects incoming mutations with
   seq <= current seq, preventing stale async replications from
   overwriting newer state. Unlock replication also carries seq and is
   rejected if stale.

2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to
   the new primary first and only demotes to backup after a successful
   TransferLocks RPC. If the transfer fails, the lock stays as primary
   on this node.

3. SetSnapshot candidateServers leak: Replace the candidateServers map
   entirely instead of appending, so removed servers don't linger.

4. TransferLocks preserves Generation and Seq: InsertLock now accepts
   generation and seq parameters. After accepting a transferred lock,
   the receiving node re-replicates to its backup.

5. Rolling restart test: Add re-replication step after promotion and
   assert survivedCount > 0. Add TestDLM_StaleReplicationRejected.

6. Mixed-version upgrade note: Add comment on HashRing documenting that
   all filer nodes must be upgraded together.

* dlm: serve renewals locally during transfer window on node join

When a new node joins and steals hash ranges from surviving nodes,
there's a window between ring update and lock transfer where the
client gets redirected to a node that doesn't have the lock yet.

Fix: if the ring says primary != self but we still hold the lock
locally (non-backup, matching token), serve the renewal/unlock here
rather than redirecting. The lock will be transferred by
OnDlmChangeSnapshot, and subsequent requests will go to the new
primary once the transfer completes.

Add tests:
- TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption
  when a node drops and a new one joins (14/100 surviving-node locks
  disrupted, all handled by transfer logic)
- TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on
  old primary during the transfer window

* dlm: master-managed lock ring with stabilization batching

The master now owns the lock ring membership. Instead of filers
independently reacting to individual ClusterNodeUpdate add/remove
events, the master:

1. Tracks filer membership in LockRingManager
2. Batches rapid changes with a 1-second stabilization timer
   (e.g., a node drop + join within 1 second → single ring update)
3. Broadcasts the complete ring snapshot atomically via the new
   LockRingUpdate message in KeepConnectedResponse

Filers receive the ring as a complete snapshot and apply it via
SetSnapshot, ensuring all filers converge to the same ring state
without intermediate churn.

This eliminates the double-churn problem where a rapid drop+join
would fire two separate ring mutations, each triggering lock
transfers and disrupting ownership on surviving nodes.

* dlm: track ring version, reject stale updates, remove dead code

SetSnapshot now takes a version parameter from the master. Stale
updates (version < current) are rejected, preventing reordered
messages from overwriting a newer ring state. Version 0 is always
accepted for bootstrap.

Remove AddServer/RemoveServer from LockRing — the ring is now
exclusively managed by the master via SetSnapshot. Remove the
candidateServers map that was only used by those methods.

* dlm: fix SelectLocks data race, advance generation on backup insert

- SelectLocks: change RLock to Lock since the function deletes map
  entries, which is a write operation and causes a data race under RLock.
- InsertBackupLock: advance nextGeneration to at least the incoming
  generation so that after failover promotion, new lock acquisitions
  get a generation strictly greater than any replicated lock.
- Bump replication failure log from V(1) to Warningf for production
  visibility.

* dlm: fix SetSnapshot race, test reliability, timer edge cases

- SetSnapshot: hold LockRing lock through both version update and
  Ring.SetServers() so they're atomic. Prevents a concurrent caller
  from seeing the new version but applying stale servers.
- Transfer window test: search for a key that actually moves primary
  when filer4 joins, instead of relying on a fixed key that may not.
- renewLock redirect: pass the existing token to the new primary
  instead of empty string, so redirected renewals work correctly.
- scheduleBroadcast: check timer.Stop() return value. If the timer
  already fired, the callback picks up latest state.
- FlushPending: only broadcast if timer.Stop() returns true (timer
  was still pending). If false, the callback is already running.
- Fix test comment: "idempotent" → "accepted, state-changing".

* dlm: use wall-clock nanoseconds for lock ring version

The lock ring version was an in-memory counter that reset to 0 on
master restart. A filer that had seen version 5 would reject version 1
from the restarted master.

Fix: use time.Now().UnixNano() as the version. This survives master
restarts without persistence — the restarted master produces a
version greater than any pre-restart value.

* dlm: treat expired lock owners as missing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: reject stale lock transfers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: order replication by generation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: bootstrap lock ring on reconnect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 23:29:56 -07:00
Chris Lu
baae672b6f feat: auto-disable master vacuum when plugin worker is active (#8624)
* feat: auto-disable master vacuum when plugin vacuum worker is active

When a vacuum-capable plugin worker connects to the admin server, the
admin server calls DisableVacuum on the master to prevent the automatic
scheduled vacuum from conflicting with the plugin worker's vacuum. When
the worker disconnects, EnableVacuum is called to restore the default
behavior. A safety net in the topology refresh loop re-enables vacuum
if the admin server disconnects without cleanup.

* rename isAdminServerConnected to isAdminServerConnectedFunc

* add 5s timeout to DisableVacuum/EnableVacuum gRPC calls

Prevents the monitor goroutine from blocking indefinitely if the
master is unresponsive.

* track plugin ownership of vacuum disable to avoid overriding operator

- Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum
  is called while admin server is connected (i.e., by plugin monitor)
- Safety net only re-enables vacuum when it was disabled by plugin,
  not when an operator intentionally disabled it via shell command
- EnableVacuum clears the plugin flag

* extract syncVacuumState for testability, add fake toggler tests

Extract the single sync step into syncVacuumState() with a
vacuumToggler interface. Add TestSyncVacuumState with a fake
toggler that verifies disable/enable calls on state transitions.

* use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin

Both fields are written by gRPC handlers and read by the vacuum
goroutine, causing a data race. Use atomic.Bool with Store/Load
for thread-safe access.

* use explicit by_plugin field instead of connection heuristic

Add by_plugin bool to DisableVacuumRequest proto so the caller
declares intent explicitly. The admin server monitor sets it to
true; shell commands leave it false. This prevents an operator's
intentional disable from being auto-reversed by the safety net.

* use setter for admin server callback instead of function parameter

Move isAdminServerConnected from StartRefreshWritableVolumes
parameter to Topology.SetAdminServerConnectedFunc() setter.
Keeps the function signature stable and decouples the topology
layer from the admin server concept.

* suppress repeated log messages on persistent sync failures

Add retrying parameter to syncVacuumState so the initial
state transition is logged at V(0) but subsequent retries
of the same transition are silent until the call succeeds.

* clear plugin ownership flag on manual DisableVacuum

Prevents stale plugin flag from causing incorrect auto-enable
when an operator manually disables vacuum after a plugin had
previously disabled it.

* add by_plugin to EnableVacuumRequest for symmetric ownership tracking

Plugin-driven EnableVacuum now only re-enables if the plugin was
the one that disabled it. If an operator manually disabled vacuum
after the plugin, the plugin's EnableVacuum is a no-op. This
prevents the plugin monitor from overriding operator intent on
worker disconnect.

* use cancellable context for monitorVacuumWorker goroutine

Replace context.Background() with a cancellable context stored
as bgCancel on AdminServer. Shutdown() calls bgCancel() so
monitorVacuumWorker exits cleanly via ctx.Done().

* track operator and plugin vacuum disables independently

Replace single isDisableVacuum flag with two independent flags:
vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller
only flips its own flag. The effective disabled state is the OR
of both. This prevents a plugin connect/disconnect cycle from
overriding an operator's manual disable, and vice versa.

* fix safety net to clear plugin flag, not operator flag

The safety net should call EnableVacuumByPlugin() to clear only
the plugin disable flag when the admin server disconnects. The
previous call to EnableVacuum() incorrectly cleared the operator
flag instead.
2026-03-13 22:49:12 -07:00
Chris Lu
b3620c7e14 admin: auto migrating master maintenance scripts to admin_script plugin config (#8509)
* admin: seed admin_script plugin config from master maintenance scripts

When the admin server starts, fetch the maintenance scripts configuration
from the master via GetMasterConfiguration. If the admin_script plugin
worker does not already have a saved config, use the master's scripts as
the default value. This enables seamless migration from master.toml
[master.maintenance] to the admin script plugin worker.

Changes:
- Add maintenance_scripts and maintenance_sleep_minutes fields to
  GetMasterConfigurationResponse in master.proto
- Populate the new fields from viper config in master_grpc_server.go
- On admin server startup, fetch the master config and seed the
  admin_script plugin config if no config exists yet
- Strip lock/unlock commands from the master scripts since the admin
  script worker handles locking automatically

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address review comments on admin_script seeding

- Replace TOCTOU race (separate Load+Save) with atomic
  SaveJobTypeConfigIfNotExists on ConfigStore and Plugin
- Replace ineffective polling loop with single GetMaster call using
  30s context timeout, since GetMaster respects context cancellation
- Add unit tests for SaveJobTypeConfigIfNotExists (in-memory + on-disk)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: apply maintenance script defaults in gRPC handler

The gRPC handler for GetMasterConfiguration read maintenance scripts
from viper without calling SetDefault, relying on startAdminScripts
having run first. If the admin server calls GetMasterConfiguration
before startAdminScripts sets the defaults, viper returns empty
strings and the seeding is silently skipped.

Apply SetDefault in the gRPC handler itself so it is self-contained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix: apply maintenance script defaults in gRPC handler"

This reverts commit 068a5063303f6bc34825a07bb681adfa67e6f9de.

* fix: use atomic save in ensureJobTypeConfigFromDescriptor

ensureJobTypeConfigFromDescriptor used a separate Load + Save, racing
with seedAdminScriptFromMaster. If the descriptor defaults (empty
script) were saved first, SaveJobTypeConfigIfNotExists in the seeding
goroutine would see an existing config and skip, losing the master's
maintenance scripts.

Switch to SaveJobTypeConfigIfNotExists so both paths are atomic. Whichever
wins, the other is a safe no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: fetch master scripts inline during config bootstrap, not in goroutine

Replace the seedAdminScriptFromMaster goroutine with a
ConfigDefaultsProvider callback. When the plugin bootstraps
admin_script defaults from the worker descriptor, it calls the
provider which fetches maintenance scripts from the master
synchronously. This eliminates the race between the seeding
goroutine and the descriptor-based config bootstrap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* skip commented lock unlock

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* reduce grpc calls

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-04 22:11:07 -08:00
Chris Lu
f5c35240be Add volume dir tags and EC placement priority (#8472)
* Add volume dir tags to topology

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add preferred tag config for EC

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Prioritize EC destinations by tags

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add EC placement planner tag tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactor EC placement tests to reuse buildActiveTopology

Remove buildActiveTopologyWithDiskTags helper function and consolidate
tag setup inline in test cases. Tests now use UpdateTopology to apply
tags after topology creation, reusing the existing buildActiveTopology
function rather than duplicating its logic.

All tag scenario tests pass:
- TestECPlacementPlannerPrefersTaggedDisks
- TestECPlacementPlannerFallsBackWhenTagsInsufficient

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Consolidate normalizeTagList into shared util package

Extract normalizeTagList from three locations (volume.go,
detection.go, erasure_coding_handler.go) into new weed/util/tag.go
as exported NormalizeTagList function. Replace all duplicate
implementations with imports and calls to util.NormalizeTagList.

This improves code reuse and maintainability by centralizing
tag normalization logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add PreferredTags to EC config persistence

Add preferred_tags field to ErasureCodingTaskConfig protobuf with field
number 5. Update GetConfigSpec to include preferred_tags field in the
UI configuration schema. Add PreferredTags to ToTaskPolicy to serialize
config to protobuf. Add PreferredTags to FromTaskPolicy to deserialize
from protobuf with defensive copy to prevent external mutation.

This allows EC preferred tags to be persisted and restored across
worker restarts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add defensive copy for Tags slice in DiskLocation

Copy the incoming tags slice in NewDiskLocation instead of storing
by reference. This prevents external callers from mutating the
DiskLocation.Tags slice after construction, improving encapsulation
and preventing unexpected changes to disk metadata.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add doc comment to buildCandidateSets method

Document the tiered candidate selection and fallback behavior. Explain
that for a planner with preferredTags, it accumulates disks matching
each tag in order into progressively larger tiers, emits a candidate
set once a tier reaches shardsNeeded, and finally falls back to the
full candidates set if preferred-tag tiers are insufficient.

This clarifies the intended semantics for future maintainers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Apply final PR review fixes

1. Update parseVolumeTags to replicate single tag entry to all folders
   instead of leaving some folders with nil tags. This prevents nil
   pointer dereferences when processing folders without explicit tags.

2. Add defensive copy in ToTaskPolicy for PreferredTags slice to match
   the pattern used in FromTaskPolicy, preventing external mutation of
   the returned TaskPolicy.

3. Add clarifying comment in buildCandidateSets explaining that the
   shardsNeeded <= 0 branch is a defensive check for direct callers,
   since selectDestinations guarantees shardsNeeded > 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix nil pointer dereference in parseVolumeTags

Ensure all folder tags are initialized to either normalized tags or
empty slices, not nil. When multiple tag entries are provided and there
are more folders than entries, remaining folders now get empty slices
instead of nil, preventing nil pointer dereference in downstream code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix NormalizeTagList to return empty slice instead of nil

Change NormalizeTagList to always return a non-nil slice. When all tags
are empty or whitespace after normalization, return an empty slice
instead of nil. This prevents nil pointer dereferences in downstream
code that expects a valid (possibly empty) slice.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add nil safety check for v.tags pointer

Add a safety check to handle the case where v.tags might be nil,
preventing a nil pointer dereference. If v.tags is nil, use an empty
string instead. This is defensive programming to prevent panics in
edge cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add volume.tags flag to weed server and weed mini commands

Add the volume.tags CLI option to both the 'weed server' and 'weed mini'
commands. This allows users to specify disk tags when running the
combined server modes, just like they can with 'weed volume'.

The flag uses the same format and description as the volume command:
comma-separated tag groups per data dir with ':' separators
(e.g. fast:ssd,archive).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-01 10:22:00 -08:00
Lisandro Pin
2af293ce60 Boostrap persistent state for volume servers. (#7984)
This PR implements logic load/save persistent state information for storages
associated with volume servers, and reporting state changes back to masters
via heartbeat messages.

More work ensues!

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.
2026-01-12 10:49:59 -08:00
Chris Lu
f4cdfcc5fd Add cluster.raft.leader.transfer command for graceful leader change (#7819)
* proto: add RaftLeadershipTransfer RPC for forced leader change

Add new gRPC RPC and messages for leadership transfer:
- RaftLeadershipTransferRequest: optional target_id and target_address
- RaftLeadershipTransferResponse: previous_leader and new_leader

This enables graceful leadership transfer before master maintenance,
reducing errors in filers during planned maintenance windows.

Ref: https://github.com/seaweedfs/seaweedfs/issues/7527

* proto: regenerate Go files for RaftLeadershipTransfer

Generated from master.proto changes.

* master: implement RaftLeadershipTransfer gRPC handler

Add gRPC handler for leadership transfer with support for:
- Transfer to any eligible follower (when target_id is empty)
- Transfer to a specific server (when target_id and target_address are provided)

Uses hashicorp/raft LeadershipTransfer() and LeadershipTransferToServer() APIs.

Returns the previous and new leader in the response.

* shell: add cluster.raft.leader.transfer command

Add weed shell command for graceful leadership transfer:
- Displays current cluster status before transfer
- Supports auto-selection of target (any eligible follower)
- Supports targeted transfer with -id and -address flags
- Provides clear feedback on success/failure with troubleshooting tips

Usage:
  cluster.raft.leader.transfer
  cluster.raft.leader.transfer -id <server_id> -address <grpc_address>

* master: add unit tests for raft gRPC handlers

Add tests covering:
- RaftLeadershipTransfer with no raft initialized
- RaftLeadershipTransfer with target_id but no address
- RaftListClusterServers with no raft initialized
- RaftAddServer with no raft initialized
- RaftRemoveServer with no raft initialized

These tests verify error handling when raft is not configured.

* shell: add tests for cluster.raft.leader.transfer command

Add tests covering:
- Command name and help text validation
- HasTag returns false for ResourceHeavy
- Validation of -id without -address
- Argument parsing with unknown flags

* master: clarify that leadership transfer requires -raftHashicorp

The default raft implementation (seaweedfs/raft, a goraft fork) does not
support graceful leadership transfer. This feature is only available when
using hashicorp raft (-raftHashicorp=true).

Update error messages and help text to make this requirement clear:
- gRPC handler returns specific error for goraft users
- Shell command help text notes the requirement
- Added test for goraft case

* test: use strings.Contains instead of custom helper

Replace custom contains/containsHelper functions with the standard
library strings.Contains for better maintainability.

* shell: return flag parsing errors instead of swallowing them

- Return the error from flag.Parse() instead of returning nil
- Update test to explicitly assert error for unknown flags

* test: document integration test scenarios for Raft leadership transfer

Add comments explaining:
- Why these unit tests only cover 'Raft not initialized' scenarios
- What integration tests should cover (with multi-master cluster)
- hashicorp/raft uses concrete types that cannot be easily mocked

* fix: address reviewer feedback on tests and leader routing

- Remove misleading tests that couldn't properly validate their
  documented behavior without a real Raft cluster:
  - TestRaftLeadershipTransfer_GoraftNotSupported
  - TestRaftLeadershipTransfer_ValidationTargetIdWithoutAddress

- Change WithClient(false) to WithClient(true) for RaftLeadershipTransfer
  RPC to ensure the request is routed to the current leader

* Improve cluster.raft.transferLeader command

- Rename command from cluster.raft.leader.transfer to cluster.raft.transferLeader
- Add symmetric validation: -id and -address must be specified together
- Handle case where same leader is re-elected after transfer
- Add test for -address without -id validation
- Add docker compose file for 5-master raft cluster testing
2025-12-19 00:15:39 -08:00
Chris Lu
5ed0b00fb9 Support separate volume server ID independent of RPC bind address (#7609)
* pb: add id field to Heartbeat message for stable volume server identification

This adds an 'id' field to the Heartbeat protobuf message that allows
volume servers to identify themselves independently of their IP:port address.

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* storage: add Id field to Store struct

Add Id field to Store struct and include it in CollectHeartbeat().
The Id field provides a stable volume server identity independent of IP:port.

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* topology: support id-based DataNode identification

Update GetOrCreateDataNode to accept an id parameter for stable node
identification. When id is provided, the DataNode can maintain its identity
even when its IP address changes (e.g., in Kubernetes pod reschedules).

For backward compatibility:
- If id is provided, use it as the node ID
- If id is empty, fall back to ip:port

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* volume: add -id flag for stable volume server identity

Add -id command line flag to volume server that allows specifying a stable
identifier independent of the IP address. This is useful for Kubernetes
deployments with hostPath volumes where pods can be rescheduled to different
nodes while the persisted data remains on the original node.

Usage: weed volume -id=node-1 -ip=10.0.0.1 ...

If -id is not specified, it defaults to ip:port for backward compatibility.

Fixes https://github.com/seaweedfs/seaweedfs/issues/7487

* server: add -volume.id flag to weed server command

Support the -volume.id flag in the all-in-one 'weed server' command,
consistent with the standalone 'weed volume' command.

Usage: weed server -volume.id=node-1 ...

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* topology: add test for id-based DataNode identification

Test the key scenarios:
1. Create DataNode with explicit id
2. Same id with different IP returns same DataNode (K8s reschedule)
3. IP/PublicUrl are updated when node reconnects with new address
4. Different id creates new DataNode
5. Empty id falls back to ip:port (backward compatibility)

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* pb: add address field to DataNodeInfo for proper node addressing

Previously, DataNodeInfo.Id was used as the node address, which worked
when Id was always ip:port. Now that Id can be an explicit string,
we need a separate Address field for connection purposes.

Changes:
- Add 'address' field to DataNodeInfo protobuf message
- Update ToDataNodeInfo() to populate the address field
- Update NewServerAddressFromDataNode() to use Address (with Id fallback)
- Fix LookupEcVolume to use dn.Url() instead of dn.Id()

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* fix: trim whitespace from volume server id and fix test

- Trim whitespace from -id flag to treat ' ' as empty
- Fix store_load_balancing_test.go to include id parameter in NewStore call

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* refactor: extract GetVolumeServerId to util package

Move the volume server ID determination logic to a shared utility function
to avoid code duplication between volume.go and rack.go.

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* fix: improve transition logic for legacy nodes

- Use exact ip:port match instead of net.SplitHostPort heuristic
- Update GrpcPort and PublicUrl during transition for consistency
- Remove unused net import

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487

* fix: add id normalization and address change logging

- Normalize id parameter at function boundary (trim whitespace)
- Log when DataNode IP:Port changes (helps debug K8s pod rescheduling)

Ref: https://github.com/seaweedfs/seaweedfs/issues/7487
2025-12-02 22:08:11 -08:00
Chris Lu
9d013ea9b8 Admin UI: include ec shard sizes into volume server info (#7071)
* show ec shards on dashboard, show max in its own column

* master collect shard size info

* master send shard size via VolumeList

* change to more efficient shard sizes slice

* include ec shard sizes into volume server info

* Eliminated Redundant gRPC Calls

* much more efficient

* Efficient Counting: bits.OnesCount32() uses CPU-optimized instructions to count set bits in O(1)

* avoid extra volume list call

* simplify

* preserve existing shard sizes

* avoid hard coded value

* Update weed/storage/erasure_coding/ec_volume_info.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update weed/admin/dash/volume_management.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update ec_volume_info.go

* address comments

* avoid duplicated functions

* Update weed/admin/dash/volume_management.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* simplify

* refactoring

* fix compilation

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-02 02:16:49 -07:00
Chris Lu
891a2fb6eb Admin: misc improvements on admin server and workers. EC now works. (#7055)
* initial design

* added simulation as tests

* reorganized the codebase to move the simulation framework and tests into their own dedicated package

* integration test. ec worker task

* remove "enhanced" reference

* start master, volume servers, filer

Current Status
 Master: Healthy and running (port 9333)
 Filer: Healthy and running (port 8888)
 Volume Servers: All 6 servers running (ports 8080-8085)
🔄 Admin/Workers: Will start when dependencies are ready

* generate write load

* tasks are assigned

* admin start wtih grpc port. worker has its own working directory

* Update .gitignore

* working worker and admin. Task detection is not working yet.

* compiles, detection uses volumeSizeLimitMB from master

* compiles

* worker retries connecting to admin

* build and restart

* rendering pending tasks

* skip task ID column

* sticky worker id

* test canScheduleTaskNow

* worker reconnect to admin

* clean up logs

* worker register itself first

* worker can run ec work and report status

but:
1. one volume should not be repeatedly worked on.
2. ec shards needs to be distributed and source data should be deleted.

* move ec task logic

* listing ec shards

* local copy, ec. Need to distribute.

* ec is mostly working now

* distribution of ec shards needs improvement
* need configuration to enable ec

* show ec volumes

* interval field UI component

* rename

* integration test with vauuming

* garbage percentage threshold

* fix warning

* display ec shard sizes

* fix ec volumes list

* Update ui.go

* show default values

* ensure correct default value

* MaintenanceConfig use ConfigField

* use schema defined defaults

* config

* reduce duplication

* refactor to use BaseUIProvider

* each task register its schema

* checkECEncodingCandidate use ecDetector

* use vacuumDetector

* use volumeSizeLimitMB

* remove

remove

* remove unused

* refactor

* use new framework

* remove v2 reference

* refactor

* left menu can scroll now

* The maintenance manager was not being initialized when no data directory was configured for persistent storage.

* saving config

* Update task_config_schema_templ.go

* enable/disable tasks

* protobuf encoded task configurations

* fix system settings

* use ui component

* remove logs

* interface{} Reduction

* reduce interface{}

* reduce interface{}

* avoid from/to map

* reduce interface{}

* refactor

* keep it DRY

* added logging

* debug messages

* debug level

* debug

* show the log caller line

* use configured task policy

* log level

* handle admin heartbeat response

* Update worker.go

* fix EC rack and dc count

* Report task status to admin server

* fix task logging, simplify interface checking, use erasure_coding constants

* factor in empty volume server during task planning

* volume.list adds disk id

* track disk id also

* fix locking scheduled and manual scanning

* add active topology

* simplify task detector

* ec task completed, but shards are not showing up

* implement ec in ec_typed.go

* adjust log level

* dedup

* implementing ec copying shards and only ecx files

* use disk id when distributing ec shards

🎯 Planning: ActiveTopology creates DestinationPlan with specific TargetDisk
📦 Task Creation: maintenance_integration.go creates ECDestination with DiskId
🚀 Task Execution: EC task passes DiskId in VolumeEcShardsCopyRequest
💾 Volume Server: Receives disk_id and stores shards on specific disk (vs.store.Locations[req.DiskId])
📂 File System: EC shards and metadata land in the exact disk directory planned

* Delete original volume from all locations

* clean up existing shard locations

* local encoding and distributing

* Update docker/admin_integration/EC-TESTING-README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* check volume id range

* simplify

* fix tests

* fix types

* clean up logs and tests

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-30 12:38:03 -07:00
chrislu
ae5bd0667a rename proto field from DestroyTime to expire_at_sec
For TTL volume converted into EC volume, this change may leave the volumes staying.
2024-10-24 21:35:11 -07:00
LHHDZ
4dc33cc143 fix unclaimed spaces calculation when volumePreallocate is enabled (#6063)
the calculation of `unclaimedSpaces` only needs to subtract `unusedSpace` when `preallocate` is not enabled.

Signed-off-by: LHHDZ <shichanglin5@qq.com>
2024-09-24 23:04:18 -07:00
Konstantin Lebedev
15965f7c54 [shell] fix volume grow in shell (#5992)
* fix volume grow in shell

* revert add Async

* check available volume space

* create a VolumeGrowRequest and remove unnecessary fields
2024-09-09 11:42:56 -07:00
chrislu
0cf2c15828 rename 2024-08-27 09:02:48 -07:00
augustazz
0b00706454 EC volume supports expiration and displays expiration message when executing volume.list (#5895)
* ec volume expire

* volume.list show DestroyTime

* comments

* code optimization

---------

Co-authored-by: xuwenfeng <xuwenfeng1@zto.com>
2024-08-16 00:20:00 -07:00
chrislu
010c5e91e3 add stream assign proto 2023-08-22 09:53:54 -07:00
chrislu
8ec1bc2c99 remove unused cluster node leader 2023-06-19 18:19:13 -07:00
Guo Lei
d8cfa1552b support enable/disable vacuum (#4087)
* stop vacuum

* suspend/resume vacuum

* remove unused code

* rename

* rename param
2022-12-28 01:36:44 -08:00
Konstantin Lebedev
36daa7709d show raft leader via shell (#3796) 2022-10-06 07:10:41 -07:00
chrislu
c8645fd232 master: implement grpc VolumeMarkWritable
fix https://github.com/seaweedfs/seaweedfs/issues/3657
2022-09-14 23:05:30 -07:00
chrislu
b9112747b5 volume server: synchronously report volume readonly status to master
fix https://github.com/seaweedfs/seaweedfs/issues/3628
2022-09-11 19:29:10 -07:00
Konstantin Lebedev
4d08393b7c filer prefer volume server in same data center (#3405)
* initial prefer same data center
https://github.com/seaweedfs/seaweedfs/issues/3404

* GetDataCenter

* prefer same data center for ReplicationSource

* GetDataCenterId

* remove glog
2022-08-04 17:35:00 -07:00
chrislu
26dbc6c905 move to https://github.com/seaweedfs/seaweedfs 2022-07-29 00:17:28 -07:00
chrislu
9f479aab98 allocate brokers to serve segments 2022-07-28 23:24:38 -07:00
chrislu
f25e273e32 display data center and rack in cluster.ps 2022-07-28 23:22:52 -07:00
chrislu
68065128b8 add dc and rack 2022-07-28 23:22:51 -07:00
chrislu
682382648e collect cluster node start time 2022-05-30 16:23:52 -07:00
guol-fnst
b12944f9c6 fix naming convention
notify volume server of duplicate directoris
improve searching efficiency
2022-05-17 15:41:49 +08:00
guol-fnst
de6aa9cce8 avoid duplicated volume directory 2022-05-16 19:33:51 +08:00
chrislu
94635e9b5c filer: add filer group 2022-05-01 21:59:16 -07:00
Konstantin Lebedev
1e35b4929f shell vacuum volume by collection and volume id 2022-04-18 18:40:58 +05:00
chrislu
b4be56bb3b add timing info during ping operation 2022-04-16 12:45:49 -07:00
Konstantin Lebedev
f5246b748d Merge branch 'new_master' into hashicorp_raft
# Conflicts:
#	weed/pb/master_pb/master.pb.go
2022-04-07 18:50:27 +05:00
Konstantin Lebedev
85d80fd36d fix removing old raft server 2022-04-07 15:31:37 +05:00
Konstantin Lebedev
357aa818fe add raft shell cmds 2022-04-06 15:23:53 +05:00
chrislu
bc888226fc erasure coding: tracking encoded/decoded volumes
If an EC shard is created but not spread to other servers, the masterclient would think this shard is not located here.
2022-04-05 19:03:02 -07:00
chrislu
bbbbbd70a4 master supports grpc ping 2022-04-01 16:50:58 -07:00
chrislu
a2d3f89c7b add lock messages 2021-12-10 13:24:38 -08:00
Chris Lu
e0fc2898e9 auto updated filer peer list 2021-11-06 14:23:35 -07:00
Chris Lu
4b9c42996a refactor grpc API 2021-11-05 18:11:40 -07:00
Chris Lu
5ea86ef1da Revert "master: rename grpc function KeepConnected() to SubscribeVolumeLocationUpdates()"
This reverts commit af71ae11aa.
2021-11-05 17:52:15 -07:00
Chris Lu
af71ae11aa master: rename grpc function KeepConnected() to SubscribeVolumeLocationUpdates() 2021-11-03 01:09:48 -07:00
Chris Lu
5160eb08f7 shell: optionally read filer address from master 2021-11-02 23:38:45 -07:00
Chris Lu
e5fc35ed0c change server address from string to a type 2021-09-12 22:47:52 -07:00
Chris Lu
e93d4935e3 add other replica locations when assigning volumes 2021-09-05 23:32:25 -07:00
Chris Lu
5a0f92423e use grpc and jwt 2021-08-12 21:40:33 -07:00
Chris Lu
5571f4f70a master: add master.follower to handle read file id lookup requests 2021-08-12 18:10:59 -07:00
Chris Lu
4370a4db63 use int64 for volume count in case of negative overflow 2021-08-08 15:19:39 -07:00
Chris Lu
f0ad172e80 shell: show which server holds the lock
fix https://github.com/chrislusf/seaweedfs/issues/1983
2021-04-22 23:56:35 -07:00
Chris Lu
f8446b42ab this can compile now!!! 2021-02-16 02:47:02 -08:00
Chris Lu
94525aa0fd allocate volume by disk type 2020-12-13 23:08:21 -08:00