* fix(master): fast resume state and default resumeState to true
When resumeState is enabled in single-master mode, the raft server had
existing log entries so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.
Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.
Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.
* fix(master): prevent fastResume goroutine from hanging forever
Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.
* fix(master): use ticker instead of time.After in fastResume polling loop
* fix: clear raft vote state file on non-resume startup
The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.
Fixes#8690
* fix: recover TopologyId from snapshot before clearing raft state
When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.
Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.
* fix: stagger multi-master bootstrap delay by peer index
Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.
Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.
* fix: skip unreachable masters during leader reconnection
When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.
Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.
* fix: take snapshot after TopologyId generation for recovery
After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.
* test: add multi-master raft failover integration tests
Integration test framework and 5 test scenarios for 3-node master
clusters:
- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
all nodes agree on TopologyId
* fix: address PR review comments
- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them
* fix: address review comments on failover tests
- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
* iam: add XML responses for managed user policy APIs
* s3api: implement attach/detach/list attached user policies
* s3api: add embedded IAM tests for managed user policies
* iam: update CredentialStore interface and Manager for managed policies
Updated the `CredentialStore` interface to include `AttachUserPolicy`,
`DetachUserPolicy`, and `ListAttachedUserPolicies` methods.
The `CredentialManager` was updated to delegate these calls to the store.
Added common error variables for policy management.
* iam: implement managed policy methods in MemoryStore
Implemented `AttachUserPolicy`, `DetachUserPolicy`, and
`ListAttachedUserPolicies` in the MemoryStore.
Also ensured deep copying of identities includes PolicyNames.
* iam: implement managed policy methods in PostgresStore
Modified Postgres schema to include `policy_names` JSONB column in `users`.
Implemented `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`.
Updated user CRUD operations to handle policy names persistence.
* iam: implement managed policy methods in remaining stores
Implemented user policy management in:
- `FilerEtcStore` (partial implementation)
- `IamGrpcStore` (delegated via GetUser/UpdateUser)
- `PropagatingCredentialStore` (to broadcast updates)
Ensures cluster-wide consistency for policy attachments.
* s3api: refactor EmbeddedIamApi to use managed policy APIs
- Refactored `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`
to use `e.credentialManager` directly.
- Fixed a critical error suppression bug in `ExecuteAction` that always
returned success even on failure.
- Implemented robust error matching using string comparison fallbacks.
- Improved consistency by reloading configuration after policy changes.
* s3api: update and refine IAM integration tests
- Updated tests to use a real `MemoryStore`-backed `CredentialManager`.
- Refined test configuration synchronization using `sync.Once` and
manual deep-copying to prevent state corruption.
- Improved `extractEmbeddedIamErrorCodeAndMessage` to handle more XML
formats robustly.
- Adjusted test expectations to match current AWS IAM behavior.
* fix compilation
* visibility
* ensure 10 policies
* reload
* add integration tests
* Guard raft command registration
* Allow IAM actions in policy tests
* Validate gRPC policy attachments
* Revert Validate gRPC policy attachments
* Tighten gRPC policy attach/detach
* Improve IAM managed policy handling
* Improve managed policy filters
* Prevent split-brain: Persistent ClusterID and Join Validation
- Persist ClusterId in Raft store to survive restarts.
- Validate ClusterId on Raft command application (piggybacked on MaxVolumeId).
- Prevent masters with conflicting ClusterIds from joining/operating together.
- Update Telemetry to report the persistent ClusterId.
* Refine ClusterID validation based on feedback
- Improved error message in cluster_commands.go.
- Added ClusterId mismatch check in RaftServer.Recovery.
* Handle Raft errors and support Hashicorp Raft for ClusterId
- Check for errors when persisting ClusterId in legacy Raft.
- Implement ClusterId generation and persistence for Hashicorp Raft leader changes.
- Ensure consistent error logging.
* Refactor ClusterId validation
- Centralize ClusterId mismatch check in Topology.SetClusterId.
- Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId.
* Fix goroutine leak and add timeout
- Handle channel closure in Hashicorp Raft leader listener.
- Add timeout to Raft Apply call to prevent blocking.
* Fix deadlock in legacy Raft listener
- Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock).
* Rename ClusterId to SystemId
- Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry).
- Regenerated telemetry.pb.go with new field.
* Rename SystemId to TopologyId
- Rename to SystemId was intermediate step.
- Final name is TopologyId for the persistent cluster identifier.
- Updated protobuf, topology, raft server, master server, and telemetry.
* Optimize Hashicorp Raft listener
- Integrated TopologyId generation into existing monitorLeaderLoop.
- Removed extra goroutine in master_server.go.
* Fix optimistic TopologyId update
- Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go.
- State is now solely updated via the Raft state machine Apply/Restore methods after consensus.
* Add explicit log for recovered TopologyId
- Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup.
* Add Raft barrier to prevent TopologyId race condition
- Implement ensureTopologyId helper method
- Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId
- Ensures persisted TopologyId is recovered before generating new one
- Prevents race where generation happens during log replay
* Serialize TopologyId generation with mutex
- Add topologyIdGenLock mutex to MasterServer struct
- Wrap ensureTopologyId method with lock to prevent concurrent generation
- Fixes race where event listener and manual leadership check both generate IDs
- Second caller waits for first to complete and sees the generated ID
* Add TopologyId recovery logging to Apply method
- Change log level from V(1) to V(0) for visibility
- Log 'Recovered TopologyId' when applying from Raft log
- Ensures recovery is visible whether from snapshot or log replay
- Matches Recovery() method logging for consistency
* Fix Raft barrier timing issue
- Add 100ms delay after barrier command to ensure log application completes
- Add debug logging to track barrier execution and TopologyId state
- Return early if barrier command fails
- Prevents TopologyId generation before old logs are fully applied
* ensure leader
* address comments
* address comments
* redundant
* clean up
* double check
* refactoring
* comment