Commit Graph

53 Commits

Author SHA1 Message Date
Chris Lu
4efe0acaf5 fix(master): fast resume state and default resumeState to true (#8925)
* fix(master): fast resume state and default resumeState to true

When resumeState is enabled in single-master mode, the raft server had
existing log entries so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.

Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.

Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.

* fix(master): prevent fastResume goroutine from hanging forever

Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.

* fix(master): use ticker instead of time.After in fastResume polling loop
2026-04-04 14:15:56 -07:00
Chris Lu
15f4a97029 fix: improve raft leader election reliability and failover speed (#8692)
* fix: clear raft vote state file on non-resume startup

The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.

Fixes #8690

* fix: recover TopologyId from snapshot before clearing raft state

When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.

Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.

* fix: stagger multi-master bootstrap delay by peer index

Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.

Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.

* fix: skip unreachable masters during leader reconnection

When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.

Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.

* fix: take snapshot after TopologyId generation for recovery

After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.

* test: add multi-master raft failover integration tests

Integration test framework and 5 test scenarios for 3-node master
clusters:

- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
  TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
  old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
  continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
  when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
  all nodes agree on TopologyId

* fix: address PR review comments

- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them

* fix: address review comments on failover tests

- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
  ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
2026-03-18 23:28:07 -07:00
Chris Lu
7b8df39cf7 s3api: add AttachUserPolicy/DetachUserPolicy/ListAttachedUserPolicies (#8379)
* iam: add XML responses for managed user policy APIs

* s3api: implement attach/detach/list attached user policies

* s3api: add embedded IAM tests for managed user policies

* iam: update CredentialStore interface and Manager for managed policies

Updated the `CredentialStore` interface to include `AttachUserPolicy`,
`DetachUserPolicy`, and `ListAttachedUserPolicies` methods.
The `CredentialManager` was updated to delegate these calls to the store.
Added common error variables for policy management.

* iam: implement managed policy methods in MemoryStore

Implemented `AttachUserPolicy`, `DetachUserPolicy`, and
`ListAttachedUserPolicies` in the MemoryStore.
Also ensured deep copying of identities includes PolicyNames.

* iam: implement managed policy methods in PostgresStore

Modified Postgres schema to include `policy_names` JSONB column in `users`.
Implemented `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`.
Updated user CRUD operations to handle policy names persistence.

* iam: implement managed policy methods in remaining stores

Implemented user policy management in:
- `FilerEtcStore` (partial implementation)
- `IamGrpcStore` (delegated via GetUser/UpdateUser)
- `PropagatingCredentialStore` (to broadcast updates)
Ensures cluster-wide consistency for policy attachments.

* s3api: refactor EmbeddedIamApi to use managed policy APIs

- Refactored `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`
  to use `e.credentialManager` directly.
- Fixed a critical error suppression bug in `ExecuteAction` that always
  returned success even on failure.
- Implemented robust error matching using string comparison fallbacks.
- Improved consistency by reloading configuration after policy changes.

* s3api: update and refine IAM integration tests

- Updated tests to use a real `MemoryStore`-backed `CredentialManager`.
- Refined test configuration synchronization using `sync.Once` and
  manual deep-copying to prevent state corruption.
- Improved `extractEmbeddedIamErrorCodeAndMessage` to handle more XML
  formats robustly.
- Adjusted test expectations to match current AWS IAM behavior.

* fix compilation

* visibility

* ensure 10 policies

* reload

* add integration tests

* Guard raft command registration

* Allow IAM actions in policy tests

* Validate gRPC policy attachments

* Revert Validate gRPC policy attachments

* Tighten gRPC policy attach/detach

* Improve IAM managed policy handling

* Improve managed policy filters
2026-02-19 12:26:27 -08:00
Chris Lu
753e1db096 Prevent split-brain: Persistent ClusterID and Join Validation (#8022)
* Prevent split-brain: Persistent ClusterID and Join Validation

- Persist ClusterId in Raft store to survive restarts.
- Validate ClusterId on Raft command application (piggybacked on MaxVolumeId).
- Prevent masters with conflicting ClusterIds from joining/operating together.
- Update Telemetry to report the persistent ClusterId.

* Refine ClusterID validation based on feedback

- Improved error message in cluster_commands.go.
- Added ClusterId mismatch check in RaftServer.Recovery.

* Handle Raft errors and support Hashicorp Raft for ClusterId

- Check for errors when persisting ClusterId in legacy Raft.
- Implement ClusterId generation and persistence for Hashicorp Raft leader changes.
- Ensure consistent error logging.

* Refactor ClusterId validation

- Centralize ClusterId mismatch check in Topology.SetClusterId.
- Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId.

* Fix goroutine leak and add timeout

- Handle channel closure in Hashicorp Raft leader listener.
- Add timeout to Raft Apply call to prevent blocking.

* Fix deadlock in legacy Raft listener

- Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock).

* Rename ClusterId to SystemId

- Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry).
- Regenerated telemetry.pb.go with new field.

* Rename SystemId to TopologyId

- Rename to SystemId was intermediate step.
- Final name is TopologyId for the persistent cluster identifier.
- Updated protobuf, topology, raft server, master server, and telemetry.

* Optimize Hashicorp Raft listener

- Integrated TopologyId generation into existing monitorLeaderLoop.
- Removed extra goroutine in master_server.go.

* Fix optimistic TopologyId update

- Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go.
- State is now solely updated via the Raft state machine Apply/Restore methods after consensus.

* Add explicit log for recovered TopologyId

- Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup.

* Add Raft barrier to prevent TopologyId race condition

- Implement ensureTopologyId helper method
- Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId
- Ensures persisted TopologyId is recovered before generating new one
- Prevents race where generation happens during log replay

* Serialize TopologyId generation with mutex

- Add topologyIdGenLock mutex to MasterServer struct
- Wrap ensureTopologyId method with lock to prevent concurrent generation
- Fixes race where event listener and manual leadership check both generate IDs
- Second caller waits for first to complete and sees the generated ID

* Add TopologyId recovery logging to Apply method

- Change log level from V(1) to V(0) for visibility
- Log 'Recovered TopologyId' when applying from Raft log
- Ensures recovery is visible whether from snapshot or log replay
- Matches Recovery() method logging for consistency

* Fix Raft barrier timing issue

- Add 100ms delay after barrier command to ensure log application completes
- Add debug logging to track barrier execution and TopologyId state
- Return early if barrier command fails
- Prevents TopologyId generation before old logs are fully applied

* ensure leader

* address comments

* address comments

* redundant

* clean up

* double check

* refactoring

* comment
2026-01-18 14:02:34 -08:00
Neo
aa61824442 master:fix empty target in Build() (#6069) 2024-09-27 00:45:18 -07:00
chrislu
a4b25a642d math/rand => math/rand/v2 2024-08-29 09:52:21 -07:00
wyang
4b1f539ab8 fix allocate reduplicated volumeId to different volume (#5811)
* fix allocate reduplicated volumeId to different volume

* only check barrier when read

---------

Co-authored-by: Yang Wang <yangwang@weride.ai>
2024-07-26 21:48:36 -07:00
chrislu
d999f1f0e2 update raft version
fix #4460
2023-05-09 22:54:23 -07:00
chrislu
4511edc871 update raft 2023-05-07 13:33:44 -07:00
Zachary Walters
ef2f741823 Updated the deprecated ioutil dependency (#4239) 2023-02-21 19:47:33 -08:00
chrislu
26dbc6c905 move to https://github.com/seaweedfs/seaweedfs 2022-07-29 00:17:28 -07:00
chrislu
3828b8ce87 "github.com/chrislusf/raft" => "github.com/seaweedfs/raft" 2022-07-27 12:12:40 -07:00
Konstantin Lebedev
c1c8dad677 avoid no such raft date directory
https://github.com/chrislusf/seaweedfs/issues/3214
2022-06-21 13:47:51 +05:00
Konstantin Lebedev
1a1e5778c3 fix cluster status 2022-04-04 18:52:08 +05:00
Konstantin Lebedev
14dd971890 hashicorp raft with state machine 2022-04-04 17:51:51 +05:00
Konstantin Lebedev
c514710b7b initial add hashicorp raft 2022-04-04 13:50:56 +05:00
Konstantin Lebedev
84b7b83517 fix permission mkdir snapshot
avoid open file operation not permitted
2022-03-28 18:41:52 +05:00
chrislu
4ba7127ab1 refactor 2022-03-26 13:13:19 -07:00
chrislu
fba1cfc2d6 simplify a bit 2022-03-26 10:24:05 -07:00
chrislu
a3411dd9da refactor 2022-03-26 10:21:26 -07:00
Konstantin Lebedev
ddd3945c26 fix remove deleted peers of raft server
https://github.com/chrislusf/seaweedfs/issues/2804
2022-03-25 15:09:38 +05:00
Konstantin Lebedev
c1450bf9fe always clear previous log to avoid server is promotable
https://github.com/chrislusf/seaweedfs/issues/2804
2022-03-25 13:40:19 +05:00
garenchan
bd032eabe7 [UPDATE] Make heartbeat interval and election timeout of masters configurable. 2022-02-14 21:09:07 +08:00
Chris Lu
e5fc35ed0c change server address from string to a type 2021-09-12 22:47:52 -07:00
Chris Lu
1b17f71939 adjust election timeout to 10 seconds 2020-10-23 23:06:44 -07:00
Chris Lu
da4edf3651 master: check peers for existing leader before starting a leader election
fix https://github.com/chrislusf/seaweedfs/issues/1509
2020-10-07 01:25:39 -07:00
Устюжанин Антон Александрович
702b1cb876 fix: remove deleted peers if resumeState = true 2020-10-04 21:56:17 +05:00
Устюжанин Антон Александрович
dc31b19469 fix: restore raft state 2020-10-03 14:03:41 +05:00
Устюжанин Антон Александрович
8c82fb7e5f fix: restore raft state 2020-10-02 23:01:20 +05:00
Chris Lu
044841c885 master: always clear previous master meta data directory 2020-06-19 20:42:16 -07:00
Chris Lu
f90c43635d refactoring 2020-03-04 00:39:47 -08:00
Lei Liu
c2884cace2 misc updated
Signed-off-by: Lei Liu <lei01.liu@horizon.ai>
2019-10-29 21:28:28 +08:00
Chris Lu
da871896c3 weed filer: set grpc port to port + 10000 2019-03-19 05:47:41 -07:00
Chris Lu
07af52cb6f raft change from http to grpc
master grpc port is fixed to http port + 10000
2019-02-18 22:38:14 -08:00
Chris Lu
7103c1ab7e go fmt 2019-02-15 00:09:48 -08:00
Chris Lu
8afc632484 raft: use the first master to bootstrap the election 2019-02-09 12:52:09 -08:00
Chris Lu
1334507595 Revert "randomize based on self address"
This reverts commit 6230eb28a6.
2019-01-28 12:12:51 -08:00
Chris Lu
6230eb28a6 randomize based on self address
fix #851
2019-01-28 11:55:33 -08:00
Chris Lu
221105eea3 Revert "use the first entry to bootstrap master cluster"
This reverts commit 40c8725ffa.
2019-01-28 11:46:46 -08:00
Chris Lu
40c8725ffa use the first entry to bootstrap master cluster
fix https://github.com/chrislusf/seaweedfs/issues/851
2019-01-28 10:35:28 -08:00
Chris Lu
834f414af9 add a timeout 2019-01-26 00:15:42 -06:00
Chris Lu
3f56b12ed4 raft: adding idle connection time out
another attempt to fix https://github.com/chrislusf/seaweedfs/issues/825
2019-01-22 09:25:25 -08:00
Chris Lu
1d103e3ed5 timeout http connections
possible fix for https://github.com/chrislusf/seaweedfs/issues/825
2019-01-17 23:38:33 -08:00
Chris Lu
75d63db60d randomize raft server startup
also some go fmt
2018-08-12 14:27:14 -07:00
Chris Lu
03f50180f3 simplifying the leader election by raft
fixing https://github.com/chrislusf/seaweedfs/issues/629
2018-06-12 01:54:09 -07:00
Chris Lu
458ada173e go fmt 2018-05-27 11:52:26 -07:00
Chris Lu
70f6740309 better fix for single master restart without peers changing 2017-12-06 00:14:14 -08:00
Chris Lu
9026b3e86e always remember the max volume id 2017-11-28 17:08:59 -08:00
Chris Lu
72e89b615b 301 is reported as 404 for http post
fix https://github.com/chrislusf/seaweedfs/issues/512
2017-06-15 21:21:32 -07:00
Chris Lu
c8f54aad8b adjust timing of leader election 2017-01-18 09:54:43 -08:00