Commit Graph

113 Commits

Author SHA1 Message Date
Chris Lu
15f4a97029 fix: improve raft leader election reliability and failover speed (#8692)
* fix: clear raft vote state file on non-resume startup

The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.

Fixes #8690

* fix: recover TopologyId from snapshot before clearing raft state

When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.

Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.

* fix: stagger multi-master bootstrap delay by peer index

Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.

Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.

* fix: skip unreachable masters during leader reconnection

When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.

Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.

* fix: take snapshot after TopologyId generation for recovery

After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.

* test: add multi-master raft failover integration tests

Integration test framework and 5 test scenarios for 3-node master
clusters:

- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
  TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
  old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
  continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
  when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
  all nodes agree on TopologyId

* fix: address PR review comments

- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them

* fix: address review comments on failover tests

- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
  ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
2026-03-18 23:28:07 -07:00
Chris Lu
ba8e2aaae9 Fix master leader election when grpc ports change (#8272)
* Fix master leader detection when grpc ports change

* Canonicalize self peer entry to avoid raft self-alias panic

* Normalize and deduplicate master peer addresses
2026-02-09 18:13:02 -08:00
Chris Lu
c106532b79 fix: prevent MiniClusterCtx race conditions in command shutdown
Capture global MiniClusterCtx into local variables before goroutine/select
evaluation to prevent nil dereference/data race when context is reset to nil
after nil check. Applied to filer, master, volume, and s3 commands.
2026-01-28 19:42:16 -08:00
Chris Lu
01c17478ae command: implement graceful shutdown for mini cluster
- Introduce MiniClusterCtx to coordinate shutdown across mini services
- Update Master, Volume, Filer, S3, and WebDAV servers to respect context cancellation
- Ensure all resources are cleaned up properly during test teardown
- Integrate MiniClusterCtx in s3tables integration tests
2026-01-28 10:36:19 -08:00
Chris Lu
ed1da07665 Add consistent -debug and -debug.port flags to commands (#7816)
* Add consistent -debug and -debug.port flags to commands

Add -debug and -debug.port flags to weed master, weed volume, weed s3,
weed mq.broker, and weed filer.sync commands for consistency with
weed filer.

When -debug is enabled, an HTTP server starts on the specified port
(default 6060) serving runtime profiling data at /debug/pprof/.

For mq.broker, replaced the older -port.pprof flag with the new
-debug and -debug.port pattern for consistency.

* Update weed/util/grace/pprof.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-18 17:44:36 -08:00
Chris Lu
f096b067fd weed master add peers=none option for faster startup (#7419)
* weed master -peers=none

* single master mode only when peers is none

* refactoring

* revert duplicated code

* revert

* Update weed/command/master.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* preventing "none" passed to other components if master is not started

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-31 18:29:16 -07:00
Chris Lu
5ab49e2971 Adjust cli option (#7418)
* adjust "weed benchmark" CLI to use readOnly/writeOnly

* consistently use "-master" CLI option

* If both -readOnly and -writeOnly are specified, the current logic silently allows it with -writeOnly taking precedence. This is confusing and could lead to unexpected behavior.
2025-10-31 17:08:00 -07:00
zuzuviewer
8fa1a69f8c * Fix undefined http serve behaiver (#6943) 2025-07-07 22:48:12 -07:00
chrislu
166e36bcd3 use telemetry.seaweedfs.com 2025-06-28 19:48:03 -07:00
Chris Lu
a1aab8a083 add telemetry (#6926)
* add telemetry

* fix go mod

* add default telemetry server url

* Update README.md

* replace with broker count instead of s3 count

* Update telemetry.pb.go

* github action to deploy
2025-06-28 14:11:55 -07:00
chrislu
ab49540d2b use master.toml value if not empty
fix https://github.com/seaweedfs/seaweedfs/issues/6922
2025-06-25 17:54:56 -07:00
chrislu
bd4891a117 change version directory 2025-06-03 22:46:10 -07:00
Ethan Mosbaugh
9ebc132ffc fix: bucket-hook fails with gnu wget (#6521) 2025-02-06 23:11:17 -08:00
chrislu
ec155022e7 "golang.org/x/exp/slices" => "slices" and go fmt 2024-12-19 19:25:06 -08:00
zouyixiong
881c9a009e [master] master missing start LoopPushingMetric routine fixed. (#6018) 2024-09-13 20:01:34 -07:00
chrislu
4463296811 add parallel vacuuming 2024-08-21 22:53:54 -07:00
vadimartynov
b796c21fa9 Added loadSecurityConfigOnce (#5792) 2024-07-16 09:15:55 -07:00
vadimartynov
ec9e7493b3 -metricsIp cmd flag (#5773)
* Added/Updated:
- Added metrics ip options for all servers;
- Fixed a bug with the selection of the binIp or ip parameter for the metrics handler;

* Fixed cmd flags
2024-07-12 10:56:26 -07:00
Konstantin Lebedev
5ffacbb6ea refactor all methods strings to const (#5726) 2024-07-01 01:00:39 -07:00
vadimartynov
8aae82dd71 Added context for the MasterClient's methods to avoid endless loops (#5628)
* Added context for the MasterClient's methods to avoid endless loops

* Returned WithClient function. Added WithClientCustomGetMaster function

* Hid unused ctx arguments

* Using a common context for the KeepConnectedToMaster and WaitUntilConnected functions

* Changed the context termination check in the tryConnectToMaster function

* Added a child context to the tryConnectToMaster function

* Added a common context for KeepConnectedToMaster and WaitUntilConnected functions in benchmark
2024-06-14 11:40:34 -07:00
chrislu
645ae8c57b Revert "Revert "Merge branch 'master' of https://github.com/seaweedfs/seaweedfs""
This reverts commit 8cb42c39
2023-09-25 09:35:16 -07:00
chrislu
8cb42c39ad Revert "Merge branch 'master' of https://github.com/seaweedfs/seaweedfs"
This reverts commit 2e5aa06026, reversing
changes made to 4d414f54a2.
2023-09-18 16:12:50 -07:00
dependabot[bot]
a04bd4d26f Bump github.com/rclone/rclone from 1.63.1 to 1.64.0 (#4850)
* Bump github.com/rclone/rclone from 1.63.1 to 1.64.0

Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.63.1 to 1.64.0.
- [Release notes](https://github.com/rclone/rclone/releases)
- [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md)
- [Commits](https://github.com/rclone/rclone/compare/v1.63.1...v1.64.0)

---
updated-dependencies:
- dependency-name: github.com/rclone/rclone
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* API changes

* go mod

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: chrislu <chris.lu@gmail.com>
2023-09-18 14:43:05 -07:00
Stewart Miles
dd71f54c6b Fix -raftHashicorp and -raftBootstrap flag propagation. (#4309)
`weed server` was not correctly propagating
`-master.raftHashicorp` and `-master.raftBootstrap` flags when
starting the master server.

Related to #4307
2023-03-15 13:03:20 -07:00
Jiffs Maverick
4b0430e71d [metrics] Add the ability to control bind ip (#4012) 2022-11-24 10:22:59 -08:00
Konstantin Lebedev
401315f337 master fix interruption through ctrl+c (#3834) 2022-10-12 07:18:40 -07:00
Konstantin Lebedev
b9933d5589 master server graceful stop (#3797) 2022-10-06 09:30:30 -07:00
Ryan Russell
8efe1db01a refactor(various): Listner -> Listener readability improvements (#3672)
* refactor(net_timeout): `listner` -> `listener`

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* refactor(s3): `s3ApiLocalListner` -> `s3ApiLocalListener`

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* refactor(filer): `localPublicListner` -> `localPublicListener`

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* refactor(command): `masterLocalListner` -> `masterLocalListener`

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* refactor(net_timeout): `ipListner` -> `ipListener`

Signed-off-by: Ryan Russell <git@ryanrussell.org>

Signed-off-by: Ryan Russell <git@ryanrussell.org>
2022-09-14 11:59:55 -07:00
Patrick Schmidt
7b424a54dc Add raft server access mutex to avoid races (#3503) 2022-08-24 09:49:05 -07:00
chrislu
67814a5c79 refactor and fix strings.Split 2022-08-07 01:34:32 -07:00
Patrick Schmidt
1a4a36d510 Add healthy indicator for raft status 2022-07-30 19:34:26 +02:00
chrislu
26dbc6c905 move to https://github.com/seaweedfs/seaweedfs 2022-07-29 00:17:28 -07:00
chrislu
3828b8ce87 "github.com/chrislusf/raft" => "github.com/seaweedfs/raft" 2022-07-27 12:12:40 -07:00
chrislu
492da3dbce master: put metadata under instance specific folder 2022-06-20 19:04:49 -07:00
leyou240
89eb87c1d1 Merge branch 'master' into slices.SortFunc 2022-04-18 10:39:29 +08:00
justin
3551ca2fcf enhancement: replace sort.Slice with slices.SortFunc to reduce reflection 2022-04-18 10:35:43 +08:00
Konstantin Lebedev
17c6e8e39f Merge branch 'new_master' into hashicorp_raft
# Conflicts:
#	go.mod
#	go.sum
2022-04-05 13:29:46 +05:00
Konstantin Lebedev
622297f1a7 add stats raft handler 2022-04-04 19:16:06 +05:00
Konstantin Lebedev
14dd971890 hashicorp raft with state machine 2022-04-04 17:51:51 +05:00
Konstantin Lebedev
c514710b7b initial add hashicorp raft 2022-04-04 13:50:56 +05:00
chrislu
daca2d22a5 use original server address string as map key 2022-04-01 17:34:42 -07:00
chrislu
21e0898631 refactor: change masters from a slice to a map 2022-03-26 13:33:17 -07:00
chrislu
4ba7127ab1 refactor 2022-03-26 13:13:19 -07:00
Berck Nash
7ee38fa3a4 The fixes for https://github.com/chrislusf/seaweedfs/issues/1937 had a few problems:
(1) The help file says that in the absence of a ipBind being specified, that it will bind to the "ip" specified.  Instead, it bound to localhost which broke the default configuration.  This change implements the documented behavior instead.

(2) The new IAM filer ip address has no default. This instantiates it to the same as the filer IP.  I'm not sure if there should be a corresponding iam.ip or iam.ipBind option added to the filer command?
2022-03-17 15:30:23 -06:00
Berck Nash
9b14f0c81a Add mTLS support for both master and volume http server. 2022-03-16 09:52:17 -06:00
chrislu
3639cad69c master, filer, s3: also listen to "localhost" in addition to specific ip address
related to https://github.com/chrislusf/seaweedfs/issues/1937
2022-03-15 22:28:18 -07:00
chrislu
3a6eb8ca5f default bind to one ip address
fix https://github.com/chrislusf/seaweedfs/issues/1937
2022-03-11 14:02:39 -08:00
garenchan
bd032eabe7 [UPDATE] Make heartbeat interval and election timeout of masters configurable. 2022-02-14 21:09:07 +08:00
Konstantin Lebedev
77c98b657e master metricsHttpPort 2022-01-19 21:43:22 +05:00
Chris Lu
5ea86ef1da Revert "master: rename grpc function KeepConnected() to SubscribeVolumeLocationUpdates()"
This reverts commit af71ae11aa.
2021-11-05 17:52:15 -07:00