Commit Graph

145 Commits

Author SHA1 Message Date
Chris Lu
b3620c7e14 admin: auto migrating master maintenance scripts to admin_script plugin config (#8509)
* admin: seed admin_script plugin config from master maintenance scripts

When the admin server starts, fetch the maintenance scripts configuration
from the master via GetMasterConfiguration. If the admin_script plugin
worker does not already have a saved config, use the master's scripts as
the default value. This enables seamless migration from master.toml
[master.maintenance] to the admin script plugin worker.

Changes:
- Add maintenance_scripts and maintenance_sleep_minutes fields to
  GetMasterConfigurationResponse in master.proto
- Populate the new fields from viper config in master_grpc_server.go
- On admin server startup, fetch the master config and seed the
  admin_script plugin config if no config exists yet
- Strip lock/unlock commands from the master scripts since the admin
  script worker handles locking automatically

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address review comments on admin_script seeding

- Replace TOCTOU race (separate Load+Save) with atomic
  SaveJobTypeConfigIfNotExists on ConfigStore and Plugin
- Replace ineffective polling loop with single GetMaster call using
  30s context timeout, since GetMaster respects context cancellation
- Add unit tests for SaveJobTypeConfigIfNotExists (in-memory + on-disk)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: apply maintenance script defaults in gRPC handler

The gRPC handler for GetMasterConfiguration read maintenance scripts
from viper without calling SetDefault, relying on startAdminScripts
having run first. If the admin server calls GetMasterConfiguration
before startAdminScripts sets the defaults, viper returns empty
strings and the seeding is silently skipped.

Apply SetDefault in the gRPC handler itself so it is self-contained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix: apply maintenance script defaults in gRPC handler"

This reverts commit 068a5063303f6bc34825a07bb681adfa67e6f9de.

* fix: use atomic save in ensureJobTypeConfigFromDescriptor

ensureJobTypeConfigFromDescriptor used a separate Load + Save, racing
with seedAdminScriptFromMaster. If the descriptor defaults (empty
script) were saved first, SaveJobTypeConfigIfNotExists in the seeding
goroutine would see an existing config and skip, losing the master's
maintenance scripts.

Switch to SaveJobTypeConfigIfNotExists so both paths are atomic. Whichever
wins, the other is a safe no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: fetch master scripts inline during config bootstrap, not in goroutine

Replace the seedAdminScriptFromMaster goroutine with a
ConfigDefaultsProvider callback. When the plugin bootstraps
admin_script defaults from the worker descriptor, it calls the
provider which fetches maintenance scripts from the master
synchronously. This eliminates the race between the seeding
goroutine and the descriptor-based config bootstrap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* skip commented lock unlock

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* reduce grpc calls

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-04 22:11:07 -08:00
Chris Lu
45ce18266a Disable master maintenance scripts when admin server runs (#8499)
* Disable master maintenance scripts when admin server runs

* Stop defaulting master maintenance scripts

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Clarify master scripts are disabled by default

* Skip master maintenance scripts when admin server is connected

* Restore default master maintenance scripts

* Document admin server skip for master maintenance scripts

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 00:40:40 -08:00
Chris Lu
57ab99d13e fix: generate topology uuid uniformly in single-master mode (#8405)
* fix: ensure topology uuid is generated in single master setups

* ensureTopologyId adds a Hashicorp-aware implementation

* simplify
2026-02-22 23:45:48 -08:00
Chris Lu
3300874cb5 filer: add default log purging to master maintenance scripts (#8359)
* filer: add default log purging to master maintenance scripts

* filer: fix default maintenance scripts to include full set of tasks

* filer: refactor maintenance scripts to avoid duplication
2026-02-16 16:58:15 -08:00
Chris Lu
cb9e21cdc5 Normalize hashicorp raft peer ids (#8253)
* Normalize raft voter ids

* 4.11

* Update raft_hashicorp.go
2026-02-09 07:46:34 -08:00
Chris Lu
753e1db096 Prevent split-brain: Persistent ClusterID and Join Validation (#8022)
* Prevent split-brain: Persistent ClusterID and Join Validation

- Persist ClusterId in Raft store to survive restarts.
- Validate ClusterId on Raft command application (piggybacked on MaxVolumeId).
- Prevent masters with conflicting ClusterIds from joining/operating together.
- Update Telemetry to report the persistent ClusterId.

* Refine ClusterID validation based on feedback

- Improved error message in cluster_commands.go.
- Added ClusterId mismatch check in RaftServer.Recovery.

* Handle Raft errors and support Hashicorp Raft for ClusterId

- Check for errors when persisting ClusterId in legacy Raft.
- Implement ClusterId generation and persistence for Hashicorp Raft leader changes.
- Ensure consistent error logging.

* Refactor ClusterId validation

- Centralize ClusterId mismatch check in Topology.SetClusterId.
- Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId.

* Fix goroutine leak and add timeout

- Handle channel closure in Hashicorp Raft leader listener.
- Add timeout to Raft Apply call to prevent blocking.

* Fix deadlock in legacy Raft listener

- Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock).

* Rename ClusterId to SystemId

- Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry).
- Regenerated telemetry.pb.go with new field.

* Rename SystemId to TopologyId

- Rename to SystemId was intermediate step.
- Final name is TopologyId for the persistent cluster identifier.
- Updated protobuf, topology, raft server, master server, and telemetry.

* Optimize Hashicorp Raft listener

- Integrated TopologyId generation into existing monitorLeaderLoop.
- Removed extra goroutine in master_server.go.

* Fix optimistic TopologyId update

- Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go.
- State is now solely updated via the Raft state machine Apply/Restore methods after consensus.

* Add explicit log for recovered TopologyId

- Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup.

* Add Raft barrier to prevent TopologyId race condition

- Implement ensureTopologyId helper method
- Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId
- Ensures persisted TopologyId is recovered before generating new one
- Prevents race where generation happens during log replay

* Serialize TopologyId generation with mutex

- Add topologyIdGenLock mutex to MasterServer struct
- Wrap ensureTopologyId method with lock to prevent concurrent generation
- Fixes race where event listener and manual leadership check both generate IDs
- Second caller waits for first to complete and sees the generated ID

* Add TopologyId recovery logging to Apply method

- Change log level from V(1) to V(0) for visibility
- Log 'Recovered TopologyId' when applying from Raft log
- Ensures recovery is visible whether from snapshot or log replay
- Matches Recovery() method logging for consistency

* Fix Raft barrier timing issue

- Add 100ms delay after barrier command to ensure log application completes
- Add debug logging to track barrier execution and TopologyId state
- Return early if barrier command fails
- Prevents TopologyId generation before old logs are fully applied

* ensure leader

* address comments

* address comments

* redundant

* clean up

* double check

* refactoring

* comment
2026-01-18 14:02:34 -08:00
Chris Lu
07dc552e1c master: Fix raft url (#7255)
* fix signature

* fix url scheme
2025-09-18 14:46:53 -07:00
Dmitriy Pavlov
cd78e653e1 add disable volume_growth flag (#7196) 2025-09-04 05:39:56 -07:00
Chris Lu
e446234e9c remove spoof-able request header (#7103)
* remove spoof-able request header

https://github.com/seaweedfs/seaweedfs/issues/7094#issuecomment-3158320497

* Update weed/security/guard.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-06 10:08:30 -07:00
Chris Lu
0703308270 remote address parsing should handle special cases (#7101)
* remote address parsing should handle special cases

* handling ipv6

* simplify

* Update weed/security/guard.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update weed/security/guard.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* x-real-ip

* Update guard.go

* fixes

 Hostname Whitelisting: Fully restored - supports localhost, example.com, etc.
 IP Whitelisting: Still works - supports exact IPs and CIDR ranges
 Header Support: Consistent handling of X-Forwarded-For, X-Real-IP

* simplify

* Update weed/security/guard.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update weed/security/guard.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update guard.go

* adjust function signature

* Update weed/security/guard.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* indention

* skip empty host

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-06 01:03:00 -07:00
chrislu
798f797158 use float for sleep seconds
fix https://github.com/seaweedfs/seaweedfs/pull/6795
2025-07-06 14:16:41 -07:00
chrislu
1733d0ce68 remove features and deployments fields 2025-06-28 20:03:06 -07:00
Chris Lu
a1aab8a083 add telemetry (#6926)
* add telemetry

* fix go mod

* add default telemetry server url

* Update README.md

* replace with broker count instead of s3 count

* Update telemetry.pb.go

* github action to deploy
2025-06-28 14:11:55 -07:00
Aleksey Kosov
5182d46e22 Added middleware for processing request_id grpc and http requests (#6805) 2025-05-21 07:57:39 -07:00
Lisandro Pin
fc4df944a0 Remove rate limit semaphore on master's leader selection logic. (#6494)
This was introduced by 054374c7 (2024-03-12) and serves no practical purpose,
yet it caps the maximum QPS master servers can handle.
2025-01-30 13:08:36 -08:00
Konstantin Lebedev
b65eb2ec45 [security] reload whiteList on http seerver (#6302)
* reload whiteList

* white_list add to scaffold
2024-12-02 10:38:10 -08:00
Konstantin Lebedev
fec88e64eb [master] update LastLeaderChangeTime for hashicorp raft (#6292) 2024-11-26 08:02:45 -08:00
chrislu
ccf1795e6f wait a bit before getting the next volume id if the leader is recently elected 2024-11-23 19:58:45 -08:00
chrislu
6564ceda91 skip resource heavy commands from running on master nodes 2024-09-29 10:51:17 -07:00
chrislu
4463296811 add parallel vacuuming 2024-08-21 22:53:54 -07:00
Riccardo Bertossa
6fe8639504 add http endpoint to get the size of a collection (#5910) 2024-08-19 07:44:45 -07:00
wyang
4b1f539ab8 fix allocate reduplicated volumeId to different volume (#5811)
* fix allocate reduplicated volumeId to different volume

* only check barrier when read

---------

Co-authored-by: Yang Wang <yangwang@weride.ai>
2024-07-26 21:48:36 -07:00
vadimartynov
86d92a42b4 Added tls for http clients (#5766)
* Added global http client

* Added Do func for global http client

* Changed the code to use the global http client

* Fix http client in volume uploader

* Fixed pkg name

* Fixed http util funcs

* Fixed http client for bench_filer_upload

* Fixed http client for stress_filer_upload

* Fixed http client for filer_server_handlers_proxy

* Fixed http client for command_fs_merge_volumes

* Fixed http client for command_fs_merge_volumes and command_volume_fsck

* Fixed http client for s3api_server

* Added init global client for main funcs

* Rename global_client to client

* Changed:
- fixed NewHttpClient;
- added CheckIsHttpsClientEnabled func
- updated security.toml in scaffold

* Reduce the visibility of some functions in the util/http/client pkg

* Added the loadSecurityConfig function

* Use util.LoadSecurityConfiguration() in NewHttpClient func
2024-07-16 23:14:09 -07:00
Konstantin Lebedev
67edf1d014 [master] Do Automatic Volume Grow in background (#5781)
* Do Automatic Volume Grow in backgound

* pass lastGrowCount to master

* fix build

* fix count to uint64
2024-07-16 08:03:40 -07:00
vadimartynov
8aae82dd71 Added context for the MasterClient's methods to avoid endless loops (#5628)
* Added context for the MasterClient's methods to avoid endless loops

* Returned WithClient function. Added WithClientCustomGetMaster function

* Hid unused ctx arguments

* Using a common context for the KeepConnectedToMaster and WaitUntilConnected functions

* Changed the context termination check in the tryConnectToMaster function

* Added a child context to the tryConnectToMaster function

* Added a common context for KeepConnectedToMaster and WaitUntilConnected functions in benchmark
2024-06-14 11:40:34 -07:00
shenxingwuying
ee25ada732 reduce ambiguity about use memory_sequencer (#5555) 2024-04-29 21:51:00 -07:00
chrislu
55976ae04a avoid repeated calls to heavy-weighted viper 2024-04-18 09:09:45 -07:00
chrislu
d9490c5e1f rename 2024-04-18 08:47:45 -07:00
Nico D'Cotta
796b7508f3 Implement SRV lookups for filer (#4767) 2023-08-24 07:08:56 -07:00
chrislu
a315490f7d proxy to master uses http address
fix https://github.com/seaweedfs/seaweedfs/issues/4607
2023-07-04 11:45:21 -07:00
chrislu
adb90bd252 avoid lower casing the command
fix https://github.com/seaweedfs/seaweedfs/pull/4321
2023-03-19 21:20:46 -07:00
Konstantin Lebedev
b9933d5589 master server graceful stop (#3797) 2022-10-06 09:30:30 -07:00
Konstantin Lebedev
e90ab4ac60 avoid race conditions for OnPeerUpdate (#3525)
https://github.com/seaweedfs/seaweedfs/issues/3524
2022-08-26 10:18:49 -07:00
Patrick Schmidt
7b424a54dc Add raft server access mutex to avoid races (#3503) 2022-08-24 09:49:05 -07:00
chrislu
10414fd81c ping timeout at 15 seconds
this 72 minute timeout setting seems unreasonably long

15 seconds is around the time when a new raft leader should be elected.
2022-08-23 23:28:16 -07:00
askeipx
2e78a522ab remove old raft servers if they don't answer to pings for too long (#3398)
* remove old raft servers if they don't answer to pings for too long

add ping durations as options

rename ping fields

fix some todos

get masters through masterclient

raft remove server from leader

use raft servers to ping them

CheckMastersAlive for hashicorp raft only

* prepare blocking ping

* pass waitForReady as param

* pass waitForReady through all functions

* waitForReady works

* refactor

* remove unneeded params

* rollback unneeded changes

* fix
2022-08-23 23:18:21 -07:00
Konstantin Lebedev
4d4cd0948d avoid infinite loop WaitUntilConnected() (#3431)
https://github.com/seaweedfs/seaweedfs/issues/3421
2022-08-11 15:03:26 -07:00
Chris Lu
b59bc607bf Merge pull request #3338 from kmlebedev/issues/3083
rollback over onPeerUpdate implementation of automatic clean-up of failed servers in favor of synchronous ping
2022-08-01 08:23:10 -07:00
Konstantin Lebedev
a98f6d66a3 rollback over onPeerupdate implementation of automatic clean-up of failed servers in favor of synchronous ping 2022-08-01 12:51:41 +05:00
chrislu
26dbc6c905 move to https://github.com/seaweedfs/seaweedfs 2022-07-29 00:17:28 -07:00
chrislu
bb01b68fa0 refactor 2022-07-28 23:24:38 -07:00
chrislu
68065128b8 add dc and rack 2022-07-28 23:22:51 -07:00
chrislu
3828b8ce87 "github.com/chrislusf/raft" => "github.com/seaweedfs/raft" 2022-07-27 12:12:40 -07:00
Konstantin Lebedev
c88ea31f62 fix RUnlock of unlocked RWMutex 2022-07-26 12:57:07 +05:00
Konstantin Lebedev
3c42814b58 avoid deadlock 2022-07-21 17:15:10 +05:00
Konstantin Lebedev
93ca87b7cb use safe onPeerUpdateDoneCns 2022-07-21 15:51:14 +05:00
Konstantin Lebedev
7875470e74 onPeerUpdateGoroutineCount use int32 2022-07-20 18:40:50 +05:00
Konstantin Lebedev
6c390851e7 fix design 2022-07-20 18:08:12 +05:00
Konstantin Lebedev
f6a966b4fc add waiting log message 2022-07-20 00:31:57 +05:00
Konstantin Lebedev
6cfbfb0849 check for ping before deleting raft server
https://github.com/chrislusf/seaweedfs/issues/3083
2022-07-20 00:04:12 +05:00