seaweedFS/weed/pb/master.proto
Chris Lu 75a6a34528 dlm: resilient distributed locks via consistent hashing + backup replication (#8860)
* dlm: replace modulo hashing with consistent hash ring

Introduce HashRing with virtual nodes (CRC32-based consistent hashing)
to replace the modulo-based hashKeyToServer. When a filer node is
removed, only keys that hashed to that node are remapped to the next
server on the ring, leaving all other mappings stable. This is the
foundation for backup replication — the successor on the ring is
always the natural takeover node.
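
As a rough sketch of the idea (illustrative names and structure, not the actual HashRing code):

    import (
        "fmt"
        "hash/crc32"
        "sort"
    )

    // hashRing is a toy consistent hash ring with virtual nodes.
    type hashRing struct {
        virtualNodes int               // points placed on the ring per server
        points       []uint32          // sorted CRC32 hash points
        owners       map[uint32]string // hash point -> server address
    }

    func (r *hashRing) setServers(servers []string) {
        r.points = nil
        r.owners = make(map[uint32]string)
        for _, s := range servers {
            for i := 0; i < r.virtualNodes; i++ {
                p := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", s, i)))
                r.points = append(r.points, p)
                r.owners[p] = s
            }
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
    }

    // primaryAndBackup walks clockwise from the key's hash: the first point gives
    // the primary, the next point owned by a different server gives the backup,
    // i.e. the natural takeover node if the primary disappears.
    func (r *hashRing) primaryAndBackup(key string) (primary, backup string) {
        if len(r.points) == 0 {
            return "", ""
        }
        h := crc32.ChecksumIEEE([]byte(key))
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h }) % len(r.points)
        primary = r.owners[r.points[i]]
        for j := 1; j < len(r.points); j++ {
            if s := r.owners[r.points[(i+j)%len(r.points)]]; s != primary {
                return primary, s
            }
        }
        return primary, "" // single-server ring: no backup
    }

Removing a server only deletes that server's points; keys whose closest clockwise point belongs to a surviving server keep the same primary.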

* dlm: add Generation and IsBackup fields to Lock

Lock now carries IsBackup (whether this node holds the lock as a backup
replica) and Generation (a monotonic fencing token that increments on
each fresh acquisition and stays the same on renewal). Add helper methods:
AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock.
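
Roughly (an illustrative shape, not the exact struct):

    type Lock struct {
        Key        string
        Token      string
        Owner      string
        IsBackup   bool  // held only as a backup replica, not served to clients
        Generation int64 // fencing token: bumped on fresh acquisition, unchanged on renewal
        Seq        int64 // per-lock mutation counter, added later in this PR for replication ordering
    }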

* dlm: add ReplicateLock RPC and generation/is_backup proto fields

Add generation field to LockResponse for fencing tokens.
Add generation and is_backup fields to Lock message.
Add ReplicateLock RPC for primary-to-backup lock replication.
Add ReplicateLockRequest/ReplicateLockResponse messages.

* dlm: add async backup replication to DistributedLockManager

Route lock/unlock via the consistent hash ring's GetPrimaryAndBackup().
After a successful lock or unlock on the primary, asynchronously
replicate the operation to the backup server via ReplicateFunc
callback. Single-server deployments skip replication.
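
A sketch of that flow, building on the ring sketch above (method and field names are assumptions, not the real DistributedLockManager API):

    import "log"

    type lockManagerSketch struct {
        ring          *hashRing
        lockOnPrimary func(server string, lock *Lock) error // local apply or forward to the primary
        replicate     func(backup string, lock *Lock) error // ReplicateFunc-style callback to the backup
    }

    func (dlm *lockManagerSketch) lock(lock *Lock) error {
        primary, backup := dlm.ring.primaryAndBackup(lock.Key)
        if err := dlm.lockOnPrimary(primary, lock); err != nil {
            return err // the primary must accept the lock synchronously
        }
        if backup != "" && dlm.replicate != nil {
            go func() {
                // Best effort: a failed replication is logged, never surfaced to the caller.
                if err := dlm.replicate(backup, lock); err != nil {
                    log.Printf("replicate lock %s to backup %s: %v", lock.Key, backup, err)
                }
            }()
        }
        return nil
    }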

* dlm: add ReplicateLock handler and backup-aware topology changes

Add ReplicateLock gRPC handler for primary-to-backup replication.
Revise OnDlmChangeSnapshot to handle three cases on topology change:
- Promote backup locks when this node becomes primary
- Demote primary locks when this node becomes backup
- Transfer locks when this node is neither primary nor backup
Wire up SetupDlmReplication during filer server initialization.
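
The three cases reduce to something like this sketch, where promote, demote, and transfer stand in for the real promotion, demotion, and TransferLocks calls:

    func onRingChange(self string, ring *hashRing, locks []*Lock,
        promote func(*Lock), demote func(*Lock), transfer func(lock *Lock, newPrimary string)) {
        for _, lock := range locks {
            primary, backup := ring.primaryAndBackup(lock.Key)
            switch {
            case primary == self && lock.IsBackup:
                promote(lock) // this node just became primary: serve the lock authoritatively
            case backup == self && !lock.IsBackup:
                demote(lock) // this node now only backs up the key
            case primary != self && backup != self:
                transfer(lock, primary) // this node owns neither role: hand the lock to the new primary
            }
        }
    }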

* dlm: expose generation fencing token in lock client

LiveLock now captures the generation from LockResponse and exposes it
via Generation() method. Consumers can use this as a fencing token to
detect stale lock holders.
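
One hypothetical consumer-side pattern: pass Generation() along with every protected write, and have the resource remember the highest generation it has accepted per key (the resource-side code below is an assumption, not part of this PR):

    import (
        "fmt"
        "sync"
    )

    // fencedResource is a hypothetical downstream service protected by the lock.
    type fencedResource struct {
        mu      sync.Mutex
        highest map[string]int64 // lock key -> highest generation accepted so far
    }

    func (r *fencedResource) Write(key string, generation int64, data []byte) error {
        r.mu.Lock()
        defer r.mu.Unlock()
        if generation < r.highest[key] {
            // The writer acquired the lock before the current holder did; even if it
            // still believes it owns the lock, its older generation gives it away.
            return fmt.Errorf("stale fencing token %d < %d", generation, r.highest[key])
        }
        if r.highest == nil {
            r.highest = make(map[string]int64)
        }
        r.highest[key] = generation
        // ... apply data ...
        return nil
    }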

* dlm: update empty folder cleaner to use consistent hash ring

Replace the local modulo-based hashKeyToServer with LockRing.GetPrimary(),
which uses the shared consistent hash ring for folder ownership.

* dlm: add unit tests for consistent hash ring

Test basic operations, consistency on server removal (only keys from
removed server move), backup-is-successor property (backup becomes
new primary when primary is removed), and key distribution balance.
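
The minimal-disruption check is essentially (shown against the illustrative hashRing above; the real tests exercise the actual ring):

    import (
        "fmt"
        "testing"
    )

    func TestMinimalDisruption(t *testing.T) {
        ring := &hashRing{virtualNodes: 100}
        ring.setServers([]string{"filer1:8888", "filer2:8888", "filer3:8888"})

        before := make(map[string]string)
        for i := 0; i < 1000; i++ {
            key := fmt.Sprintf("key-%d", i)
            primary, _ := ring.primaryAndBackup(key)
            before[key] = primary
        }

        ring.setServers([]string{"filer1:8888", "filer2:8888"}) // remove filer3

        for key, prev := range before {
            primary, _ := ring.primaryAndBackup(key)
            if prev != "filer3:8888" && primary != prev {
                t.Errorf("key %s moved from %s to %s although its server survived", key, prev, primary)
            }
        }
    }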

* dlm: add integration tests for lock replication failure scenarios

Test cases:
- Primary crash with backup promotion (backup has valid token)
- Backup crash with primary continuing
- Both primary and backup crash (lock lost, re-acquirable)
- Rolling restart across all nodes
- Generation fencing token increments on new acquisition
- Replication failure (primary still works independently)
- Unlock replicates deletion to backup
- Lock survives server addition (topology change)
- Consistent hashing minimal disruption (only removed server's keys move)

* dlm: address PR review findings

1. Causal replication ordering: Add per-lock sequence number (Seq) that
   increments on every mutation. Backup rejects incoming mutations with
   seq <= current seq, preventing stale async replications from
   overwriting newer state. Unlock replication also carries seq and is
   rejected if stale (see the sketch after this list).

2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to
   the new primary first and only demotes to backup after a successful
   TransferLocks RPC. If the transfer fails, the lock stays as primary
   on this node.

3. SetSnapshot candidateServers leak: Replace the candidateServers map
   entirely instead of appending, so removed servers don't linger.

4. TransferLocks preserves Generation and Seq: InsertLock now accepts
   generation and seq parameters. After accepting a transferred lock,
   the receiving node re-replicates to its backup.

5. Rolling restart test: Add re-replication step after promotion and
   assert survivedCount > 0. Add TestDLM_StaleReplicationRejected.

6. Mixed-version upgrade note: Add comment on HashRing documenting that
   all filer nodes must be upgraded together.
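
The sequence check from finding 1, sketched with illustrative backup-side names:

    import (
        "fmt"
        "sync"
    )

    type backupStoreSketch struct {
        mu    sync.Mutex
        locks map[string]*Lock
    }

    // applyReplication accepts a replicated mutation only if it is newer than what
    // this backup already holds, so reordered async replications cannot overwrite
    // fresher state.
    func (b *backupStoreSketch) applyReplication(incoming *Lock) error {
        b.mu.Lock()
        defer b.mu.Unlock()
        if existing, ok := b.locks[incoming.Key]; ok && incoming.Seq <= existing.Seq {
            return fmt.Errorf("stale replication for %s: seq %d <= %d", incoming.Key, incoming.Seq, existing.Seq)
        }
        if b.locks == nil {
            b.locks = make(map[string]*Lock)
        }
        b.locks[incoming.Key] = incoming
        return nil
    }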

* dlm: serve renewals locally during transfer window on node join

When a new node joins and steals hash ranges from surviving nodes,
there's a window between ring update and lock transfer where the
client gets redirected to a node that doesn't have the lock yet.

Fix: if the ring says primary != self but we still hold the lock
locally (non-backup, matching token), serve the renewal/unlock here
rather than redirecting. The lock will be transferred by
OnDlmChangeSnapshot, and subsequent requests will go to the new
primary once the transfer completes.
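
In sketch form (names are illustrative; renewLocally stands in for the actual renewal logic):

    import "fmt"

    // held is this node's local lock table.
    func serveOrRedirect(self, key, token string, ring *hashRing,
        held map[string]*Lock, renewLocally func(*Lock) error) (redirectTo string, err error) {
        primary, _ := ring.primaryAndBackup(key)
        l, ok := held[key]
        if primary != self {
            if ok && !l.IsBackup && l.Token == token {
                // Transfer window: the ring already points at the new primary, but the
                // lock has not been handed over yet. Serve the renewal here; the lock
                // will be transferred by OnDlmChangeSnapshot, and later requests will
                // follow the redirect to the new primary.
                return "", renewLocally(l)
            }
            return primary, nil // normal case: redirect the client
        }
        if !ok {
            return "", fmt.Errorf("lock %s not found on primary", key)
        }
        return "", renewLocally(l)
    }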

Add tests:
- TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption
  when a node drops and a new one joins (14/100 surviving-node locks
  disrupted, all handled by transfer logic)
- TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on
  old primary during the transfer window

* dlm: master-managed lock ring with stabilization batching

The master now owns the lock ring membership. Instead of filers
independently reacting to individual ClusterNodeUpdate add/remove
events, the master:

1. Tracks filer membership in LockRingManager
2. Batches rapid changes with a 1-second stabilization timer
   (e.g., a node drop + join within 1 second → single ring update)
3. Broadcasts the complete ring snapshot atomically via the new
   LockRingUpdate message in KeepConnectedResponse

Filers receive the ring as a complete snapshot and apply it via
SetSnapshot, ensuring all filers converge to the same ring state
without intermediate churn.

This eliminates the double-churn problem where a rapid drop+join
would fire two separate ring mutations, each triggering lock
transfers and disrupting ownership on surviving nodes.
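
The batching is essentially a reset-on-change timer on the master; a sketch with illustrative names:

    import (
        "sync"
        "time"
    )

    type ringManagerSketch struct {
        mu        sync.Mutex
        pending   []string    // latest complete filer member list
        timer     *time.Timer
        broadcast func(servers []string) // sends a LockRingUpdate snapshot to all filers
    }

    func (m *ringManagerSketch) onMembershipChange(servers []string) {
        m.mu.Lock()
        defer m.mu.Unlock()
        m.pending = servers // keep only the latest state; intermediate churn is dropped
        if m.timer != nil {
            m.timer.Stop() // coalesce rapid changes (e.g. drop + rejoin within 1s) into one update
        }
        m.timer = time.AfterFunc(time.Second, func() {
            m.mu.Lock()
            snapshot := append([]string(nil), m.pending...)
            m.mu.Unlock()
            m.broadcast(snapshot) // one atomic ring snapshot
        })
    }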

* dlm: track ring version, reject stale updates, remove dead code

SetSnapshot now takes a version parameter from the master. Stale
updates (version < current) are rejected, preventing reordered
messages from overwriting a newer ring state. Version 0 is always
accepted for bootstrap.

Remove AddServer/RemoveServer from LockRing — the ring is now
exclusively managed by the master via SetSnapshot. Remove the
candidateServers map that was only used by those methods.
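
On the filer side, the stale-update rejection looks roughly like this (illustrative names, reusing the hashRing sketch above):

    import "sync"

    type lockRingSketch struct {
        mu      sync.Mutex
        version int64
        ring    *hashRing
    }

    func (r *lockRingSketch) SetSnapshot(servers []string, version int64) {
        r.mu.Lock()
        defer r.mu.Unlock()
        if version != 0 && version < r.version {
            return // a reordered, older update must not overwrite a newer ring state
        }
        r.version = version        // version 0 is always accepted for bootstrap
        r.ring.setServers(servers) // applied together with the version, under the same lock
    }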

* dlm: fix SelectLocks data race, advance generation on backup insert

- SelectLocks: change RLock to Lock since the function deletes map
  entries, which is a write operation and causes a data race under RLock.
- InsertBackupLock: advance nextGeneration to at least the incoming
  generation so that after failover promotion, new lock acquisitions
  get a generation strictly greater than any replicated lock.
- Bump replication failure log from V(1) to Warningf for production
  visibility.
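
The backup-insert generation advance, sketched with illustrative names:

    import "sync"

    type lockTableSketch struct {
        mu             sync.Mutex
        held           map[string]*Lock
        nextGeneration int64
    }

    func (t *lockTableSketch) insertBackupLock(incoming *Lock) {
        t.mu.Lock()
        defer t.mu.Unlock()
        incoming.IsBackup = true
        if t.held == nil {
            t.held = make(map[string]*Lock)
        }
        t.held[incoming.Key] = incoming
        if incoming.Generation >= t.nextGeneration {
            // After a failover promotion, fresh acquisitions on this node must get a
            // generation strictly greater than any lock ever replicated to it.
            t.nextGeneration = incoming.Generation + 1
        }
    }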

* dlm: fix SetSnapshot race, test reliability, timer edge cases

- SetSnapshot: hold LockRing lock through both version update and
  Ring.SetServers() so they're atomic. Prevents a concurrent caller
  from seeing the new version but applying stale servers.
- Transfer window test: search for a key that actually moves primary
  when filer4 joins, instead of relying on a fixed key that may not.
- renewLock redirect: pass the existing token to the new primary
  instead of an empty string, so redirected renewals work correctly.
- scheduleBroadcast: check timer.Stop() return value. If the timer
  already fired, the callback picks up latest state.
- FlushPending: only broadcast if timer.Stop() returns true (timer
  was still pending). If false, the callback is already running.
- Fix test comment: "idempotent" → "accepted, state-changing".

* dlm: use wall-clock nanoseconds for lock ring version

The lock ring version was an in-memory counter that reset to 0 on
master restart. A filer that had seen version 5 would reject version 1
from the restarted master.

Fix: use time.Now().UnixNano() as the version. This survives master
restarts without persistence — the restarted master produces a
version greater than any pre-restart value.

* dlm: treat expired lock owners as missing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: reject stale lock transfers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: order replication by generation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: bootstrap lock ring on reconnect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 23:29:56 -07:00

syntax = "proto3";
package master_pb;
option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/master_pb";
import "volume_server.proto";
//////////////////////////////////////////////////
service Seaweed {
    rpc SendHeartbeat (stream Heartbeat) returns (stream HeartbeatResponse) {
    }
    rpc KeepConnected (stream KeepConnectedRequest) returns (stream KeepConnectedResponse) {
    }
    rpc LookupVolume (LookupVolumeRequest) returns (LookupVolumeResponse) {
    }
    rpc Assign (AssignRequest) returns (AssignResponse) {
    }
    rpc StreamAssign (stream AssignRequest) returns (stream AssignResponse) {
    }
    rpc Statistics (StatisticsRequest) returns (StatisticsResponse) {
    }
    rpc CollectionList (CollectionListRequest) returns (CollectionListResponse) {
    }
    rpc CollectionDelete (CollectionDeleteRequest) returns (CollectionDeleteResponse) {
    }
    rpc VolumeList (VolumeListRequest) returns (VolumeListResponse) {
    }
    rpc LookupEcVolume (LookupEcVolumeRequest) returns (LookupEcVolumeResponse) {
    }
    rpc VacuumVolume (VacuumVolumeRequest) returns (VacuumVolumeResponse) {
    }
    rpc DisableVacuum (DisableVacuumRequest) returns (DisableVacuumResponse) {
    }
    rpc EnableVacuum (EnableVacuumRequest) returns (EnableVacuumResponse) {
    }
    rpc VolumeMarkReadonly (VolumeMarkReadonlyRequest) returns (VolumeMarkReadonlyResponse) {
    }
    rpc GetMasterConfiguration (GetMasterConfigurationRequest) returns (GetMasterConfigurationResponse) {
    }
    rpc ListClusterNodes (ListClusterNodesRequest) returns (ListClusterNodesResponse) {
    }
    rpc LeaseAdminToken (LeaseAdminTokenRequest) returns (LeaseAdminTokenResponse) {
    }
    rpc ReleaseAdminToken (ReleaseAdminTokenRequest) returns (ReleaseAdminTokenResponse) {
    }
    rpc Ping (PingRequest) returns (PingResponse) {
    }
    rpc RaftListClusterServers (RaftListClusterServersRequest) returns (RaftListClusterServersResponse) {
    }
    rpc RaftAddServer (RaftAddServerRequest) returns (RaftAddServerResponse) {
    }
    rpc RaftRemoveServer (RaftRemoveServerRequest) returns (RaftRemoveServerResponse) {
    }
    rpc RaftLeadershipTransfer (RaftLeadershipTransferRequest) returns (RaftLeadershipTransferResponse) {
    }
    rpc VolumeGrow (VolumeGrowRequest) returns (VolumeGrowResponse) {
    }
}
//////////////////////////////////////////////////
message DiskTag {
    uint32 disk_id = 1;
    repeated string tags = 2;
}
message Heartbeat {
    string ip = 1;
    uint32 port = 2;
    string public_url = 3;
    uint64 max_file_key = 5;
    string data_center = 6;
    string rack = 7;
    uint32 admin_port = 8;
    repeated VolumeInformationMessage volumes = 9;
    // delta volumes
    repeated VolumeShortInformationMessage new_volumes = 10;
    repeated VolumeShortInformationMessage deleted_volumes = 11;
    bool has_no_volumes = 12;
    // erasure coding
    repeated VolumeEcShardInformationMessage ec_shards = 16;
    // delta erasure coding shards
    repeated VolumeEcShardInformationMessage new_ec_shards = 17;
    repeated VolumeEcShardInformationMessage deleted_ec_shards = 18;
    bool has_no_ec_shards = 19;
    map<string, uint32> max_volume_counts = 4;
    uint32 grpc_port = 20;
    repeated string location_uuids = 21;
    string id = 22; // volume server id, independent of ip:port for stable identification
    // state flags
    volume_server_pb.VolumeServerState state = 23;
    repeated DiskTag disk_tags = 24;
}
message HeartbeatResponse {
    uint64 volume_size_limit = 1;
    string leader = 2;
    string metrics_address = 3;
    uint32 metrics_interval_seconds = 4;
    repeated StorageBackend storage_backends = 5;
    repeated string duplicated_uuids = 6;
    bool preallocate = 7;
}
message VolumeInformationMessage {
    uint32 id = 1;
    uint64 size = 2;
    string collection = 3;
    uint64 file_count = 4;
    uint64 delete_count = 5;
    uint64 deleted_byte_count = 6;
    bool read_only = 7;
    uint32 replica_placement = 8;
    uint32 version = 9;
    uint32 ttl = 10;
    uint32 compact_revision = 11;
    int64 modified_at_second = 12;
    string remote_storage_name = 13;
    string remote_storage_key = 14;
    string disk_type = 15;
    uint32 disk_id = 16;
}
message VolumeShortInformationMessage {
    uint32 id = 1;
    string collection = 3;
    uint32 replica_placement = 8;
    uint32 version = 9;
    uint32 ttl = 10;
    string disk_type = 15;
    uint32 disk_id = 16;
}
message VolumeEcShardInformationMessage {
    uint32 id = 1;
    string collection = 2;
    uint32 ec_index_bits = 3;
    string disk_type = 4;
    uint64 expire_at_sec = 5; // used to record the destruction time of ec volume
    uint32 disk_id = 6;
    repeated int64 shard_sizes = 7; // optimized: sizes for shards in order of set bits in ec_index_bits
}
message StorageBackend {
    string type = 1;
    string id = 2;
    map<string, string> properties = 3;
}
message Empty {
}
message SuperBlockExtra {
    message ErasureCoding {
        uint32 data = 1;
        uint32 parity = 2;
        repeated uint32 volume_ids = 3;
    }
    ErasureCoding erasure_coding = 1;
}
message KeepConnectedRequest {
    string client_type = 1;
    string client_address = 3;
    string version = 4;
    string filer_group = 5;
    string data_center = 6;
    string rack = 7;
}
message VolumeLocation {
    string url = 1;
    string public_url = 2;
    repeated uint32 new_vids = 3;
    repeated uint32 deleted_vids = 4;
    string leader = 5; // optional when leader is not itself
    string data_center = 6; // optional when DataCenter is in use
    uint32 grpc_port = 7;
    repeated uint32 new_ec_vids = 8;
    repeated uint32 deleted_ec_vids = 9;
}
message ClusterNodeUpdate {
    string node_type = 1;
    string address = 2;
    bool is_add = 4;
    string filer_group = 5;
    int64 created_at_ns = 6;
}
message KeepConnectedResponse {
    VolumeLocation volume_location = 1;
    ClusterNodeUpdate cluster_node_update = 2;
    LockRingUpdate lock_ring_update = 3;
}
// LockRingUpdate is sent by the master to all filers when the lock ring
// membership changes. The master batches rapid changes (e.g., node drop + join)
// and sends the complete member list atomically, avoiding intermediate ring
// states that would cause unnecessary lock churn.
message LockRingUpdate {
    string filer_group = 1;
    repeated string servers = 2;
    int64 version = 3;
}
message LookupVolumeRequest {
    repeated string volume_or_file_ids = 1;
    string collection = 2; // optional, a bit faster if provided.
}
message LookupVolumeResponse {
    message VolumeIdLocation {
        string volume_or_file_id = 1;
        repeated Location locations = 2;
        string error = 3;
        string auth = 4;
    }
    repeated VolumeIdLocation volume_id_locations = 1;
}
message Location {
    string url = 1;
    string public_url = 2;
    uint32 grpc_port = 3;
    string data_center = 4;
}
message AssignRequest {
    uint64 count = 1;
    string replication = 2;
    string collection = 3;
    string ttl = 4;
    string data_center = 5;
    string rack = 6;
    string data_node = 7;
    uint32 memory_map_max_size_mb = 8;
    uint32 writable_volume_count = 9;
    string disk_type = 10;
}
message VolumeGrowRequest {
    uint32 writable_volume_count = 1;
    string replication = 2;
    string collection = 3;
    string ttl = 4;
    string data_center = 5;
    string rack = 6;
    string data_node = 7;
    uint32 memory_map_max_size_mb = 8;
    string disk_type = 9;
}
message AssignResponse {
    string fid = 1;
    uint64 count = 4;
    string error = 5;
    string auth = 6;
    repeated Location replicas = 7;
    Location location = 8;
}
message StatisticsRequest {
    string replication = 1;
    string collection = 2;
    string ttl = 3;
    string disk_type = 4;
}
message StatisticsResponse {
    uint64 total_size = 4;
    uint64 used_size = 5;
    uint64 file_count = 6;
}
//
// collection related
//
message Collection {
    string name = 1;
}
message CollectionListRequest {
    bool include_normal_volumes = 1;
    bool include_ec_volumes = 2;
}
message CollectionListResponse {
    repeated Collection collections = 1;
}
message CollectionDeleteRequest {
    string name = 1;
}
message CollectionDeleteResponse {
}
//
// volume related
//
message DiskInfo {
    string type = 1;
    int64 volume_count = 2;
    int64 max_volume_count = 3;
    int64 free_volume_count = 4;
    int64 active_volume_count = 5;
    repeated VolumeInformationMessage volume_infos = 6;
    repeated VolumeEcShardInformationMessage ec_shard_infos = 7;
    int64 remote_volume_count = 8;
    uint32 disk_id = 9;
    repeated string tags = 10;
}
message DataNodeInfo {
    string id = 1;
    map<string, DiskInfo> diskInfos = 2;
    uint32 grpc_port = 3;
    string address = 4; // ip:port for connecting to the volume server
}
message RackInfo {
    string id = 1;
    repeated DataNodeInfo data_node_infos = 2;
    map<string, DiskInfo> diskInfos = 3;
}
message DataCenterInfo {
    string id = 1;
    repeated RackInfo rack_infos = 2;
    map<string, DiskInfo> diskInfos = 3;
}
message TopologyInfo {
    string id = 1;
    repeated DataCenterInfo data_center_infos = 2;
    map<string, DiskInfo> diskInfos = 3;
}
message VolumeListRequest {
}
message VolumeListResponse {
    TopologyInfo topology_info = 1;
    uint64 volume_size_limit_mb = 2;
}
message LookupEcVolumeRequest {
    uint32 volume_id = 1;
}
message LookupEcVolumeResponse {
    uint32 volume_id = 1;
    message EcShardIdLocation {
        uint32 shard_id = 1;
        repeated Location locations = 2;
    }
    repeated EcShardIdLocation shard_id_locations = 2;
}
message VacuumVolumeRequest {
    float garbage_threshold = 1;
    uint32 volume_id = 2;
    string collection = 3;
}
message VacuumVolumeResponse {
}
message DisableVacuumRequest {
    bool by_plugin = 1;
}
message DisableVacuumResponse {
}
message EnableVacuumRequest {
    bool by_plugin = 1;
}
message EnableVacuumResponse {
}
message VolumeMarkReadonlyRequest {
    string ip = 1;
    uint32 port = 2;
    uint32 volume_id = 4;
    string collection = 5;
    uint32 replica_placement = 6;
    uint32 version = 7;
    uint32 ttl = 8;
    string disk_type = 9;
    bool is_readonly = 10;
}
message VolumeMarkReadonlyResponse {
}
message GetMasterConfigurationRequest {
}
message GetMasterConfigurationResponse {
    string metrics_address = 1;
    uint32 metrics_interval_seconds = 2;
    repeated StorageBackend storage_backends = 3;
    string default_replication = 4;
    string leader = 5;
    uint32 volume_size_limit_m_b = 6;
    bool volume_preallocate = 7;
    // MIGRATION: fields 8-9 help migrate master.toml [master.maintenance] to admin script plugin. Remove after March 2027.
    string maintenance_scripts = 8;
    uint32 maintenance_sleep_minutes = 9;
}
message ListClusterNodesRequest {
    string client_type = 1;
    string filer_group = 2;
    int32 limit = 4;
}
message ListClusterNodesResponse {
    message ClusterNode {
        string address = 1;
        string version = 2;
        int64 created_at_ns = 4;
        string data_center = 5;
        string rack = 6;
    }
    repeated ClusterNode cluster_nodes = 1;
}
message LeaseAdminTokenRequest {
    int64 previous_token = 1;
    int64 previous_lock_time = 2;
    string lock_name = 3;
    string client_name = 4;
    string message = 5;
}
message LeaseAdminTokenResponse {
    int64 token = 1;
    int64 lock_ts_ns = 2;
}
message ReleaseAdminTokenRequest {
    int64 previous_token = 1;
    int64 previous_lock_time = 2;
    string lock_name = 3;
}
message ReleaseAdminTokenResponse {
}
message PingRequest {
    string target = 1; // default to ping itself
    string target_type = 2;
}
message PingResponse {
    int64 start_time_ns = 1;
    int64 remote_time_ns = 2;
    int64 stop_time_ns = 3;
}
message RaftAddServerRequest {
    string id = 1;
    string address = 2;
    bool voter = 3;
}
message RaftAddServerResponse {
}
message RaftRemoveServerRequest {
    string id = 1;
    bool force = 2;
}
message RaftRemoveServerResponse {
}
message RaftListClusterServersRequest {
}
message RaftListClusterServersResponse {
    message ClusterServers {
        string id = 1;
        string address = 2;
        string suffrage = 3;
        bool isLeader = 4;
    }
    repeated ClusterServers cluster_servers = 1;
}
message RaftLeadershipTransferRequest {
    string target_id = 1; // Optional: target server ID. If empty, transfers to any eligible follower
    string target_address = 2; // Optional: target server address. Required if target_id is specified
}
message RaftLeadershipTransferResponse {
    string previous_leader = 1;
    string new_leader = 2;
}
message VolumeGrowResponse {
}