Files
seaweedFS/weed/pb/filer.proto
Chris Lu 75a6a34528 dlm: resilient distributed locks via consistent hashing + backup replication (#8860)
* dlm: replace modulo hashing with consistent hash ring

Introduce HashRing with virtual nodes (CRC32-based consistent hashing)
to replace the modulo-based hashKeyToServer. When a filer node is
removed, only keys that hashed to that node are remapped to the next
server on the ring, leaving all other mappings stable. This is the
foundation for backup replication — the successor on the ring is
always the natural takeover node.

* dlm: add Generation and IsBackup fields to Lock

Lock now carries IsBackup (whether this node holds the lock as a backup
replica) and Generation (a monotonic fencing token that increments on
each fresh acquisition, stays the same on renewal). Add helper methods:
AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock.

* dlm: add ReplicateLock RPC and generation/is_backup proto fields

Add generation field to LockResponse for fencing tokens.
Add generation and is_backup fields to Lock message.
Add ReplicateLock RPC for primary-to-backup lock replication.
Add ReplicateLockRequest/ReplicateLockResponse messages.

* dlm: add async backup replication to DistributedLockManager

Route lock/unlock via consistent hash ring's GetPrimaryAndBackup().
After a successful lock or unlock on the primary, asynchronously
replicate the operation to the backup server via ReplicateFunc
callback. Single-server deployments skip replication.

* dlm: add ReplicateLock handler and backup-aware topology changes

Add ReplicateLock gRPC handler for primary-to-backup replication.
Revise OnDlmChangeSnapshot to handle three cases on topology change:
- Promote backup locks when this node becomes primary
- Demote primary locks when this node becomes backup
- Transfer locks when this node is neither primary nor backup
Wire up SetupDlmReplication during filer server initialization.

* dlm: expose generation fencing token in lock client

LiveLock now captures the generation from LockResponse and exposes it
via Generation() method. Consumers can use this as a fencing token to
detect stale lock holders.

* dlm: update empty folder cleaner to use consistent hash ring

Replace local modulo-based hashKeyToServer with LockRing.GetPrimary()
which uses the shared consistent hash ring for folder ownership.

* dlm: add unit tests for consistent hash ring

Test basic operations, consistency on server removal (only keys from
removed server move), backup-is-successor property (backup becomes
new primary when primary is removed), and key distribution balance.

* dlm: add integration tests for lock replication failure scenarios

Test cases:
- Primary crash with backup promotion (backup has valid token)
- Backup crash with primary continuing
- Both primary and backup crash (lock lost, re-acquirable)
- Rolling restart across all nodes
- Generation fencing token increments on new acquisition
- Replication failure (primary still works independently)
- Unlock replicates deletion to backup
- Lock survives server addition (topology change)
- Consistent hashing minimal disruption (only removed server's keys move)

* dlm: address PR review findings

1. Causal replication ordering: Add per-lock sequence number (Seq) that
   increments on every mutation. Backup rejects incoming mutations with
   seq <= current seq, preventing stale async replications from
   overwriting newer state. Unlock replication also carries seq and is
   rejected if stale.

2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to
   the new primary first and only demotes to backup after a successful
   TransferLocks RPC. If the transfer fails, the lock stays as primary
   on this node.

3. SetSnapshot candidateServers leak: Replace the candidateServers map
   entirely instead of appending, so removed servers don't linger.

4. TransferLocks preserves Generation and Seq: InsertLock now accepts
   generation and seq parameters. After accepting a transferred lock,
   the receiving node re-replicates to its backup.

5. Rolling restart test: Add re-replication step after promotion and
   assert survivedCount > 0. Add TestDLM_StaleReplicationRejected.

6. Mixed-version upgrade note: Add comment on HashRing documenting that
   all filer nodes must be upgraded together.

* dlm: serve renewals locally during transfer window on node join

When a new node joins and steals hash ranges from surviving nodes,
there's a window between ring update and lock transfer where the
client gets redirected to a node that doesn't have the lock yet.

Fix: if the ring says primary != self but we still hold the lock
locally (non-backup, matching token), serve the renewal/unlock here
rather than redirecting. The lock will be transferred by
OnDlmChangeSnapshot, and subsequent requests will go to the new
primary once the transfer completes.

Add tests:
- TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption
  when a node drops and a new one joins (14/100 surviving-node locks
  disrupted, all handled by transfer logic)
- TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on
  old primary during the transfer window

* dlm: master-managed lock ring with stabilization batching

The master now owns the lock ring membership. Instead of filers
independently reacting to individual ClusterNodeUpdate add/remove
events, the master:

1. Tracks filer membership in LockRingManager
2. Batches rapid changes with a 1-second stabilization timer
   (e.g., a node drop + join within 1 second → single ring update)
3. Broadcasts the complete ring snapshot atomically via the new
   LockRingUpdate message in KeepConnectedResponse

Filers receive the ring as a complete snapshot and apply it via
SetSnapshot, ensuring all filers converge to the same ring state
without intermediate churn.

This eliminates the double-churn problem where a rapid drop+join
would fire two separate ring mutations, each triggering lock
transfers and disrupting ownership on surviving nodes.

* dlm: track ring version, reject stale updates, remove dead code

SetSnapshot now takes a version parameter from the master. Stale
updates (version < current) are rejected, preventing reordered
messages from overwriting a newer ring state. Version 0 is always
accepted for bootstrap.

Remove AddServer/RemoveServer from LockRing — the ring is now
exclusively managed by the master via SetSnapshot. Remove the
candidateServers map that was only used by those methods.

* dlm: fix SelectLocks data race, advance generation on backup insert

- SelectLocks: change RLock to Lock since the function deletes map
  entries, which is a write operation and causes a data race under RLock.
- InsertBackupLock: advance nextGeneration to at least the incoming
  generation so that after failover promotion, new lock acquisitions
  get a generation strictly greater than any replicated lock.
- Bump replication failure log from V(1) to Warningf for production
  visibility.

* dlm: fix SetSnapshot race, test reliability, timer edge cases

- SetSnapshot: hold LockRing lock through both version update and
  Ring.SetServers() so they're atomic. Prevents a concurrent caller
  from seeing the new version but applying stale servers.
- Transfer window test: search for a key that actually moves primary
  when filer4 joins, instead of relying on a fixed key that may not.
- renewLock redirect: pass the existing token to the new primary
  instead of empty string, so redirected renewals work correctly.
- scheduleBroadcast: check timer.Stop() return value. If the timer
  already fired, the callback picks up latest state.
- FlushPending: only broadcast if timer.Stop() returns true (timer
  was still pending). If false, the callback is already running.
- Fix test comment: "idempotent" → "accepted, state-changing".

* dlm: use wall-clock nanoseconds for lock ring version

The lock ring version was an in-memory counter that reset to 0 on
master restart. A filer that had seen version 5 would reject version 1
from the restarted master.

Fix: use time.Now().UnixNano() as the version. This survives master
restarts without persistence — the restarted master produces a
version greater than any pre-restart value.

* dlm: treat expired lock owners as missing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: reject stale lock transfers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: order replication by generation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: bootstrap lock ring on reconnect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 23:29:56 -07:00

599 lines
16 KiB
Protocol Buffer

syntax = "proto3";
package filer_pb;
option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/filer_pb";
option java_package = "seaweedfs.client";
option java_outer_classname = "FilerProto";
//////////////////////////////////////////////////
service SeaweedFiler {
rpc LookupDirectoryEntry (LookupDirectoryEntryRequest) returns (LookupDirectoryEntryResponse) {
}
rpc ListEntries (ListEntriesRequest) returns (stream ListEntriesResponse) {
}
rpc CreateEntry (CreateEntryRequest) returns (CreateEntryResponse) {
}
rpc UpdateEntry (UpdateEntryRequest) returns (UpdateEntryResponse) {
}
rpc AppendToEntry (AppendToEntryRequest) returns (AppendToEntryResponse) {
}
rpc DeleteEntry (DeleteEntryRequest) returns (DeleteEntryResponse) {
}
rpc AtomicRenameEntry (AtomicRenameEntryRequest) returns (AtomicRenameEntryResponse) {
}
rpc StreamRenameEntry (StreamRenameEntryRequest) returns (stream StreamRenameEntryResponse) {
}
rpc StreamMutateEntry (stream StreamMutateEntryRequest) returns (stream StreamMutateEntryResponse) {
}
rpc AssignVolume (AssignVolumeRequest) returns (AssignVolumeResponse) {
}
rpc LookupVolume (LookupVolumeRequest) returns (LookupVolumeResponse) {
}
rpc CollectionList (CollectionListRequest) returns (CollectionListResponse) {
}
rpc DeleteCollection (DeleteCollectionRequest) returns (DeleteCollectionResponse) {
}
rpc Statistics (StatisticsRequest) returns (StatisticsResponse) {
}
rpc Ping (PingRequest) returns (PingResponse) {
}
rpc GetFilerConfiguration (GetFilerConfigurationRequest) returns (GetFilerConfigurationResponse) {
}
rpc TraverseBfsMetadata (TraverseBfsMetadataRequest) returns (stream TraverseBfsMetadataResponse) {
}
rpc SubscribeMetadata (SubscribeMetadataRequest) returns (stream SubscribeMetadataResponse) {
}
rpc SubscribeLocalMetadata (SubscribeMetadataRequest) returns (stream SubscribeMetadataResponse) {
}
rpc KvGet (KvGetRequest) returns (KvGetResponse) {
}
rpc KvPut (KvPutRequest) returns (KvPutResponse) {
}
rpc CacheRemoteObjectToLocalCluster (CacheRemoteObjectToLocalClusterRequest) returns (CacheRemoteObjectToLocalClusterResponse) {
}
rpc DistributedLock(LockRequest) returns (LockResponse) {
}
rpc DistributedUnlock(UnlockRequest) returns (UnlockResponse) {
}
rpc FindLockOwner(FindLockOwnerRequest) returns (FindLockOwnerResponse) {
}
// distributed lock management internal use only
rpc TransferLocks(TransferLocksRequest) returns (TransferLocksResponse) {
}
rpc ReplicateLock(ReplicateLockRequest) returns (ReplicateLockResponse) {
}
}
//////////////////////////////////////////////////
message LookupDirectoryEntryRequest {
string directory = 1;
string name = 2;
}
message LookupDirectoryEntryResponse {
Entry entry = 1;
}
message ListEntriesRequest {
string directory = 1;
string prefix = 2;
string startFromFileName = 3;
bool inclusiveStartFrom = 4;
uint32 limit = 5;
int64 snapshot_ts_ns = 6;
}
message ListEntriesResponse {
Entry entry = 1;
int64 snapshot_ts_ns = 2;
}
message RemoteEntry {
string storage_name = 1;
int64 last_local_sync_ts_ns = 2;
string remote_e_tag = 3;
int64 remote_mtime = 4;
int64 remote_size = 5;
}
message Entry {
string name = 1;
bool is_directory = 2;
repeated FileChunk chunks = 3;
FuseAttributes attributes = 4;
map<string, bytes> extended = 5;
bytes hard_link_id = 7;
int32 hard_link_counter = 8; // only exists in hard link meta data
bytes content = 9; // if not empty, the file content
RemoteEntry remote_entry = 10;
int64 quota = 11; // for bucket only. Positive/Negative means enabled/disabled.
int64 worm_enforced_at_ts_ns = 12;
}
message FullEntry {
string dir = 1;
Entry entry = 2;
}
message EventNotification {
Entry old_entry = 1;
Entry new_entry = 2;
bool delete_chunks = 3;
string new_parent_path = 4;
bool is_from_other_cluster = 5;
repeated int32 signatures = 6;
}
enum SSEType {
NONE = 0; // No server-side encryption
SSE_C = 1; // Server-Side Encryption with Customer-Provided Keys
SSE_KMS = 2; // Server-Side Encryption with KMS-Managed Keys
SSE_S3 = 3; // Server-Side Encryption with S3-Managed Keys
}
message FileChunk {
string file_id = 1; // to be deprecated
int64 offset = 2;
uint64 size = 3;
int64 modified_ts_ns = 4;
string e_tag = 5;
string source_file_id = 6; // to be deprecated
FileId fid = 7;
FileId source_fid = 8;
bytes cipher_key = 9;
bool is_compressed = 10;
bool is_chunk_manifest = 11; // content is a list of FileChunks
SSEType sse_type = 12; // Server-side encryption type
bytes sse_metadata = 13; // Serialized SSE metadata for this chunk (SSE-C, SSE-KMS, or SSE-S3)
}
message FileChunkManifest {
repeated FileChunk chunks = 1;
}
message FileId {
uint32 volume_id = 1;
uint64 file_key = 2;
fixed32 cookie = 3;
}
message FuseAttributes {
uint64 file_size = 1;
int64 mtime = 2; // unix time in seconds
uint32 file_mode = 3;
uint32 uid = 4;
uint32 gid = 5;
int64 crtime = 6; // unix time in seconds
string mime = 7;
int32 ttl_sec = 10;
string user_name = 11; // for hdfs
repeated string group_name = 12; // for hdfs
string symlink_target = 13;
bytes md5 = 14;
uint32 rdev = 16;
uint64 inode = 17;
}
message CreateEntryRequest {
string directory = 1;
Entry entry = 2;
bool o_excl = 3;
bool is_from_other_cluster = 4;
repeated int32 signatures = 5;
bool skip_check_parent_directory = 6;
}
// Structured error codes for filer entry operations.
// Values are stable — do not reorder or reuse numbers.
enum FilerError {
OK = 0;
ENTRY_NAME_TOO_LONG = 1; // name exceeds max_file_name_length
PARENT_IS_FILE = 2; // parent path component is a file, not a directory
EXISTING_IS_DIRECTORY = 3; // cannot overwrite directory with file
EXISTING_IS_FILE = 4; // cannot overwrite file with directory
ENTRY_ALREADY_EXISTS = 5; // O_EXCL and entry already exists
}
message CreateEntryResponse {
string error = 1; // kept for human readability + backward compat
SubscribeMetadataResponse metadata_event = 2;
FilerError error_code = 3; // machine-readable error code
}
message UpdateEntryRequest {
string directory = 1;
Entry entry = 2;
bool is_from_other_cluster = 3;
repeated int32 signatures = 4;
map<string, bytes> expected_extended = 5;
}
message UpdateEntryResponse {
SubscribeMetadataResponse metadata_event = 1;
}
message AppendToEntryRequest {
string directory = 1;
string entry_name = 2;
repeated FileChunk chunks = 3;
}
message AppendToEntryResponse {
}
message DeleteEntryRequest {
string directory = 1;
string name = 2;
// bool is_directory = 3;
bool is_delete_data = 4;
bool is_recursive = 5;
bool ignore_recursive_error = 6;
bool is_from_other_cluster = 7;
repeated int32 signatures = 8;
int64 if_not_modified_after = 9;
}
message DeleteEntryResponse {
string error = 1;
SubscribeMetadataResponse metadata_event = 2;
}
message AtomicRenameEntryRequest {
string old_directory = 1;
string old_name = 2;
string new_directory = 3;
string new_name = 4;
repeated int32 signatures = 5;
}
message AtomicRenameEntryResponse {
}
message StreamRenameEntryRequest {
string old_directory = 1;
string old_name = 2;
string new_directory = 3;
string new_name = 4;
repeated int32 signatures = 5;
}
message StreamRenameEntryResponse {
string directory = 1;
EventNotification event_notification = 2;
int64 ts_ns = 3;
}
message AssignVolumeRequest {
int32 count = 1;
string collection = 2;
string replication = 3;
int32 ttl_sec = 4;
string data_center = 5;
string path = 6;
string rack = 7;
string data_node = 9;
string disk_type = 8;
}
message AssignVolumeResponse {
string file_id = 1;
int32 count = 4;
string auth = 5;
string collection = 6;
string replication = 7;
string error = 8;
Location location = 9;
}
message LookupVolumeRequest {
repeated string volume_ids = 1;
}
message Locations {
repeated Location locations = 1;
}
message Location {
string url = 1;
string public_url = 2;
uint32 grpc_port = 3;
string data_center = 4;
}
message LookupVolumeResponse {
map<string, Locations> locations_map = 1;
}
message Collection {
string name = 1;
}
message CollectionListRequest {
bool include_normal_volumes = 1;
bool include_ec_volumes = 2;
}
message CollectionListResponse {
repeated Collection collections = 1;
}
message DeleteCollectionRequest {
string collection = 1;
}
message DeleteCollectionResponse {
}
message StatisticsRequest {
string replication = 1;
string collection = 2;
string ttl = 3;
string disk_type = 4;
}
message StatisticsResponse {
uint64 total_size = 4;
uint64 used_size = 5;
uint64 file_count = 6;
}
message PingRequest {
string target = 1; // default to ping itself
string target_type = 2;
}
message PingResponse {
int64 start_time_ns = 1;
int64 remote_time_ns = 2;
int64 stop_time_ns = 3;
}
message GetFilerConfigurationRequest {
}
message GetFilerConfigurationResponse {
repeated string masters = 1;
string replication = 2;
string collection = 3;
uint32 max_mb = 4;
string dir_buckets = 5;
bool cipher = 7;
int32 signature = 8;
string metrics_address = 9;
int32 metrics_interval_sec = 10;
string version = 11;
string cluster_id = 12;
string filer_group = 13;
int32 major_version = 14;
int32 minor_version = 15;
}
message SubscribeMetadataRequest {
string client_name = 1;
string path_prefix = 2;
int64 since_ns = 3;
int32 signature = 4;
repeated string path_prefixes = 6;
int32 client_id = 7;
int64 until_ns = 8;
int32 client_epoch = 9;
repeated string directories = 10; // exact directory to watch
bool client_supports_batching = 11; // client can unpack SubscribeMetadataResponse.events
bool client_supports_metadata_chunks = 12; // client can read log file chunks from volume servers
}
message SubscribeMetadataResponse {
string directory = 1;
EventNotification event_notification = 2;
int64 ts_ns = 3;
repeated SubscribeMetadataResponse events = 4; // batch of additional events (backlog catch-up)
repeated LogFileChunkRef log_file_refs = 5; // log file chunk refs for client direct-read
}
// A persisted log file that the client can read directly from volume servers.
// The file format is: [4-byte size | protobuf LogEntry] repeated.
// Each LogEntry.Data contains a marshaled SubscribeMetadataResponse.
message LogFileChunkRef {
repeated FileChunk chunks = 1; // chunk references (fids) to read from volume servers
int64 file_ts_ns = 2; // minute-level timestamp of the log file
string filer_id = 3; // filer signature suffix from log filename
}
message TraverseBfsMetadataRequest {
string directory = 1;
repeated string excluded_prefixes = 2;
}
message TraverseBfsMetadataResponse {
string directory = 1;
Entry entry = 2;
}
message LogEntry {
int64 ts_ns = 1;
int32 partition_key_hash = 2;
bytes data = 3;
bytes key = 4;
int64 offset = 5; // Sequential offset within partition
}
message KeepConnectedRequest {
string name = 1;
uint32 grpc_port = 2;
repeated string resources = 3;
}
message KeepConnectedResponse {
}
message LocateBrokerRequest {
string resource = 1;
}
message LocateBrokerResponse {
bool found = 1;
// if found, send the exact address
// if not found, send the full list of existing brokers
message Resource {
string grpc_addresses = 1;
int32 resource_count = 2;
}
repeated Resource resources = 2;
}
/////////////////////////
// Key-Value operations
/////////////////////////
message KvGetRequest {
bytes key = 1;
}
message KvGetResponse {
bytes value = 1;
string error = 2;
}
message KvPutRequest {
bytes key = 1;
bytes value = 2;
}
message KvPutResponse {
string error = 1;
}
/////////////////////////
// path-based configurations
/////////////////////////
message FilerConf {
int32 version = 1;
message PathConf {
string location_prefix = 1;
string collection = 2;
string replication = 3;
string ttl = 4;
string disk_type = 5;
bool fsync = 6;
uint32 volume_growth_count = 7;
bool read_only = 8;
string data_center = 9;
string rack = 10;
string data_node = 11;
uint32 max_file_name_length = 12;
bool disable_chunk_deletion = 13;
bool worm = 14;
uint64 worm_grace_period_seconds = 15;
uint64 worm_retention_time_seconds = 16;
}
repeated PathConf locations = 2;
}
/////////////////////////
// Remote Storage related
/////////////////////////
message CacheRemoteObjectToLocalClusterRequest {
string directory = 1;
string name = 2;
int32 chunk_concurrency = 3; // parallel chunk downloads per file, 0 = default (8)
int32 download_concurrency = 4; // multipart download concurrency per chunk (if supported by remote storage), 0 = default (5 for S3)
}
message CacheRemoteObjectToLocalClusterResponse {
Entry entry = 1;
SubscribeMetadataResponse metadata_event = 2;
}
/////////////////////////
// distributed lock management
/////////////////////////
message LockRequest {
string name = 1;
int64 seconds_to_lock = 2;
string renew_token = 3;
bool is_moved = 4;
string owner = 5;
}
message LockResponse {
string renew_token = 1;
string lock_owner = 2;
string lock_host_moved_to = 3;
string error = 4;
int64 generation = 5;
}
message UnlockRequest {
string name = 1;
string renew_token = 2;
bool is_moved = 3;
}
message UnlockResponse {
string error = 1;
string moved_to = 2;
}
message FindLockOwnerRequest {
string name = 1;
bool is_moved = 2;
}
message FindLockOwnerResponse {
string owner = 1;
}
message Lock {
string name = 1;
string renew_token = 2;
int64 expired_at_ns = 3;
string owner = 4;
int64 generation = 5;
bool is_backup = 6;
int64 seq = 7;
}
message TransferLocksRequest {
repeated Lock locks = 1;
}
message TransferLocksResponse {
}
message ReplicateLockRequest {
string name = 1;
string renew_token = 2;
int64 expired_at_ns = 3;
string owner = 4;
int64 generation = 5;
bool is_unlock = 6;
int64 seq = 7;
}
message ReplicateLockResponse {
}
//////////////////////////////////////////////////
// StreamMutateEntry: ordered bidirectional streaming for all filer mutations.
// All create/update/delete/rename operations from a single mount go through
// one stream, preserving mutation ordering and eliminating per-request
// connection overhead.
message StreamMutateEntryRequest {
uint64 request_id = 1;
oneof request {
CreateEntryRequest create_request = 2;
UpdateEntryRequest update_request = 3;
DeleteEntryRequest delete_request = 4;
StreamRenameEntryRequest rename_request = 5;
}
}
message StreamMutateEntryResponse {
uint64 request_id = 1;
bool is_last = 2; // always true except for rename, which sends multiple events
oneof response {
CreateEntryResponse create_response = 3;
UpdateEntryResponse update_response = 4;
DeleteEntryResponse delete_response = 5;
StreamRenameEntryResponse rename_response = 6;
}
string error = 7; // human-readable error message when the operation failed
int32 errno = 8; // POSIX errno (e.g. ENOENT=2, ENOTEMPTY=66) for direct FUSE status mapping
}