Files
seaweedFS/weed/pb/filer.proto
Chris Lu c2c58419b8 filer.sync: send log file chunk fids to clients for direct volume server reads (#8792)
* filer.sync: send log file chunk fids to clients for direct volume server reads

Instead of the server reading persisted log files from volume servers, parsing
entries, and streaming them over gRPC (serial bottleneck), clients that opt in
via client_supports_metadata_chunks receive log file chunk references (fids)
and read directly from volume servers in parallel.

New proto messages:
- LogFileChunkRef: chunk fids + timestamp + filer ID for one log file
- SubscribeMetadataRequest.client_supports_metadata_chunks: client opt-in
- SubscribeMetadataResponse.log_file_refs: server sends refs during backlog

Server changes:
- CollectLogFileRefs: lists log files and returns chunk refs without any
  volume server I/O (metadata-only operation)
- SubscribeMetadata/SubscribeLocalMetadata: when client opts in, sends refs
  during persisted log phase, then falls back to normal streaming for
  in-memory events

Client changes:
- ReadLogFileRefs: reads log files from volume servers, parses entries,
  filters by path prefix, invokes processEventFn
- MetadataFollowOption.LogFileReaderFn: factory for chunk readers,
  enables metadata chunks when non-nil
- Both filer_pb_tail.go and meta_aggregator.go recv loops accumulate
  refs then process them at the disk→memory transition

Backward compatible: old clients don't set the flag, get existing behavior.

Ref: #8771

* filer.sync: merge entries across filers in timestamp order on client side

ReadLogFileRefs now groups refs by filer ID and merges entries from
multiple filers using a min-heap priority queue — the same algorithm
the server uses in OrderedLogVisitor + LogEntryItemPriorityQueue.

This ensures events are processed in correct timestamp order even when
log files from different filers have interleaved timestamps. Single-filer
case takes the fast path (no heap allocation).

* filer.sync: integration tests for direct-read metadata chunks

Three test categories:

1. Merge correctness (TestReadLogFileRefsMergeOrder):
   Verifies entries from 3 filers are delivered in strict timestamp order,
   matching the server-side OrderedLogVisitor guarantee.

2. Path filtering (TestReadLogFileRefsPathFilter):
   Verifies client-side path prefix filtering works correctly.

3. Throughput comparison (TestDirectReadVsServerSideThroughput):
   3 filers × 7 files × 300 events = 6300 events, 2ms per file read:

   server-side:  6300 events  218ms   28,873 events/sec
   direct-read:  6300 events   51ms  123,566 events/sec  (4.3x)
   parallel:     6300 events   17ms  378,628 events/sec  (13.1x)

   Direct-read eliminates gRPC send overhead per event (4.3x).
   Parallel per-filer reading eliminates serial file I/O (13.1x).

* filer.sync: parallel per-filer reads with prefetching in ReadLogFileRefs

ReadLogFileRefs now has two levels of I/O overlap:

1. Cross-filer parallelism: one goroutine per filer reads its files
   concurrently. Entries feed into per-filer channels, merged by the
   main goroutine via min-heap (same ordering guarantee as the server's
   OrderedLogVisitor).

2. Within-filer prefetching: while the current file's entries are being
   consumed by the merge heap, the next file is already being read from
   the volume server in a background goroutine.

Single-filer fast path avoids the heap and channels.

Test results (3 filers × 7 files × 300 events, 2ms per file read):

  server-side sequential:  6300 events  212ms   29,760 events/sec
  parallel + prefetch:     6300 events   36ms  177,443 events/sec
  Speedup: 6.0x

* filer.sync: address all review comments on metadata chunks PR

Critical fixes:
- sendLogFileRefs: bypass pipelinedSender, send directly on gRPC stream.
  Ref messages have TsNs=0 and were being incorrectly batched into the
  Events field by the adaptive batching logic, corrupting ref delivery.
- readLogFileEntries: use io.ReadFull instead of reader.Read to prevent
  partial reads from corrupting size values or protobuf data.
- Error handling: only skip chunk-not-found errors (matching server-side
  isChunkNotFoundError). Other I/O or decode failures are propagated so
  the follower can retry.

High-priority fixes:
- CollectLogFileRefs: remove incorrect +24h padding from stopTime. The
  extra day caused unnecessary log file refs to be collected.
- Path filtering: ReadLogFileRefs now accepts PathFilter struct with
  PathPrefix, AdditionalPathPrefixes, and DirectoriesToWatch. Uses
  util.Join for path construction (avoids "//foo" on root). Excludes
  /.system/log/ internal entries. Matches server-side
  eachEventNotificationFn filtering logic.

Medium-priority fixes:
- CollectLogFileRefs: accept context.Context, propagate to
  ListDirectoryEntries calls for cancellation support.
- NewChunkStreamReaderFromLookup: accept context.Context, propagate to
  doNewChunkStreamReader.

Test fixes:
- Check error returns from ReadLogFileRefs in all test call sites.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-27 11:01:29 -07:00

582 lines
15 KiB
Protocol Buffer

syntax = "proto3";
package filer_pb;
option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/filer_pb";
option java_package = "seaweedfs.client";
option java_outer_classname = "FilerProto";
//////////////////////////////////////////////////
service SeaweedFiler {
rpc LookupDirectoryEntry (LookupDirectoryEntryRequest) returns (LookupDirectoryEntryResponse) {
}
rpc ListEntries (ListEntriesRequest) returns (stream ListEntriesResponse) {
}
rpc CreateEntry (CreateEntryRequest) returns (CreateEntryResponse) {
}
rpc UpdateEntry (UpdateEntryRequest) returns (UpdateEntryResponse) {
}
rpc AppendToEntry (AppendToEntryRequest) returns (AppendToEntryResponse) {
}
rpc DeleteEntry (DeleteEntryRequest) returns (DeleteEntryResponse) {
}
rpc AtomicRenameEntry (AtomicRenameEntryRequest) returns (AtomicRenameEntryResponse) {
}
rpc StreamRenameEntry (StreamRenameEntryRequest) returns (stream StreamRenameEntryResponse) {
}
rpc StreamMutateEntry (stream StreamMutateEntryRequest) returns (stream StreamMutateEntryResponse) {
}
rpc AssignVolume (AssignVolumeRequest) returns (AssignVolumeResponse) {
}
rpc LookupVolume (LookupVolumeRequest) returns (LookupVolumeResponse) {
}
rpc CollectionList (CollectionListRequest) returns (CollectionListResponse) {
}
rpc DeleteCollection (DeleteCollectionRequest) returns (DeleteCollectionResponse) {
}
rpc Statistics (StatisticsRequest) returns (StatisticsResponse) {
}
rpc Ping (PingRequest) returns (PingResponse) {
}
rpc GetFilerConfiguration (GetFilerConfigurationRequest) returns (GetFilerConfigurationResponse) {
}
rpc TraverseBfsMetadata (TraverseBfsMetadataRequest) returns (stream TraverseBfsMetadataResponse) {
}
rpc SubscribeMetadata (SubscribeMetadataRequest) returns (stream SubscribeMetadataResponse) {
}
rpc SubscribeLocalMetadata (SubscribeMetadataRequest) returns (stream SubscribeMetadataResponse) {
}
rpc KvGet (KvGetRequest) returns (KvGetResponse) {
}
rpc KvPut (KvPutRequest) returns (KvPutResponse) {
}
rpc CacheRemoteObjectToLocalCluster (CacheRemoteObjectToLocalClusterRequest) returns (CacheRemoteObjectToLocalClusterResponse) {
}
rpc DistributedLock(LockRequest) returns (LockResponse) {
}
rpc DistributedUnlock(UnlockRequest) returns (UnlockResponse) {
}
rpc FindLockOwner(FindLockOwnerRequest) returns (FindLockOwnerResponse) {
}
// distributed lock management internal use only
rpc TransferLocks(TransferLocksRequest) returns (TransferLocksResponse) {
}
}
//////////////////////////////////////////////////
message LookupDirectoryEntryRequest {
string directory = 1;
string name = 2;
}
message LookupDirectoryEntryResponse {
Entry entry = 1;
}
message ListEntriesRequest {
string directory = 1;
string prefix = 2;
string startFromFileName = 3;
bool inclusiveStartFrom = 4;
uint32 limit = 5;
int64 snapshot_ts_ns = 6;
}
message ListEntriesResponse {
Entry entry = 1;
int64 snapshot_ts_ns = 2;
}
message RemoteEntry {
string storage_name = 1;
int64 last_local_sync_ts_ns = 2;
string remote_e_tag = 3;
int64 remote_mtime = 4;
int64 remote_size = 5;
}
message Entry {
string name = 1;
bool is_directory = 2;
repeated FileChunk chunks = 3;
FuseAttributes attributes = 4;
map<string, bytes> extended = 5;
bytes hard_link_id = 7;
int32 hard_link_counter = 8; // only exists in hard link meta data
bytes content = 9; // if not empty, the file content
RemoteEntry remote_entry = 10;
int64 quota = 11; // for bucket only. Positive/Negative means enabled/disabled.
int64 worm_enforced_at_ts_ns = 12;
}
message FullEntry {
string dir = 1;
Entry entry = 2;
}
message EventNotification {
Entry old_entry = 1;
Entry new_entry = 2;
bool delete_chunks = 3;
string new_parent_path = 4;
bool is_from_other_cluster = 5;
repeated int32 signatures = 6;
}
enum SSEType {
NONE = 0; // No server-side encryption
SSE_C = 1; // Server-Side Encryption with Customer-Provided Keys
SSE_KMS = 2; // Server-Side Encryption with KMS-Managed Keys
SSE_S3 = 3; // Server-Side Encryption with S3-Managed Keys
}
message FileChunk {
string file_id = 1; // to be deprecated
int64 offset = 2;
uint64 size = 3;
int64 modified_ts_ns = 4;
string e_tag = 5;
string source_file_id = 6; // to be deprecated
FileId fid = 7;
FileId source_fid = 8;
bytes cipher_key = 9;
bool is_compressed = 10;
bool is_chunk_manifest = 11; // content is a list of FileChunks
SSEType sse_type = 12; // Server-side encryption type
bytes sse_metadata = 13; // Serialized SSE metadata for this chunk (SSE-C, SSE-KMS, or SSE-S3)
}
message FileChunkManifest {
repeated FileChunk chunks = 1;
}
message FileId {
uint32 volume_id = 1;
uint64 file_key = 2;
fixed32 cookie = 3;
}
message FuseAttributes {
uint64 file_size = 1;
int64 mtime = 2; // unix time in seconds
uint32 file_mode = 3;
uint32 uid = 4;
uint32 gid = 5;
int64 crtime = 6; // unix time in seconds
string mime = 7;
int32 ttl_sec = 10;
string user_name = 11; // for hdfs
repeated string group_name = 12; // for hdfs
string symlink_target = 13;
bytes md5 = 14;
uint32 rdev = 16;
uint64 inode = 17;
}
message CreateEntryRequest {
string directory = 1;
Entry entry = 2;
bool o_excl = 3;
bool is_from_other_cluster = 4;
repeated int32 signatures = 5;
bool skip_check_parent_directory = 6;
}
// Structured error codes for filer entry operations.
// Values are stable — do not reorder or reuse numbers.
enum FilerError {
OK = 0;
ENTRY_NAME_TOO_LONG = 1; // name exceeds max_file_name_length
PARENT_IS_FILE = 2; // parent path component is a file, not a directory
EXISTING_IS_DIRECTORY = 3; // cannot overwrite directory with file
EXISTING_IS_FILE = 4; // cannot overwrite file with directory
ENTRY_ALREADY_EXISTS = 5; // O_EXCL and entry already exists
}
message CreateEntryResponse {
string error = 1; // kept for human readability + backward compat
SubscribeMetadataResponse metadata_event = 2;
FilerError error_code = 3; // machine-readable error code
}
message UpdateEntryRequest {
string directory = 1;
Entry entry = 2;
bool is_from_other_cluster = 3;
repeated int32 signatures = 4;
map<string, bytes> expected_extended = 5;
}
message UpdateEntryResponse {
SubscribeMetadataResponse metadata_event = 1;
}
message AppendToEntryRequest {
string directory = 1;
string entry_name = 2;
repeated FileChunk chunks = 3;
}
message AppendToEntryResponse {
}
message DeleteEntryRequest {
string directory = 1;
string name = 2;
// bool is_directory = 3;
bool is_delete_data = 4;
bool is_recursive = 5;
bool ignore_recursive_error = 6;
bool is_from_other_cluster = 7;
repeated int32 signatures = 8;
int64 if_not_modified_after = 9;
}
message DeleteEntryResponse {
string error = 1;
SubscribeMetadataResponse metadata_event = 2;
}
message AtomicRenameEntryRequest {
string old_directory = 1;
string old_name = 2;
string new_directory = 3;
string new_name = 4;
repeated int32 signatures = 5;
}
message AtomicRenameEntryResponse {
}
message StreamRenameEntryRequest {
string old_directory = 1;
string old_name = 2;
string new_directory = 3;
string new_name = 4;
repeated int32 signatures = 5;
}
message StreamRenameEntryResponse {
string directory = 1;
EventNotification event_notification = 2;
int64 ts_ns = 3;
}
message AssignVolumeRequest {
int32 count = 1;
string collection = 2;
string replication = 3;
int32 ttl_sec = 4;
string data_center = 5;
string path = 6;
string rack = 7;
string data_node = 9;
string disk_type = 8;
}
message AssignVolumeResponse {
string file_id = 1;
int32 count = 4;
string auth = 5;
string collection = 6;
string replication = 7;
string error = 8;
Location location = 9;
}
message LookupVolumeRequest {
repeated string volume_ids = 1;
}
message Locations {
repeated Location locations = 1;
}
message Location {
string url = 1;
string public_url = 2;
uint32 grpc_port = 3;
string data_center = 4;
}
message LookupVolumeResponse {
map<string, Locations> locations_map = 1;
}
message Collection {
string name = 1;
}
message CollectionListRequest {
bool include_normal_volumes = 1;
bool include_ec_volumes = 2;
}
message CollectionListResponse {
repeated Collection collections = 1;
}
message DeleteCollectionRequest {
string collection = 1;
}
message DeleteCollectionResponse {
}
message StatisticsRequest {
string replication = 1;
string collection = 2;
string ttl = 3;
string disk_type = 4;
}
message StatisticsResponse {
uint64 total_size = 4;
uint64 used_size = 5;
uint64 file_count = 6;
}
message PingRequest {
string target = 1; // default to ping itself
string target_type = 2;
}
message PingResponse {
int64 start_time_ns = 1;
int64 remote_time_ns = 2;
int64 stop_time_ns = 3;
}
message GetFilerConfigurationRequest {
}
message GetFilerConfigurationResponse {
repeated string masters = 1;
string replication = 2;
string collection = 3;
uint32 max_mb = 4;
string dir_buckets = 5;
bool cipher = 7;
int32 signature = 8;
string metrics_address = 9;
int32 metrics_interval_sec = 10;
string version = 11;
string cluster_id = 12;
string filer_group = 13;
int32 major_version = 14;
int32 minor_version = 15;
}
message SubscribeMetadataRequest {
string client_name = 1;
string path_prefix = 2;
int64 since_ns = 3;
int32 signature = 4;
repeated string path_prefixes = 6;
int32 client_id = 7;
int64 until_ns = 8;
int32 client_epoch = 9;
repeated string directories = 10; // exact directory to watch
bool client_supports_batching = 11; // client can unpack SubscribeMetadataResponse.events
bool client_supports_metadata_chunks = 12; // client can read log file chunks from volume servers
}
message SubscribeMetadataResponse {
string directory = 1;
EventNotification event_notification = 2;
int64 ts_ns = 3;
repeated SubscribeMetadataResponse events = 4; // batch of additional events (backlog catch-up)
repeated LogFileChunkRef log_file_refs = 5; // log file chunk refs for client direct-read
}
// A persisted log file that the client can read directly from volume servers.
// The file format is: [4-byte size | protobuf LogEntry] repeated.
// Each LogEntry.Data contains a marshaled SubscribeMetadataResponse.
message LogFileChunkRef {
repeated FileChunk chunks = 1; // chunk references (fids) to read from volume servers
int64 file_ts_ns = 2; // minute-level timestamp of the log file
string filer_id = 3; // filer signature suffix from log filename
}
message TraverseBfsMetadataRequest {
string directory = 1;
repeated string excluded_prefixes = 2;
}
message TraverseBfsMetadataResponse {
string directory = 1;
Entry entry = 2;
}
message LogEntry {
int64 ts_ns = 1;
int32 partition_key_hash = 2;
bytes data = 3;
bytes key = 4;
int64 offset = 5; // Sequential offset within partition
}
message KeepConnectedRequest {
string name = 1;
uint32 grpc_port = 2;
repeated string resources = 3;
}
message KeepConnectedResponse {
}
message LocateBrokerRequest {
string resource = 1;
}
message LocateBrokerResponse {
bool found = 1;
// if found, send the exact address
// if not found, send the full list of existing brokers
message Resource {
string grpc_addresses = 1;
int32 resource_count = 2;
}
repeated Resource resources = 2;
}
/////////////////////////
// Key-Value operations
/////////////////////////
message KvGetRequest {
bytes key = 1;
}
message KvGetResponse {
bytes value = 1;
string error = 2;
}
message KvPutRequest {
bytes key = 1;
bytes value = 2;
}
message KvPutResponse {
string error = 1;
}
/////////////////////////
// path-based configurations
/////////////////////////
message FilerConf {
int32 version = 1;
message PathConf {
string location_prefix = 1;
string collection = 2;
string replication = 3;
string ttl = 4;
string disk_type = 5;
bool fsync = 6;
uint32 volume_growth_count = 7;
bool read_only = 8;
string data_center = 9;
string rack = 10;
string data_node = 11;
uint32 max_file_name_length = 12;
bool disable_chunk_deletion = 13;
bool worm = 14;
uint64 worm_grace_period_seconds = 15;
uint64 worm_retention_time_seconds = 16;
}
repeated PathConf locations = 2;
}
/////////////////////////
// Remote Storage related
/////////////////////////
message CacheRemoteObjectToLocalClusterRequest {
string directory = 1;
string name = 2;
int32 chunk_concurrency = 3; // parallel chunk downloads per file, 0 = default (8)
int32 download_concurrency = 4; // multipart download concurrency per chunk (if supported by remote storage), 0 = default (5 for S3)
}
message CacheRemoteObjectToLocalClusterResponse {
Entry entry = 1;
SubscribeMetadataResponse metadata_event = 2;
}
/////////////////////////
// distributed lock management
/////////////////////////
message LockRequest {
string name = 1;
int64 seconds_to_lock = 2;
string renew_token = 3;
bool is_moved = 4;
string owner = 5;
}
message LockResponse {
string renew_token = 1;
string lock_owner = 2;
string lock_host_moved_to = 3;
string error = 4;
}
message UnlockRequest {
string name = 1;
string renew_token = 2;
bool is_moved = 3;
}
message UnlockResponse {
string error = 1;
string moved_to = 2;
}
message FindLockOwnerRequest {
string name = 1;
bool is_moved = 2;
}
message FindLockOwnerResponse {
string owner = 1;
}
message Lock {
string name = 1;
string renew_token = 2;
int64 expired_at_ns = 3;
string owner = 4;
}
message TransferLocksRequest {
repeated Lock locks = 1;
}
message TransferLocksResponse {
}
//////////////////////////////////////////////////
// StreamMutateEntry: ordered bidirectional streaming for all filer mutations.
// All create/update/delete/rename operations from a single mount go through
// one stream, preserving mutation ordering and eliminating per-request
// connection overhead.
message StreamMutateEntryRequest {
uint64 request_id = 1;
oneof request {
CreateEntryRequest create_request = 2;
UpdateEntryRequest update_request = 3;
DeleteEntryRequest delete_request = 4;
StreamRenameEntryRequest rename_request = 5;
}
}
message StreamMutateEntryResponse {
uint64 request_id = 1;
bool is_last = 2; // always true except for rename, which sends multiple events
oneof response {
CreateEntryResponse create_response = 3;
UpdateEntryResponse update_response = 4;
DeleteEntryResponse delete_response = 5;
StreamRenameEntryResponse rename_response = 6;
}
string error = 7; // human-readable error message when the operation failed
int32 errno = 8; // POSIX errno (e.g. ENOENT=2, ENOTEMPTY=66) for direct FUSE status mapping
}