mount: improve small file write performance (#8769)

* mount: defer file creation gRPC to flush time for faster small file writes

When creating a file via FUSE Create(), skip the synchronous gRPC
CreateEntry call to the filer. Instead, allocate the inode and build
the entry locally, deferring the filer create to the Flush/Release path
where flushMetadataToFiler already sends a CreateEntry with chunk data.

This eliminates one synchronous gRPC round-trip per created file.
For workloads with many small files (e.g., 30K files), this reduces
per-file overhead from ~2 gRPC calls to ~1.

Mknod retains synchronous filer creation since it has no file handle
and thus no flush path.
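To make the call-count arithmetic concrete, here is a minimal Go sketch comparing eager and deferred creation. It is a toy model, not the mount code: `filerClient`, `fileHandle`, `createEager`, `createDeferred` and `flush` are hypothetical names, and the filer is reduced to a counter of CreateEntry calls. Only the idea that the flush path sends a single CreateEntry carrying chunk data comes from this commit (flushMetadataToFiler).

```go
package main

import "fmt"

// filerClient is a hypothetical stand-in for the filer gRPC client.
// It only records CreateEntry calls so the two strategies can be compared.
type filerClient struct {
	createCalls int
	entries     map[string][]string // path -> chunk IDs
}

func (f *filerClient) CreateEntry(path string, chunks []string) {
	f.createCalls++
	f.entries[path] = append([]string(nil), chunks...)
}

// fileHandle (hypothetical) holds the locally built entry until flush.
type fileHandle struct {
	path   string
	chunks []string
}

// createEager models the old behaviour: a CreateEntry at Create() time,
// followed by another CreateEntry with chunk data at flush time.
func createEager(f *filerClient, path string) *fileHandle {
	f.CreateEntry(path, nil)
	return &fileHandle{path: path}
}

// createDeferred models the new behaviour: the entry is only built locally.
func createDeferred(path string) *fileHandle {
	return &fileHandle{path: path}
}

// flush sends the entry together with its chunks in one call.
func flush(f *filerClient, h *fileHandle) {
	f.CreateEntry(h.path, h.chunks)
}

func main() {
	eager := &filerClient{entries: map[string][]string{}}
	h := createEager(eager, "/a.txt")
	h.chunks = append(h.chunks, "chunk-1")
	flush(eager, h)

	deferred := &filerClient{entries: map[string][]string{}}
	h = createDeferred("/a.txt")
	h.chunks = append(h.chunks, "chunk-1")
	flush(deferred, h)

	fmt.Println(eager.createCalls, deferred.createCalls) // 2 1
}
```

Mknod has no handle to flush, so in this model it would still need the immediate CreateEntry, which matches why the commit leaves it synchronous.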

* mount: use bounded worker pool for async flush operations

Replace unbounded goroutine spawning in writebackCache async flush
with a fixed-size worker pool backed by a channel. When many files
are closed rapidly (e.g., cp -r of 30K files), the previous approach
spawned one goroutine per file, leading to resource contention on
gRPC/HTTP connections and high goroutine overhead.

The worker pool size matches ConcurrentWriters, falling back to 128
workers when ConcurrentWriters is not positive. This provides good
parallelism while bounding resource usage. Work items are queued into
a buffered channel and processed by persistent worker goroutines.

* mount: fix deferred create cache visibility and async flush race

Three fixes for the deferred create and async flush changes:

1. Insert a local placeholder entry into the metadata cache during
   deferred file creation so that maybeLoadEntry() can find the file
   for duplicate-create checks, stat, and readdir. It uses InsertEntry
   directly (not applyLocalMetadataEvent) to avoid triggering the
   directory hot-threshold eviction that would wipe the entry.

2. Fix race in ReleaseHandle where asyncFlushWg.Add(1) and the
   channel send happened after pendingAsyncFlushMu was unlocked.
   A concurrent WaitForAsyncFlush could observe a zero counter,
   close the channel, and cause a send-on-closed panic. Move Add(1)
   before the unlock; keep the send after unlock to avoid deadlock
   with workers that acquire the same mutex during cleanup.

3. Update TestCreateCreatesAndOpensFile to flush the file handle
   before verifying the CreateEntry gRPC call, since file creation
   is now deferred to flush time.
Chris Lu
2026-03-24 20:31:53 -07:00
committed by GitHub
parent 28fe92065a
commit e47054a7e7
5 changed files with 101 additions and 28 deletions

@@ -119,10 +119,15 @@ type WFS struct {
 	dirHotThreshold int
 	dirIdleEvict time.Duration
-	// asyncFlushWg tracks pending background flush goroutines for writebackCache mode.
+	// asyncFlushWg tracks pending background flush work items for writebackCache mode.
 	// Must be waited on before unmount cleanup to prevent data loss.
 	asyncFlushWg sync.WaitGroup
+	// asyncFlushCh is a bounded work queue for background flush operations.
+	// A fixed pool of worker goroutines processes items from this channel,
+	// preventing resource exhaustion from unbounded goroutine creation.
+	asyncFlushCh chan *asyncFlushItem
 	// pendingAsyncFlush tracks in-flight async flush goroutines by inode.
 	// AcquireHandle checks this to wait for a pending flush before reopening
 	// the same inode, preventing stale metadata from overwriting the async flush.
@@ -277,6 +282,13 @@ func NewSeaweedFileSystem(option *Option) *WFS {
 		wfs.concurrentWriters = util.NewLimitedConcurrentExecutor(wfs.option.ConcurrentWriters)
 		wfs.concurrentCopiersSem = make(chan struct{}, wfs.option.ConcurrentWriters)
 	}
+	if wfs.option.WritebackCache {
+		numWorkers := wfs.option.ConcurrentWriters
+		if numWorkers <= 0 {
+			numWorkers = 128
+		}
+		wfs.startAsyncFlushWorkers(numWorkers)
+	}
 	wfs.copyBufferPool.New = func() any {
 		return make([]byte, option.ChunkSizeLimit)
 	}