31 Commits

Author SHA1 Message Date
Chris Lu
9434d3733d mount: async flush on close() when writebackCache is enabled (#8727)
* mount: async flush on close() when writebackCache is enabled

When -writebackCache is enabled, defer data upload and metadata flush
from Flush() (triggered by close()) to a background goroutine in
Release(). This allows processes like rsync that write many small files
to proceed to the next file immediately instead of blocking on two
network round-trips (volume upload + filer metadata) per file.

Fixes #8718

* mount: add retry with backoff for async metadata flush

The metadata flush in completeAsyncFlush now retries up to 3 times
with exponential backoff (1s, 2s, 4s) on transient gRPC errors.
Since the chunk data is already safely on volume servers at this point,
only the filer metadata reference needs persisting — retrying is both
safe and effective.

Data flush (FlushData) is not retried externally because
UploadWithRetry already handles transient HTTP/gRPC errors internally;
if it still fails, the chunk memory has already been freed, so there is
nothing left to re-upload.

* test: add integration tests for writebackCache async flush

Add comprehensive FUSE integration tests for the writebackCache
async flush feature (issue #8718):

- Basic operations: write/read, sequential files, large files, empty
  files, overwrites
- Fsync correctness: fsync forces synchronous flush even in writeback
  mode, immediate read-after-fsync
- Concurrent small files: multi-worker parallel writes (rsync-like
  workload), multi-directory, rapid create/close
- Data integrity: append after close, partial writes, file size
  correctness, binary data preservation
- Performance comparison: writeback vs synchronous flush throughput
- Stress test: 16 workers x 100 files with content verification
- Mixed concurrent operations: reads, writes, creates running together

Also fix pre-existing test infrastructure issues:
- Rename framework.go to framework_test.go (fixes Go package conflict)
- Fix undefined totalSize variable in concurrent_operations_test.go

* ci: update fuse-integration workflow to run full test suite

The workflow previously only ran placeholder tests (simple_test.go,
working_demo_test.go) in a temp directory due to a Go module conflict.
Now that framework.go is renamed to framework_test.go, the full test
suite compiles and runs correctly from test/fuse_integration/.

Changes:
- Run go test directly in test/fuse_integration/ (no temp dir copy)
- Install weed binary to /usr/local/bin for test framework discovery
- Configure /etc/fuse.conf with user_allow_other for FUSE mounts
- Install fuse3 for modern FUSE support
- Stream test output to log file for artifact upload

* mount: fix three P1 races in async flush

P1-1: Reopen overwrites data still flushing in background
ReleaseByHandle removes the old handle from fhMap before the deferred
flush finishes. A reopen of the same inode during that window would
build from stale filer metadata, overwriting the async flush.

Fix: Track in-flight async flushes per inode via pendingAsyncFlush map.
AcquireHandle now calls waitForPendingAsyncFlush(inode) to block until
any pending flush completes before reading filer metadata.

P1-2: Deferred flush races rename and unlink after close
completeAsyncFlush captured the path once at entry, but rename or
unlink after close() could cause metadata to be written under the
wrong name or recreate a deleted file.

Fix: Re-resolve path from inode via GetPath right before metadata
flush. GetPath returns the current path (reflecting renames) or
ENOENT (if unlinked), in which case we skip the metadata flush.

P1-3: SIGINT/SIGTERM bypasses the async-flush drain
grace.OnInterrupt runs hooks then calls os.Exit(0), so
WaitForAsyncFlush after server.Serve() never executes on signal.

Fix: Add WaitForAsyncFlush (with 10s timeout) to the WFS interrupt
handler, before cache cleanup. The timeout prevents hanging on Ctrl-C
when the filer is unreachable.

* mount: fix P1 races — draining handle stays in fhMap

P1-1: Reopen TOCTOU
The gap between ReleaseByHandle removing from fhMap and
submitAsyncFlush registering in pendingAsyncFlush allowed a
concurrent AcquireHandle to slip through with stale metadata.

Fix: Hold pendingAsyncFlushMu across both the counter decrement
(ReleaseByHandle) and the pending registration. The handle is
registered as pending before the lock is released, so
waitForPendingAsyncFlush always sees it.

P1-2: Rename/unlink can't find draining handle
ReleaseByHandle deleted from fhMap immediately. Rename's
FindFileHandle(inode) at line 251 could not find the handle to
update entry.Name. Unlink could not coordinate either.

Fix: When asyncFlushPending is true, ReleaseByHandle/ReleaseByInode
leave the handle in fhMap (counter=0 but maps intact). The handle
stays visible to FindFileHandle so rename can update entry.Name.
completeAsyncFlush re-resolves the path from the inode (GetPath)
right before metadata flush for correctness after rename/unlink.
After drain, RemoveFileHandle cleans up the maps.

Double-return prevention: ReleaseByHandle/ReleaseByInode return nil
if counter is already <= 0, so Forget after Release doesn't start a
second drain goroutine.

P1-3: SIGINT deletes swap files under running goroutines
After the 10s timeout, os.RemoveAll deleted the write cache dir
(containing swap files) while FlushData goroutines were still
reading from them.

Fix: Increase timeout to 30s. If timeout expires, skip write cache
dir removal so in-flight goroutines can finish reading swap files.
The OS (or next mount) cleans them up. Read cache is always removed.

* mount: never skip metadata flush when Forget drops inode mapping

Forget removes the inode→path mapping when the kernel's lookup count
reaches zero, but this does NOT mean the file was unlinked — it only
means the kernel evicted its cache entry.  completeAsyncFlush was
treating GetPath failure as "file unlinked" and skipping the metadata
flush, which orphaned the just-uploaded chunks for live files.

Fix: Save dir and name at doFlush defer time.  In completeAsyncFlush,
try GetPath first to pick up renames; if the mapping is gone, fall
back to the saved dir/name.  Always attempt the metadata flush — the
filer is the authority on whether the file exists, not the local
inode cache.

* mount: distinguish Forget from Unlink in async flush path fallback

The saved-path fallback (from the previous fix) always flushed
metadata when GetPath failed, which recreated files that were
explicitly unlinked after close().  The same stale fallback could
recreate the pre-rename path if Forget dropped the inode mapping
after a rename.

Root cause: GetPath failure has two meanings:
  1. Forget — kernel evicted the cache entry (file still exists)
  2. Unlink — file was explicitly deleted (should not recreate)

Fix (three coordinated changes):

Unlink (weedfs_file_mkrm.go): Before RemovePath, look up the inode
and find any draining handle via FindFileHandle.  Set fh.isDeleted =
true so the async flush knows the file was explicitly removed.

Rename (weedfs_rename.go): When renaming a file with a draining
handle, update asyncFlushDir/asyncFlushName to the post-rename
location.  This keeps the saved-path fallback current so Forget
after rename doesn't flush to the old (pre-rename) path.

completeAsyncFlush (weedfs_async_flush.go): Check fh.isDeleted
first — if true, skip metadata flush (file was unlinked, chunks
become orphans for volume.fsck).  Otherwise, try GetPath for the
current path (renames); fall back to saved path if Forget dropped
the mapping (file is live, just evicted from kernel cache).
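
The three-way decision described above could look roughly like this. The struct and `resolveFlushTarget` are hypothetical; only the field names isDeleted, asyncFlushDir, and asyncFlushName come from the commit message, and `getPath` stands in for the inode-to-path lookup.

```go
package main

import "fmt"

// drainingHandle holds just the state this sketch needs.
type drainingHandle struct {
	isDeleted      bool
	asyncFlushDir  string
	asyncFlushName string
}

// resolveFlushTarget decides where, and whether, to flush metadata.
// getPath reports ok=false when the inode-to-path mapping is gone.
func resolveFlushTarget(h drainingHandle, getPath func() (dir, name string, ok bool)) (dir, name string, flush bool) {
	if h.isDeleted {
		return "", "", false // explicitly unlinked: do not recreate the file
	}
	if d, n, ok := getPath(); ok {
		return d, n, true // current path, reflects renames
	}
	// Mapping dropped by Forget: the file is still live, use the saved path.
	return h.asyncFlushDir, h.asyncFlushName, true
}

func main() {
	h := drainingHandle{asyncFlushDir: "/data", asyncFlushName: "a.txt"}
	forgotten := func() (string, string, bool) { return "", "", false }

	fmt.Println(resolveFlushTarget(h, forgotten)) // falls back to the saved path

	h.isDeleted = true
	fmt.Println(resolveFlushTarget(h, forgotten)) // skips the metadata flush
}
```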

* test/ci: address PR review nitpicks

concurrent_operations_test.go:
- Restore precise totalSize assertion instead of info.Size() > 0

writeback_cache_test.go:
- Check rand.Read errors in all 3 locations (lines 310, 512, 757)
- Check os.MkdirAll error in stress test (line 752)
- Remove dead verifyErrors variable (line 332)
- Replace both time.Sleep(5s) with polling via waitForFileContent
  to avoid flaky tests under CI load (lines 638, 700)

fuse-integration.yml:
- Add set -o pipefail so go test failures propagate through tee

* ci: fix fuse3/fuse package conflict on ubuntu-22.04 runner

fuse3 is pre-installed on ubuntu-22.04 runners and conflicts with
the legacy fuse package. Only install libfuse3-dev for the headers.

* mount/page_writer: remove debug println statements

Remove leftover debug println("read new data1/2") from
ReadDataAt in MemChunk and SwapFileChunk.

* test: fix findWeedBinary matching source directory instead of binary

findWeedBinary() matched ../../weed (the source directory) via
os.Stat before checking PATH, then tried to exec a directory
which fails with "permission denied" on the CI runner.

Fix: Check PATH first (reliable in CI where the binary is installed
to /usr/local/bin). For relative paths, verify the candidate is a
regular file (!info.IsDir()). Add ../../weed/weed as a candidate
for in-tree builds.

* test: fix framework — dynamic ports, output capture, data dirs

The integration test framework was failing in CI because:

1. All tests used hardcoded ports (19333/18080/18888), so sequential
   tests could conflict when prior processes hadn't fully released
   their ports yet.

2. Data subdirectories (data/master, data/volume) were not created
   before starting processes.

3. Master was started with -peers=none which is not a valid address.

4. Process stdout/stderr was not captured, making failures opaque
   ("service not ready within timeout" with no diagnostics).

5. The unmount fallback used 'umount' instead of 'fusermount -u'.

6. The mount used -cacheSizeMB (nonexistent) instead of
   -cacheCapacityMB and was missing -allowOthers=false for
   unprivileged CI runners.

Fixes:
- Dynamic port allocation via freePort() (net.Listen ":0")
- Explicit gRPC ports via -port.grpc to avoid default port conflicts
- Create data/master and data/volume directories in Setup()
- Remove invalid -peers=none and -raftBootstrap flags
- Capture process output to logDir/*.log via startProcess() helper
- dumpLog() prints tail of log file on service startup failure
- Use fusermount3/fusermount -u for unmount
- Fix mount flag names (-cacheCapacityMB, -allowOthers=false)

* test: remove explicit -port.grpc flags from test framework

SeaweedFS convention: gRPC port = HTTP port + 10000.  Volume and
filer discover the master gRPC port by this convention.  Setting
explicit -port.grpc on master/volume/filer broke inter-service
communication because the volume server computed master gRPC as
HTTP+10000 but the actual gRPC was on a different port.

Remove all -port.grpc flags and let the default convention work.
Dynamic HTTP ports already ensure uniqueness; the derived gRPC
ports (HTTP+10000) will also be unique.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-22 15:24:08 -07:00
Chris Lu
3f946fc0c0 mount: make metadata cache rebuilds snapshot-consistent (#8531)
* filer: expose metadata events and list snapshots

* mount: invalidate hot directory caches

* mount: read hot directories directly from filer

* mount: add sequenced metadata cache applier

* mount: apply metadata responses through cache applier

* mount: replay snapshot-consistent directory builds

* mount: dedupe self metadata events

* mount: factor directory build cleanup

* mount: replace proto marshal dedup with composite key and ring buffer

The dedup logic was doing a full deterministic proto.Marshal on every
metadata event just to produce a dedup key. Replace with a cheap
composite string key (TsNs|Directory|OldName|NewName).

Also replace the sliding-window slice (which leaked the backing array
unboundedly) with a fixed-size ring buffer that reuses the same array.
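
One way to combine a composite string key with a fixed-size ring buffer is sketched below. The `recentKeys` type is hypothetical, and pairing the ring with a lookup map is an assumption of this sketch; the commit message does not say how membership is checked.

```go
package main

import (
	"fmt"
	"strconv"
)

// dedupKey joins the fields named in the commit message into one string.
func dedupKey(tsNs int64, dir, oldName, newName string) string {
	return strconv.FormatInt(tsNs, 10) + "|" + dir + "|" + oldName + "|" + newName
}

// recentKeys remembers the last len(ring) keys. The ring slice is allocated
// once and reused, so the window never grows.
type recentKeys struct {
	ring  []string
	seen  map[string]struct{}
	next  int // slot that will be overwritten next (the oldest once full)
	count int
}

// newRecentKeys creates a window of the given size (must be > 0).
func newRecentKeys(size int) *recentKeys {
	return &recentKeys{ring: make([]string, size), seen: make(map[string]struct{}, size)}
}

// add returns false if key is still inside the window; otherwise it records
// key, evicting the oldest entry when the ring is full.
func (r *recentKeys) add(key string) bool {
	if _, dup := r.seen[key]; dup {
		return false
	}
	if r.count == len(r.ring) {
		delete(r.seen, r.ring[r.next])
	} else {
		r.count++
	}
	r.ring[r.next] = key
	r.seen[key] = struct{}{}
	r.next = (r.next + 1) % len(r.ring)
	return true
}

func main() {
	r := newRecentKeys(2)
	k := dedupKey(1, "/dir", "", "a.txt")
	fmt.Println(k)
	fmt.Println(r.add(k), r.add(k)) // first is new, second is a duplicate
}
```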

* filer: remove mutex and proto.Clone from request-scoped MetadataEventSink

MetadataEventSink is created per-request and only accessed by the
goroutine handling the gRPC call. The mutex and double proto.Clone
(once in Record, once in Last) were unnecessary overhead on every
filer write operation. Store the pointer directly instead.

* mount: skip proto.Clone for caller-owned metadata events

Add ApplyMetadataResponseOwned that takes ownership of the response
without cloning. Local metadata events (mkdir, create, flush, etc.)
are freshly constructed and never shared, so the clone is unnecessary.

* filer: only populate MetadataEvent on successful DeleteEntry

Avoid calling eventSink.Last() on error paths where the sink may
contain a partial event from an intermediate child deletion during
recursive deletes.

* mount: avoid map allocation in collectDirectoryNotifications

Replace the map with a fixed-size array and linear dedup. There are
at most 3 directories to notify (old parent, new parent, new child
if directory), so a 3-element array avoids the heap allocation on
every metadata event.

* mount: fix potential deadlock in enqueueApplyRequest

Release applyStateMu before the blocking channel send. Previously,
if the channel was full (cap 128), the send would block while holding
the mutex, preventing Shutdown from acquiring it to set applyClosed.

* mount: restore signature-based self-event filtering as fast path

Re-add the signature check that was removed when content-based dedup
was introduced. Checking signatures is O(1) on a small slice and
avoids enqueuing and processing events that originated from this
mount instance. The content-based dedup remains as a fallback.

* filer: send snapshotTsNs only in first ListEntries response

The snapshot timestamp is identical for every entry in a single
ListEntries stream. Sending it in every response message wastes
wire bandwidth for large directories. The client already reads
it only from the first response.

* mount: exit read-through mode after successful full directory listing

MarkDirectoryRefreshed was defined but never called, so directories
that entered read-through mode (hot invalidation threshold) stayed
there permanently, hitting the filer on every readdir even when cold.
Call it after a complete read-through listing finishes.

* mount: include event shape and full paths in dedup key

The previous dedup key only used Names, which could collapse distinct
rename targets. Include the event shape (C/D/U/R), source directory,
new parent path, and both entry names so structurally different events
are never treated as duplicates.

* mount: drain pending requests on shutdown in runApplyLoop

After receiving the shutdown sentinel, drain any remaining requests
from applyCh non-blockingly and signal each with errMetaCacheClosed
so callers waiting on req.done are released.

* mount: include IsDirectory in synthetic delete events

metadataDeleteEvent now accepts an isDirectory parameter so the
applier can distinguish directory deletes from file deletes. Rmdir
passes true, Unlink passes false.

* mount: fall back to synthetic event when MetadataEvent is nil

In mknod and mkdir, if the filer response omits MetadataEvent (e.g.
older filer without the field), synthesize an equivalent local
metadata event so the cache is always updated.

* mount: make Flush metadata apply best-effort after successful commit

After filer_pb.CreateEntryWithResponse succeeds, the entry is
persisted. Don't fail the Flush syscall if the local metadata cache
apply fails — log and invalidate the directory cache instead.
Also fall back to a synthetic event when MetadataEvent is nil.

* mount: make Rename metadata apply best-effort

The rename has already succeeded on the filer by the time we apply
the local metadata event. Log failures instead of returning errors
that would be dropped by the caller anyway.

* mount: make saveEntry metadata apply best-effort with fallback

After UpdateEntryWithResponse succeeds, treat local metadata apply
as non-fatal. Log and invalidate the directory cache on failure.
Also fall back to a synthetic event when MetadataEvent is nil.

* filer_pb: preserve snapshotTsNs on error in ReadDirAllEntriesWithSnapshot

Return the snapshot timestamp even when the first page fails, so
callers receive the snapshot boundary when partial data was received.

* filer: send snapshot token for empty directory listings

When no entries are streamed, send a final ListEntriesResponse with
only SnapshotTsNs so clients always receive the snapshot boundary.

* mount: distinguish not-found vs transient errors in lookupEntry

Return fuse.EIO for non-not-found filer errors instead of
unconditionally returning ENOENT, so transient failures don't
masquerade as missing entries.

* mount: make CacheRemoteObject metadata apply best-effort

The file content has already been cached successfully. Don't fail
the read if the local metadata cache update fails.

* mount: use consistent snapshot for readdir in direct mode

Capture the SnapshotTsNs from the first loadDirectoryEntriesDirect
call and store it on the DirectoryHandle. Subsequent batch loads
pass this stored timestamp so all batches use the same snapshot.

Also export DoSeaweedListWithSnapshot so mount can use it directly
with snapshot passthrough.

* filer_pb: fix test fake to send SnapshotTsNs only on first response

Match the server behavior: only the first ListEntriesResponse in a
page carries the snapshot timestamp, subsequent entries leave it zero.

* Fix nil pointer dereference in ListEntries stream consumers

Remove the empty-directory snapshot-only response from ListEntries
that sent a ListEntriesResponse with Entry==nil, which crashed every
raw stream consumer that assumed resp.Entry is always non-nil.

Also add defensive nil checks for resp.Entry in all raw ListEntries
stream consumers across: S3 listing, broker topic lookup, broker
topic config, admin dashboard, topic retention, hybrid message
scanner, Kafka integration, and consumer offset storage.

* Add nil guards for resp.Entry in remaining ListEntries stream consumers

Covers: S3 object lock check, MQ management dashboard (version/
partition/offset loops), and topic retention version loop.

* Make applyLocalMetadataEvent best-effort in Link and Symlink

The filer operations already succeeded; failing the syscall because
the local cache apply failed is wrong. Log a warning and invalidate
the parent directory cache instead.

* Make applyLocalMetadataEvent best-effort in Mkdir/Rmdir/Mknod/Unlink

The filer RPC already committed; don't fail the syscall when the
local metadata cache apply fails. Log a warning and invalidate the
parent directory cache to force a re-fetch on next access.

* flushFileMetadata: add nil-fallback for metadata event and best-effort apply

Synthesize a metadata event when resp.GetMetadataEvent() is nil
(matching doFlush), and make the apply best-effort with cache
invalidation on failure.

* Prevent double-invocation of cleanupBuild in doEnsureVisited

Add a cleanupDone guard so the deferred cleanup and inline error-path
cleanup don't both call DeleteFolderChildren/AbortDirectoryBuild.

* Fix comment: signature check is O(n) not O(1)

* Prevent deferred cleanup after successful CompleteDirectoryBuild

Set cleanupDone before returning from the success path so the
deferred context-cancellation check cannot undo a published build.

* Invalidate parent directory caches on rename metadata apply failure

When applyLocalMetadataEvent fails during rename, invalidate the
source and destination parent directory caches so subsequent accesses
trigger a re-fetch from the filer.

* Add event nil-fallback and cache invalidation to Link and Symlink

Synthesize metadata events when the server doesn't return one, and
invalidate parent directory caches on apply failure.

* Match requested partition when scanning partition directories

Parse the partition range format (NNNN-NNNN) and match against the
requested partition parameter instead of using the first directory.

* Preserve snapshot timestamp across empty directory listings

Initialize actualSnapshotTsNs from the caller-requested value so it
isn't lost when the server returns no entries. Re-add the server-side
snapshot-only response for empty directories (all raw stream consumers
now have nil guards for Entry).

* Fix CreateEntry error wrapping to support errors.Is/errors.As

Use errors.New + %w instead of %v for resp.Error so callers can
unwrap the underlying error.

* Fix object lock pagination: only advance on non-nil entries

Move entriesReceived inside the nil check so nil entries don't
cause repeated ListEntries calls with the same lastFileName.

* Guard Attributes nil check before accessing Mtime in MQ management

* Do not send nil-Entry response for empty directory listings

The snapshot-only ListEntriesResponse (with Entry == nil) for empty
directories breaks consumers that treat any received response as an
entry (Java FilerClient, S3 listing). The Go client-side
DoSeaweedListWithSnapshot already preserves the caller-requested
snapshot via actualSnapshotTsNs initialization, so the server-side
send is unnecessary.

* Fix review findings: subscriber dedup, invalidation normalization, nil guards, shutdown race

- Remove self-signature early-return in processEventFn so all events
  flow through the applier (directory-build buffering sees self-originated
  events that arrive after a snapshot)
- Normalize NewParentPath in collectEntryInvalidations to avoid duplicate
  invalidations when NewParentPath is empty (same-directory update)
- Guard resp.Entry.Attributes for nil in admin_server.go and
  topic_retention.go to prevent panics on entries without attributes
- Fix enqueueApplyRequest race with shutdown by using select on both
  applyCh and applyDone, preventing sends after the apply loop exits
- Add cleanupDone check to deferred cleanup in meta_cache_init.go for
  clarity alongside the existing guard in cleanupBuild
- Add empty directory test case for snapshot consistency

* Propagate authoritative metadata event from CacheRemoteObjectToLocalCluster and generate client-side snapshot for empty directories

- Add metadata_event field to CacheRemoteObjectToLocalClusterResponse
  proto so the filer-emitted event is available to callers
- Use WithMetadataEventSink in the server handler to capture the event
  from NotifyUpdateEvent and return it on the response
- Update filehandle_read.go to prefer the RPC's metadata event over
  a locally fabricated one, falling back to metadataUpdateEvent when
  the server doesn't provide one (e.g., older filers)
- Generate a client-side snapshot cutoff in DoSeaweedListWithSnapshot
  when the server sends no snapshot (empty directory), so callers like
  CompleteDirectoryBuild get a meaningful boundary for filtering
  buffered events

* Skip directory notifications for dirs being built to prevent mid-build cache wipe

When a metadata event is buffered during a directory build,
applyMetadataSideEffects was still firing noteDirectoryUpdate for the
building directory. If the directory accumulated enough updates to
become "hot", markDirectoryReadThrough would call DeleteFolderChildren,
wiping entries that EnsureVisited had already inserted. The build would
then complete and mark the directory cached with incomplete data.

Fix by using applyMetadataSideEffectsSkippingBuildingDirs for buffered
events, which suppresses directory notifications for dirs currently in
buildingDirs while still applying entry invalidations.

* Add test for directory notification suppression during active build

TestDirectoryNotificationsSuppressedDuringBuild verifies that metadata
events targeting a directory under active EnsureVisited build do NOT
fire onDirectoryUpdate for that directory. In production, this prevents
markDirectoryReadThrough from calling DeleteFolderChildren mid-build,
which would wipe entries already inserted by the listing.

The test inserts an entry during a build, sends multiple metadata events
for the building directory, asserts no notifications fired for it,
verifies the entry survives, and confirms buffered events are replayed
after CompleteDirectoryBuild.

* Fix create invalidations, build guard, event shape, context, and snapshot error path

- collectEntryInvalidations: invalidate FUSE kernel cache on pure
  create events (OldEntry==nil && NewEntry!=nil), not just updates
  and deletes
- completeDirectoryBuildNow: only call markCachedFn when an active
  build existed (state != nil), preventing an unpopulated directory
  from being marked as cached
- Add metadataCreateEvent helper that produces a create-shaped event
  (NewEntry only, no OldEntry) and use it in mkdir, mknod, symlink,
  and hardlink create fallback paths instead of metadataUpdateEvent
  which incorrectly set both OldEntry and NewEntry
- applyMetadataResponseEnqueue: use context.Background() for the
  queued mutation so a cancelled caller context cannot abort the
  apply loop mid-write
- DoSeaweedListWithSnapshot: move snapshot initialization before
  ListEntries call so the error path returns the preserved snapshot
  instead of 0

* Fix review findings: test loop, cache race, context safety, snapshot consistency

- Fix build test loop starting at i=1 instead of i=0, missing new-0.txt verification
- Re-check IsDirectoryCached after cache miss to avoid ENOENT race with markDirectoryReadThrough
- Use context.Background() in enqueueAndWait so caller cancellation can't abort build/complete mid-way
- Pass dh.snapshotTsNs in skip-batch loadDirectoryEntriesDirect for snapshot consistency
- Prefer resp.MetadataEvent over fallback in Unlink event derivation
- Add comment on MetadataEventSink.Record single-event assumption

* Fix empty-directory snapshot clock skew and build cancellation race

Empty-directory snapshot: Remove client-side time.Now() synthesis when
the server returns no entries. Instead return snapshotTsNs=0, and in
completeDirectoryBuildNow replay ALL buffered events when snapshot is 0.
This eliminates the clock-skew bug where a client ahead of the filer
would filter out legitimate post-list events.

Build cancellation: Use context.Background() for BeginDirectoryBuild
and CompleteDirectoryBuild calls in doEnsureVisited, so errgroup
cancellation doesn't cause enqueueAndWait to return early and trigger
cleanupBuild while the operation is still queued.

* Add tests for empty-directory build replay and cancellation resilience

TestEmptyDirectoryBuildReplaysAllBufferedEvents: verifies that when
CompleteDirectoryBuild receives snapshotTsNs=0 (empty directory, no
server snapshot), ALL buffered events are replayed regardless of their
TsNs values — no clock-skew-sensitive filtering occurs.

TestBuildCompletionSurvivesCallerCancellation: verifies that once
CompleteDirectoryBuild is enqueued, a cancelled caller context does not
prevent the build from completing. The apply loop runs with
context.Background(), so the directory becomes cached and buffered
events are replayed even when the caller gives up waiting.

* Fix directory subtree cleanup, Link rollback, test robustness

- applyMetadataResponseLocked: when a directory entry is deleted or
  moved, call DeleteFolderChildren on the old path so cached descendants
  don't leak as stale entries.

- Link: save original HardLinkId/Counter before mutation. If
  CreateEntryWithResponse fails after the source was already updated,
  roll back the source entry to its original state via UpdateEntry.

- TestBuildCompletionSurvivesCallerCancellation: replace fixed
  time.Sleep(50ms) with a deadline-based poll that checks
  IsDirectoryCached in a loop, failing only after 2s timeout.

- TestReadDirAllEntriesWithSnapshotEmptyDirectory: assert that
  ListEntries was actually invoked on the mock client so the test
  exercises the RPC path.

- newMetadataEvent: add early return when both oldEntry and newEntry are
  nil to avoid emitting events with empty Directory.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-07 09:19:40 -08:00
Chris Lu
d4ecfaeda7 Enable writeback_cache and async_dio FUSE options (#7980)
* Enable writeback_cache and async_dio FUSE options

Fixes #7978

- Update mount_std.go to use EnableWriteback and EnableAsyncDio from go-fuse
- Add go.mod replace directive to use local go-fuse with capability support
- Remove temporary workaround that disabled these options

This enables proper FUSE kernel capability negotiation for writeback cache
and async direct I/O, improving performance for small writes and concurrent
direct I/O operations.

* Address PR review comments

- Remove redundant nil checks for writebackCache and asyncDio flags
- Update go.mod replace directive to use seaweedfs/go-fuse fork instead of local path

* Add TODO comment for go.mod replace directive

The replace directive must use a local path until seaweedfs/go-fuse#1 is merged.
After merge, this should be updated to use the proper version.

* Use seaweedfs/go-fuse v2.9.0 instead of local repository

Replace local path with seaweedfs/go-fuse v2.9.0 fork which includes
the writeback_cache and async_dio capability support.

* Use github.com/seaweedfs/go-fuse/v2 directly without replace directive

- Updated all imports to use github.com/seaweedfs/go-fuse/v2
- Removed replace directive from go.mod
- Using seaweedfs/go-fuse v2.0.0-20260106181308-87f90219ce09 which includes:
  * writeback_cache and async_dio support
  * Corrected module path

* Update to seaweedfs/go-fuse v2.9.1

Use v2.9.1 tag which includes the corrected module path
(github.com/seaweedfs/go-fuse/v2) along with writeback_cache
and async_dio support.
2026-01-06 10:50:54 -08:00
Chris Lu
6442da6f17 mount: efficient file lookup in large directories, skipping directory caching (#7818)
* mount: skip directory caching on file lookup and write

When opening or creating a file in a directory that hasn't been cached yet,
don't list the entire directory. Instead:
- For reads: fetch only the single file's metadata directly from the filer
- For writes: create on filer but skip local cache insertion

This fixes a performance issue where opening a file in a directory
with millions of files would hang because EnsureVisited() had to
list all entries before the open could complete.

The directory will still be cached when explicitly listed (ReadDir),
but individual file operations now bypass the full directory caching.

Key optimizations:
- Extract shared lookupEntry() method to eliminate code duplication
- Skip EnsureVisited on Lookup (file open)
- Skip cache insertion on Mknod, Mkdir, Symlink, Link if dir not cached
- Skip cache update on file sync/flush if dir not cached
- If directory IS cached and entry not found, return ENOENT immediately

Fixes #7145

* mount: add error handling for meta cache insert/update operations

Handle errors from metaCache.InsertEntry and metaCache.UpdateEntry calls
instead of silently ignoring them. This prevents silent cache inconsistencies
and ensures errors are properly propagated.

Files updated:
- filehandle_read.go: handle InsertEntry error in downloadRemoteEntry
- weedfs_file_sync.go: handle InsertEntry error in doFlush
- weedfs_link.go: handle UpdateEntry and InsertEntry errors in Link
- weedfs_symlink.go: handle InsertEntry error in Symlink

* mount: use error wrapping (%w) for consistent error handling

Use %w instead of %v in fmt.Errorf to preserve the original error,
allowing it to be inspected up the call stack with errors.Is/As.
2025-12-18 21:19:15 -08:00
Chris Lu
ec3378f7a6 fix: improve mount quota enforcement to prevent overflow (#7804)
* fix: improve mount quota enforcement to prevent overflow (fixes seaweedfs-csi-driver#218)

* test: add unit tests for quota enforcement
2025-12-17 01:14:01 -08:00
Aleksey Kosov
283d9e0079 Add context with request (#6824)
2025-05-28 11:34:02 -07:00
Aleksey Kosov
165af32d6b added context to filer_client method calls (#6808)
Co-authored-by: akosov <a.kosov@kryptonite.ru>
2025-05-22 09:46:49 -07:00
chrislu
f99a818ad2 deal with handle not found when flushing
fix https://github.com/seaweedfs/seaweedfs/issues/6645#issuecomment-2737400492
2025-03-19 21:02:26 -07:00
Guang Jiong Lou
6c986e9d70 improve worm support (#5983)
* improve worm support

Signed-off-by: lou <alex1988@outlook.com>

* worm mode in filer

Signed-off-by: lou <alex1988@outlook.com>

* update after review

Signed-off-by: lou <alex1988@outlook.com>

* update after review

Signed-off-by: lou <alex1988@outlook.com>

* move to fs configure

Signed-off-by: lou <alex1988@outlook.com>

* remove flag

Signed-off-by: lou <alex1988@outlook.com>

* update after review

Signed-off-by: lou <alex1988@outlook.com>

* support worm hardlink

Signed-off-by: lou <alex1988@outlook.com>

* update after review

Signed-off-by: lou <alex1988@outlook.com>

* typo

Signed-off-by: lou <alex1988@outlook.com>

* sync filer conf

Signed-off-by: lou <alex1988@outlook.com>

---------

Signed-off-by: lou <alex1988@outlook.com>
2024-09-16 21:02:21 -07:00
Chris Lu
952afd810c Fix dead lock (#5815)
* reduce locks to avoid deadlock

Flush->FlushData->uploadPipeline.FlushAll
uploaderCount>0

goroutine 1 [sync.Cond.Wait, 71 minutes]:
sync.runtime_notifyListWait(0xc0007ae4d0, 0x0)
	/usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0xc001a59290?)
	/usr/local/go/src/sync/cond.go:70 +0x85
github.com/seaweedfs/seaweedfs/weed/mount/page_writer.(*UploadPipeline).waitForCurrentWritersToComplete(0xc0002ee4d0)
	/github/workspace/weed/mount/page_writer/upload_pipeline_lock.go:58 +0x32
github.com/seaweedfs/seaweedfs/weed/mount/page_writer.(*UploadPipeline).FlushAll(0xc0002ee4d0)
	/github/workspace/weed/mount/page_writer/upload_pipeline.go:151 +0x25
github.com/seaweedfs/seaweedfs/weed/mount.(*ChunkedDirtyPages).FlushData(0xc00087e840)
	/github/workspace/weed/mount/dirty_pages_chunked.go:54 +0x29
github.com/seaweedfs/seaweedfs/weed/mount.(*PageWriter).FlushData(...)
	/github/workspace/weed/mount/page_writer.go:50
github.com/seaweedfs/seaweedfs/weed/mount.(*WFS).doFlush(0xc0006ad600, 0xc00030d380, 0x0, 0x0)
	/github/workspace/weed/mount/weedfs_file_sync.go:101 +0x169
github.com/seaweedfs/seaweedfs/weed/mount.(*WFS).Flush(0xc0006ad600, 0xc001a594a8?, 0xc0004c1ca0)
	/github/workspace/weed/mount/weedfs_file_sync.go:59 +0x48
github.com/hanwen/go-fuse/v2/fuse.doFlush(0xc0000da870?, 0xc0004c1b08)

SaveContent -> MemChunk.RLock ->
	ChunkedDirtyPages.saveChunkedFileIntervalToStorage
	pages.fh.AddChunks([]*filer_pb.FileChunk{chunk})
		fh.entryLock.Lock()

sync.(*RWMutex).Lock(0x0?)
	/usr/local/go/src/sync/rwmutex.go:146 +0x31
github.com/seaweedfs/seaweedfs/weed/mount.(*FileHandle).AddChunks(0xc00030d380, {0xc00028bdc8, 0x1, 0x1})
	/github/workspace/weed/mount/filehandle.go:93 +0x45
github.com/seaweedfs/seaweedfs/weed/mount.(*ChunkedDirtyPages).saveChunkedFileIntervalToStorage(0xc00087e840, {0x2be7ac0, 0xc00018d9e0}, 0x0, 0x121, 0x17e3c624565ace45, 0x1?)
	/github/workspace/weed/mount/dirty_pages_chunked.go:80 +0x2d4
github.com/seaweedfs/seaweedfs/weed/mount/page_writer.(*MemChunk).SaveContent(0xc0008d9130, 0xc0008093e0)
	/github/workspace/weed/mount/page_writer/page_chunk_mem.go:115 +0x112
github.com/seaweedfs/seaweedfs/weed/mount/page_writer.(*UploadPipeline).moveToSealed.func1()
	/github/workspace/weed/mount/page_writer/upload_pipeline.go:187 +0x55
github.com/seaweedfs/seaweedfs/weed/util.(*LimitedConcurrentExecutor).Execute.func1()
	/github/workspace/weed/util/limited_executor.go:38 +0x62
created by github.com/seaweedfs/seaweedfs/weed/util.(*LimitedConcurrentExecutor).Execute in goroutine 1
	/github/workspace/weed/util/limited_executor.go:33 +0x97

On metadata update
fh.entryLock.Lock()
	fh.dirtyPages.Destroy()
		up.chunksLock.Lock => each sealed chunk.FreeReference => MemChunk.Lock

goroutine 134 [sync.RWMutex.Lock, 71 minutes]:
sync.runtime_SemacquireRWMutex(0xc0007c3558?, 0xea?, 0x3fb0800?)
	/usr/local/go/src/runtime/sema.go:87 +0x25
sync.(*RWMutex).Lock(0xc0007c35a8?)
	/usr/local/go/src/sync/rwmutex.go:151 +0x6a
github.com/seaweedfs/seaweedfs/weed/mount/page_writer.(*MemChunk).FreeResource(0xc0008d9130)
	/github/workspace/weed/mount/page_writer/page_chunk_mem.go:38 +0x2a
github.com/seaweedfs/seaweedfs/weed/mount/page_writer.(*SealedChunk).FreeReference(0xc00071cdb0, {0xc0006ba1a0, 0x20})
	/github/workspace/weed/mount/page_writer/upload_pipeline.go:38 +0xb7
github.com/seaweedfs/seaweedfs/weed/mount/page_writer.(*UploadPipeline).Shutdown(0xc0002ee4d0)
	/github/workspace/weed/mount/page_writer/upload_pipeline.go:220 +0x185
github.com/seaweedfs/seaweedfs/weed/mount.(*ChunkedDirtyPages).Destroy(0xc0008cea40?)
	/github/workspace/weed/mount/dirty_pages_chunked.go:87 +0x17
github.com/seaweedfs/seaweedfs/weed/mount.(*PageWriter).Destroy(...)
	/github/workspace/weed/mount/page_writer.go:78
github.com/seaweedfs/seaweedfs/weed/mount.NewSeaweedFileSystem.func3({0xc00069a6c0, 0x30}, 0x6?)
	/github/workspace/weed/mount/weedfs.go:119 +0x17a
github.com/seaweedfs/seaweedfs/weed/mount/meta_cache.NewMetaCache.func1({0xc00069a6c0?, 0xc00069a480?}, 0x4015b40?)
	/github/workspace/weed/mount/meta_cache/meta_cache.go:37 +0x1c
github.com/seaweedfs/seaweedfs/weed/mount/meta_cache.SubscribeMetaEvents.func1(0xc000661810)
	/github/workspace/weed/mount/meta_cache/meta_cache_subscribe.go:43 +0x570

* use locked entry everywhere

* modifiable remote entry

* skip locking after getting lock from fhLockTable
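
The two traces above show a lock-order inversion: the upload path holds the chunk lock and waits for fh.entryLock, while the metadata-update path holds fh.entryLock and waits for the chunk lock. The sketch below shows one generic way to avoid that pattern, by never holding both locks in the upload path. It is not the actual change in this commit, and `fileHandle`, `memChunk`, `saveContent`, `destroy`, and `run` are simplified, hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// fileHandle is a simplified stand-in holding the entry lock and chunk list.
type fileHandle struct {
	entryLock sync.RWMutex
	chunks    []string
}

// memChunk is a simplified stand-in for an in-memory dirty page chunk.
type memChunk struct {
	sync.RWMutex
	data []byte
}

// saveContent snapshots the data under the chunk read lock, releases it,
// and only then takes entryLock, so it never holds both locks at once.
func (c *memChunk) saveContent(fh *fileHandle, id string) {
	c.RLock()
	snapshot := append([]byte(nil), c.data...)
	c.RUnlock()
	_ = snapshot // an upload of the snapshot would happen here

	fh.entryLock.Lock()
	fh.chunks = append(fh.chunks, id)
	fh.entryLock.Unlock()
}

// destroy mirrors the metadata-update path: entryLock first, then chunk lock.
func (fh *fileHandle) destroy(c *memChunk) {
	fh.entryLock.Lock()
	defer fh.entryLock.Unlock()
	c.Lock()
	c.data = nil
	c.Unlock()
}

// run races n savers against n destroyers and returns the recorded chunk count.
func run(n int) int {
	fh := &fileHandle{}
	c := &memChunk{data: []byte("dirty page")}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func(i int) { defer wg.Done(); c.saveContent(fh, fmt.Sprintf("chunk-%d", i)) }(i)
		go func() { defer wg.Done(); fh.destroy(c) }()
	}
	wg.Wait()
	return len(fh.chunks)
}

func main() {
	fmt.Println(run(100)) // completes without deadlock and prints 100
}
```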
2024-07-24 23:46:40 -07:00
Gaspare Iengo
fce8fc1e16 Fix Issue #5649 (#5652) 2024-06-06 12:49:33 -07:00
chrislu
31fc165715 mount switch to ordered lock requests 2023-09-21 11:08:26 -07:00
Chengyu Liu
0f653d2153 update fuse client log which should output the fh of FileHandle. (#4404) 2023-04-15 22:15:16 -07:00
chrislu
efcffe4f77 no need to set chunks during flush 2023-01-10 10:01:52 -08:00
chrislu
86657ed3a8 avoid repeatedly adding manifest chunks 2023-01-10 01:35:10 -08:00
Chris Lu
d4566d4aaa more solid weed mount (#4089)
* compare chunks by timestamp

* fix slab clearing error

* fix test compilation

* move oldest chunk to sealed, instead of by fullness

* lock on fh.entryViewCache

* remove verbose logs

* revert slab clearing

* less logs

* less logs

* track write and read by timestamp

* remove useless logic

* add entry lock on file handle release

* use mem chunk only, swap file chunk has problems

* comment out code that may be used later

* add debug mode to compare data read and write

* more efficient readResolvedChunks with linked list

* small optimization

* fix test compilation

* minor fix on writer

* add SeparateGarbageChunks

* group chunks into sections

* turn off debug mode

* fix tests

* fix tests

* tmp enable swap file chunk

* Revert "tmp enable swap file chunk"

This reverts commit 985137ec472924e4815f258189f6ca9f2168a0a7.

* simple refactoring

* simple refactoring

* do not re-use swap file chunk. Sealed chunks should not be re-used.

* comment out debugging facilities

* either mem chunk or swap file chunk is fine now

* remove orderedMutex as *semaphore.Weighted

it was not found to be impactful

* optimize size calculation for changing large files

* optimize performance to avoid going through the long list of chunks

* still problems with swap file chunk

* rename

* tiny optimization

* swap file chunk save only successfully read data

* fix

* enable both mem and swap file chunk

* resolve chunks with range

* rename

* fix chunk interval list

* also change file handle chunk group when adding chunks

* pick inactive chunk with time-decayed counter

* fix compilation

* avoid nil with empty fh.entry

* refactoring

* rename

* rename

* refactor visible intervals to *list.List

* refactor chunkViews to *list.List

* add IntervalList for generic interval list

* change visible interval to use IntervalList in generics

* change chunkViews to *IntervalList[*ChunkView]

* use NewFileChunkSection to create

* rename variables

* refactor

* fix renaming leftover

* renaming

* renaming

* add insert interval

* interval list adds lock

* incrementally add chunks to readers

Fixes:
1. set start and stop offset for the value object
2. clone the value object
3. use pointer instead of copy-by-value when passing to interval.Value
4. use insert interval since adding chunk could be out of order
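
Fix 4 can be pictured with a minimal generic list that keeps intervals sorted by start offset even when they arrive out of order. This is a simplified sketch under assumed names (`interval`, `intervalList`, `insert`, `starts`); it stores pointers as in fix 3, but unlike the real IntervalList it does not resolve overlapping intervals.

```go
package main

import (
	"fmt"
	"sort"
)

// interval is a hypothetical element holding a half-open range and a value.
type interval[T any] struct {
	start, stop int64
	value       T
}

// intervalList keeps its items ordered by start offset.
type intervalList[T any] struct {
	items []*interval[T]
}

// insert places the new interval at its sorted position instead of appending,
// so chunks that arrive out of order still end up ordered by start offset.
func (l *intervalList[T]) insert(start, stop int64, v T) {
	it := &interval[T]{start: start, stop: stop, value: v}
	i := sort.Search(len(l.items), func(i int) bool { return l.items[i].start > start })
	l.items = append(l.items, nil)
	copy(l.items[i+1:], l.items[i:]) // shift the tail right by one slot
	l.items[i] = it
}

// starts returns the start offsets in list order, for inspection.
func (l *intervalList[T]) starts() []int64 {
	out := make([]int64, 0, len(l.items))
	for _, it := range l.items {
		out = append(out, it.start)
	}
	return out
}

func main() {
	var l intervalList[string]
	l.insert(200, 300, "c")
	l.insert(0, 100, "a")
	l.insert(100, 200, "b")
	fmt.Println(l.starts()) // [0 100 200]
}
```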

* fix tests compilation

* fix tests compilation
2023-01-02 23:20:45 -08:00
chrislu
659f28b855 simple refactoring 2022-12-05 23:19:38 -08:00
Chris Lu
dac9c28d05 Revert "refactor: moved to locked entry" (#4035)
* Revert "refactor: moved to locked entry"

This reverts commit 94bc9afd9d.

* only add LockedEntry, no changes to entryLock

* fix compilation
2022-12-05 12:32:27 -08:00
chrislu
94bc9afd9d refactor: moved to locked entry 2022-12-04 23:33:05 -08:00
chrislu
70a4c98b00 refactor filer_pb.Entry and filer.Entry to use GetChunks()
for later locking on reading chunks
2022-11-15 06:33:36 -08:00
chrislu
22064c3425 mount: ensure ordered file handle lock and unlock 2022-09-11 19:44:34 -07:00
chrislu
26dbc6c905 move to https://github.com/seaweedfs/seaweedfs 2022-07-29 00:17:28 -07:00
chrislu
4fd5f96598 filer: remove replication, collection, disk_type info from entry metadata
these metadata can change and are not used
2022-06-06 00:39:35 -07:00
chrislu
d65bb2c6df mount: file handle locks entry better
related to https://github.com/chrislusf/seaweedfs/issues/2952
2022-06-05 18:15:06 -07:00
chrislu
7542fd7f4d mount: optimize a bit when writing files 2022-03-17 00:02:38 -07:00
chrislu
f3442e36e6 mount: quota adjust error type to be syscall.ENOSPC 2022-03-06 17:04:21 -08:00
chrislu
b7c992f410 add flag to enforce quota 2022-03-05 22:10:43 -08:00
chrislu
3639fedd01 mount: fix fsync opened and renamed files 2022-02-28 00:34:17 -08:00
chrislu
e8ce30fdc5 mount2: adjust file mode 2022-02-18 00:47:02 -08:00
chrislu
7286e525ad support write 2022-02-13 23:27:11 -08:00
chrislu
bb9919b07a add open release, refactor 2022-02-13 19:14:34 -08:00