* chore: remove unreachable dead code across the codebase
Remove ~50,000 lines of unreachable code identified by static analysis.
Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
shell, storage, topology, and util packages
* fix(s3): reset shared memory store in IAM test to prevent flaky failure
TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.
Fix by calling Reset() on the memory store before the test runs.
* style: run gofmt on changed files
* fix: restore KMS functions used by integration tests
* fix(plugin): prevent panic on send to closed worker session channel
The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.
Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
* mount: async flush on close() when writebackCache is enabled
When -writebackCache is enabled, defer data upload and metadata flush
from Flush() (triggered by close()) to a background goroutine in
Release(). This allows processes like rsync that write many small files
to proceed to the next file immediately instead of blocking on two
network round-trips (volume upload + filer metadata) per file.
Fixes#8718
* mount: add retry with backoff for async metadata flush
The metadata flush in completeAsyncFlush now retries up to 3 times
with exponential backoff (1s, 2s, 4s) on transient gRPC errors.
Since the chunk data is already safely on volume servers at this point,
only the filer metadata reference needs persisting — retrying is both
safe and effective.
Data flush (FlushData) is not retried externally because
UploadWithRetry already handles transient HTTP/gRPC errors internally;
if it still fails, the chunk memory has been freed.
* test: add integration tests for writebackCache async flush
Add comprehensive FUSE integration tests for the writebackCache
async flush feature (issue #8718):
- Basic operations: write/read, sequential files, large files, empty
files, overwrites
- Fsync correctness: fsync forces synchronous flush even in writeback
mode, immediate read-after-fsync
- Concurrent small files: multi-worker parallel writes (rsync-like
workload), multi-directory, rapid create/close
- Data integrity: append after close, partial writes, file size
correctness, binary data preservation
- Performance comparison: writeback vs synchronous flush throughput
- Stress test: 16 workers x 100 files with content verification
- Mixed concurrent operations: reads, writes, creates running together
Also fix pre-existing test infrastructure issues:
- Rename framework.go to framework_test.go (fixes Go package conflict)
- Fix undefined totalSize variable in concurrent_operations_test.go
* ci: update fuse-integration workflow to run full test suite
The workflow previously only ran placeholder tests (simple_test.go,
working_demo_test.go) in a temp directory due to a Go module conflict.
Now that framework.go is renamed to framework_test.go, the full test
suite compiles and runs correctly from test/fuse_integration/.
Changes:
- Run go test directly in test/fuse_integration/ (no temp dir copy)
- Install weed binary to /usr/local/bin for test framework discovery
- Configure /etc/fuse.conf with user_allow_other for FUSE mounts
- Install fuse3 for modern FUSE support
- Stream test output to log file for artifact upload
* mount: fix three P1 races in async flush
P1-1: Reopen overwrites data still flushing in background
ReleaseByHandle removes the old handle from fhMap before the deferred
flush finishes. A reopen of the same inode during that window would
build from stale filer metadata, overwriting the async flush.
Fix: Track in-flight async flushes per inode via pendingAsyncFlush map.
AcquireHandle now calls waitForPendingAsyncFlush(inode) to block until
any pending flush completes before reading filer metadata.
P1-2: Deferred flush races rename and unlink after close
completeAsyncFlush captured the path once at entry, but rename or
unlink after close() could cause metadata to be written under the
wrong name or recreate a deleted file.
Fix: Re-resolve path from inode via GetPath right before metadata
flush. GetPath returns the current path (reflecting renames) or
ENOENT (if unlinked), in which case we skip the metadata flush.
P1-3: SIGINT/SIGTERM bypasses the async-flush drain
grace.OnInterrupt runs hooks then calls os.Exit(0), so
WaitForAsyncFlush after server.Serve() never executes on signal.
Fix: Add WaitForAsyncFlush (with 10s timeout) to the WFS interrupt
handler, before cache cleanup. The timeout prevents hanging on Ctrl-C
when the filer is unreachable.
* mount: fix P1 races — draining handle stays in fhMap
P1-1: Reopen TOCTOU
The gap between ReleaseByHandle removing from fhMap and
submitAsyncFlush registering in pendingAsyncFlush allowed a
concurrent AcquireHandle to slip through with stale metadata.
Fix: Hold pendingAsyncFlushMu across both the counter decrement
(ReleaseByHandle) and the pending registration. The handle is
registered as pending before the lock is released, so
waitForPendingAsyncFlush always sees it.
P1-2: Rename/unlink can't find draining handle
ReleaseByHandle deleted from fhMap immediately. Rename's
FindFileHandle(inode) at line 251 could not find the handle to
update entry.Name. Unlink could not coordinate either.
Fix: When asyncFlushPending is true, ReleaseByHandle/ReleaseByInode
leave the handle in fhMap (counter=0 but maps intact). The handle
stays visible to FindFileHandle so rename can update entry.Name.
completeAsyncFlush re-resolves the path from the inode (GetPath)
right before metadata flush for correctness after rename/unlink.
After drain, RemoveFileHandle cleans up the maps.
Double-return prevention: ReleaseByHandle/ReleaseByInode return nil
if counter is already <= 0, so Forget after Release doesn't start a
second drain goroutine.
P1-3: SIGINT deletes swap files under running goroutines
After the 10s timeout, os.RemoveAll deleted the write cache dir
(containing swap files) while FlushData goroutines were still
reading from them.
Fix: Increase timeout to 30s. If timeout expires, skip write cache
dir removal so in-flight goroutines can finish reading swap files.
The OS (or next mount) cleans them up. Read cache is always removed.
* mount: never skip metadata flush when Forget drops inode mapping
Forget removes the inode→path mapping when the kernel's lookup count
reaches zero, but this does NOT mean the file was unlinked — it only
means the kernel evicted its cache entry. completeAsyncFlush was
treating GetPath failure as "file unlinked" and skipping the metadata
flush, which orphaned the just-uploaded chunks for live files.
Fix: Save dir and name at doFlush defer time. In completeAsyncFlush,
try GetPath first to pick up renames; if the mapping is gone, fall
back to the saved dir/name. Always attempt the metadata flush — the
filer is the authority on whether the file exists, not the local
inode cache.
* mount: distinguish Forget from Unlink in async flush path fallback
The saved-path fallback (from the previous fix) always flushed
metadata when GetPath failed, which recreated files that were
explicitly unlinked after close(). The same stale fallback could
recreate the pre-rename path if Forget dropped the inode mapping
after a rename.
Root cause: GetPath failure has two meanings:
1. Forget — kernel evicted the cache entry (file still exists)
2. Unlink — file was explicitly deleted (should not recreate)
Fix (three coordinated changes):
Unlink (weedfs_file_mkrm.go): Before RemovePath, look up the inode
and find any draining handle via FindFileHandle. Set fh.isDeleted =
true so the async flush knows the file was explicitly removed.
Rename (weedfs_rename.go): When renaming a file with a draining
handle, update asyncFlushDir/asyncFlushName to the post-rename
location. This keeps the saved-path fallback current so Forget
after rename doesn't flush to the old (pre-rename) path.
completeAsyncFlush (weedfs_async_flush.go): Check fh.isDeleted
first — if true, skip metadata flush (file was unlinked, chunks
become orphans for volume.fsck). Otherwise, try GetPath for the
current path (renames); fall back to saved path if Forget dropped
the mapping (file is live, just evicted from kernel cache).
* test/ci: address PR review nitpicks
concurrent_operations_test.go:
- Restore precise totalSize assertion instead of info.Size() > 0
writeback_cache_test.go:
- Check rand.Read errors in all 3 locations (lines 310, 512, 757)
- Check os.MkdirAll error in stress test (line 752)
- Remove dead verifyErrors variable (line 332)
- Replace both time.Sleep(5s) with polling via waitForFileContent
to avoid flaky tests under CI load (lines 638, 700)
fuse-integration.yml:
- Add set -o pipefail so go test failures propagate through tee
* ci: fix fuse3/fuse package conflict on ubuntu-22.04 runner
fuse3 is pre-installed on ubuntu-22.04 runners and conflicts with
the legacy fuse package. Only install libfuse3-dev for the headers.
* mount/page_writer: remove debug println statements
Remove leftover debug println("read new data1/2") from
ReadDataAt in MemChunk and SwapFileChunk.
* test: fix findWeedBinary matching source directory instead of binary
findWeedBinary() matched ../../weed (the source directory) via
os.Stat before checking PATH, then tried to exec a directory
which fails with "permission denied" on the CI runner.
Fix: Check PATH first (reliable in CI where the binary is installed
to /usr/local/bin). For relative paths, verify the candidate is a
regular file (!info.IsDir()). Add ../../weed/weed as a candidate
for in-tree builds.
* test: fix framework — dynamic ports, output capture, data dirs
The integration test framework was failing in CI because:
1. All tests used hardcoded ports (19333/18080/18888), so sequential
tests could conflict when prior processes hadn't fully released
their ports yet.
2. Data subdirectories (data/master, data/volume) were not created
before starting processes.
3. Master was started with -peers=none which is not a valid address.
4. Process stdout/stderr was not captured, making failures opaque
("service not ready within timeout" with no diagnostics).
5. The unmount fallback used 'umount' instead of 'fusermount -u'.
6. The mount used -cacheSizeMB (nonexistent) instead of
-cacheCapacityMB and was missing -allowOthers=false for
unprivileged CI runners.
Fixes:
- Dynamic port allocation via freePort() (net.Listen ":0")
- Explicit gRPC ports via -port.grpc to avoid default port conflicts
- Create data/master and data/volume directories in Setup()
- Remove invalid -peers=none and -raftBootstrap flags
- Capture process output to logDir/*.log via startProcess() helper
- dumpLog() prints tail of log file on service startup failure
- Use fusermount3/fusermount -u for unmount
- Fix mount flag names (-cacheCapacityMB, -allowOthers=false)
* test: remove explicit -port.grpc flags from test framework
SeaweedFS convention: gRPC port = HTTP port + 10000. Volume and
filer discover the master gRPC port by this convention. Setting
explicit -port.grpc on master/volume/filer broke inter-service
communication because the volume server computed master gRPC as
HTTP+10000 but the actual gRPC was on a different port.
Remove all -port.grpc flags and let the default convention work.
Dynamic HTTP ports already ensure uniqueness; the derived gRPC
ports (HTTP+10000) will also be unique.
---------
Co-authored-by: Copilot <copilot@github.com>
* Fix: prevent panic when swap file creation fails
* weed mount: fix race condition in swap file initialization
Ensure thread-safe access to sf.file and other state in NewSwapFileChunk
and FreeResource by using sf.chunkTrackingLock consistently. Also
set sf.file to nil after closing to prevent reuse.
* weed mount: improve swap directory creation logic
- Check error for os.MkdirAll and log it if it fails.
- Use 0700 permissions for the swap directory for better security.
- Improve error logging context.
* weed mount: add unit tests for swap file creation
Add tests to verify:
- Concurrent initialization of the swap file.
- Correct directory permissions (0700).
- Automatic directory recreation if deleted.
* weed mount: fix thread-safety in swap file unit tests
Use atomic.Uint32 to track failures within goroutines in
TestSwapFile_NewSwapFileChunk_Concurrent to avoid unsafe calls to
t.Errorf from multiple goroutines.
* weed mount: simplify swap file creation logic
Refactor the directory check and retry logic for better readability and
to avoid re-using the main error variable for directory creation errors.
Remove redundant error logging.
* weed mount: improve error checking in swap file tests
Explicitly check if NewSwapFileChunk returns nil to provide more
informative failures.
* weed mount: update DirtyPages interface to return error
Propagate errors from SaveDataAt when swap file creation fails. This
prevents potential panics in the write path.
* weed mount: handle AddPage errors in write paths
Update ChunkedDirtyPages and PageWriter to propagate errors and update
WFS.Write and WFS.CopyFileRange to return fuse.EIO on failure.
* weed mount: update swap directory creation error message
Change "recreate" to "create/recreate" to better reflect that this path
is also taken during the initial creation of the swap directory.
---------
Co-authored-by: lixiang58 <lixiang58@lenovo.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* compare chunks by timestamp
* fix slab clearing error
* fix test compilation
* move oldest chunk to sealed, instead of by fullness
* lock on fh.entryViewCache
* remove verbose logs
* revert slat clearing
* less logs
* less logs
* track write and read by timestamp
* remove useless logic
* add entry lock on file handle release
* use mem chunk only, swap file chunk has problems
* comment out code that maybe used later
* add debug mode to compare data read and write
* more efficient readResolvedChunks with linked list
* small optimization
* fix test compilation
* minor fix on writer
* add SeparateGarbageChunks
* group chunks into sections
* turn off debug mode
* fix tests
* fix tests
* tmp enable swap file chunk
* Revert "tmp enable swap file chunk"
This reverts commit 985137ec472924e4815f258189f6ca9f2168a0a7.
* simple refactoring
* simple refactoring
* do not re-use swap file chunk. Sealed chunks should not be re-used.
* comment out debugging facilities
* either mem chunk or swap file chunk is fine now
* remove orderedMutex as *semaphore.Weighted
not found impactful
* optimize size calculation for changing large files
* optimize performance to avoid going through the long list of chunks
* still problems with swap file chunk
* rename
* tiny optimization
* swap file chunk save only successfully read data
* fix
* enable both mem and swap file chunk
* resolve chunks with range
* rename
* fix chunk interval list
* also change file handle chunk group when adding chunks
* pick in-active chunk with time-decayed counter
* fix compilation
* avoid nil with empty fh.entry
* refactoring
* rename
* rename
* refactor visible intervals to *list.List
* refactor chunkViews to *list.List
* add IntervalList for generic interval list
* change visible interval to use IntervalList in generics
* cahnge chunkViews to *IntervalList[*ChunkView]
* use NewFileChunkSection to create
* rename variables
* refactor
* fix renaming leftover
* renaming
* renaming
* add insert interval
* interval list adds lock
* incrementally add chunks to readers
Fixes:
1. set start and stop offset for the value object
2. clone the value object
3. use pointer instead of copy-by-value when passing to interval.Value
4. use insert interval since adding chunk could be out of order
* fix tests compilation
* fix tests compilation