* check for nil needle map before compaction sync
When CommitCompact runs concurrently, it sets v.nm = nil under
dataFileAccessLock. CompactByIndex does not hold that lock, so
v.nm.Sync() can hit a nil pointer. Add an early nil check to
return an error instead of crashing.
Fixes#8591
* guard copyDataBasedOnIndexFile size check against nil needle map
The post-compaction size validation at line 538 accesses
v.nm.ContentSize() and v.nm.DeletedSize(). If CommitCompact has
concurrently set v.nm to nil, this causes a SIGSEGV. Skip the
validation when v.nm is nil since the actual data copy uses local
needle maps (oldNm/newNm) and is unaffected.
Fixes#8591
* use atomic.Bool for compaction flags to prevent concurrent vacuum races
The isCompacting and isCommitCompacting flags were plain bools
read and written from multiple goroutines without synchronization.
This allowed concurrent vacuums on the same volume to pass the
guard checks and run simultaneously, leading to the nil pointer
crash. Using atomic.Bool with CompareAndSwap ensures only one
compaction or commit can run per volume at a time.
Fixes#8591
* use go-version-file in CI workflows instead of hardcoded versions
Use go-version-file: 'go.mod' so CI automatically picks up the Go
version from go.mod, avoiding future version drift. Reordered
checkout before setup-go in go.yml and e2e.yml so go.mod is
available. Removed the now-unused GO_VERSION env vars.
* capture v.nm locally in CompactByIndex to close TOCTOU race
A bare nil check on v.nm followed by v.nm.Sync() has a race window
where CommitCompact can set v.nm = nil between the two. Snapshot
the pointer into a local variable so the nil check and Sync operate
on the same reference.
* add dynamic timeouts to plugin worker vacuum gRPC calls
All vacuum gRPC calls used context.Background() with no deadline,
so the plugin scheduler's execution timeout could kill a job while
a large volume compact was still in progress. Use volume-size-scaled
timeouts matching the topology vacuum approach: 3 min/GB for compact,
1 min/GB for check, commit, and cleanup.
Fixes#8591
* Revert "add dynamic timeouts to plugin worker vacuum gRPC calls"
This reverts commit 80951934c37416bc4f6c1472a5d3f8d204a637d9.
* unify compaction lifecycle into single atomic flag
Replace separate isCompacting and isCommitCompacting flags with a
single isCompactionInProgress atomic.Bool. This ensures CompactBy*,
CommitCompact, Close, and Destroy are mutually exclusive — only one
can run at a time per volume.
Key changes:
- All entry points use CompareAndSwap(false, true) to claim exclusive
access. CompactByVolumeData and CompactByIndex now also guard v.nm
and v.DataBackend with local captures.
- Close() waits for the flag outside dataFileAccessLock to avoid
deadlocking with CommitCompact (which holds the flag while waiting
for the lock). It claims the flag before acquiring the lock so no
new compaction can start.
- Destroy() uses CAS instead of a racy Load check, preventing
concurrent compaction from racing with volume teardown.
- unmountVolumeByCollection no longer deletes from the map;
DeleteCollectionFromDiskLocation removes entries only after
successful Destroy, preventing orphaned volumes on failure.
Fixes#8591
* fix: prevent filer.backup stall in single-filer setups (#4977)
When MetaAggregator.MetaLogBuffer is empty (which happens in single-filer
setups with no peers), ReadFromBuffer was returning nil error, causing
LoopProcessLogData to enter an infinite wait loop on ListenersCond.
This fix returns ResumeFromDiskError instead, allowing SubscribeMetadata
to loop back and read from persisted logs on disk. This ensures filer.backup
continues processing events even when the in-memory aggregator buffer is empty.
Fixes#4977
* test: add integration tests for metadata subscription
Add integration tests for metadata subscription functionality:
- TestMetadataSubscribeBasic: Tests basic subscription and event receiving
- TestMetadataSubscribeSingleFilerNoStall: Regression test for #4977,
verifies subscription doesn't stall under high load in single-filer setups
- TestMetadataSubscribeResumeFromDisk: Tests resuming subscription from disk
Related to #4977
* ci: add GitHub Actions workflow for metadata subscribe tests
Add CI workflow that runs on:
- Push/PR to master affecting filer, log_buffer, or metadata subscribe code
- Runs the integration tests for metadata subscription
- Uploads logs on failure for debugging
Related to #4977
* fix: use multipart form-data for file uploads in integration tests
The filer expects multipart/form-data for file uploads, not raw POST body.
This fixes the 'Content-Type isn't multipart/form-data' error.
* test: use -peers=none for faster master startup
* test: add -peers=none to remaining master startup in ec tests
* fix: use filer HTTP port 8888, WithFilerClient adds 10000 for gRPC
WithFilerClient calls ToGrpcAddress() which adds 10000 to the port.
Passing 18888 resulted in connecting to 28888. Use 8888 instead.
* test: add concurrent writes and million updates tests
- TestMetadataSubscribeConcurrentWrites: 50 goroutines writing 20 files each
- TestMetadataSubscribeMillionUpdates: 1 million metadata entries via gRPC
(metadata only, no actual file content for speed)
* fix: address PR review comments
- Handle os.MkdirAll errors explicitly instead of ignoring
- Handle log file creation errors with proper error messages
- Replace silent event dropping with 100ms timeout and warning log
* Update metadata_subscribe_integration_test.go