Rust volume server implementation with CI (#8539)
* Match Go gRPC client transport defaults
* Honor Go HTTP idle timeout
* Honor maintenanceMBps during volume copy
* Honor images.fix.orientation on uploads
* Honor cpuprofile when pprof is disabled
* Match Go memory status payloads
* Propagate request IDs across gRPC calls
* Format pending Rust source updates
* Match Go stats endpoint payloads
* Serve Go volume server UI assets
* Enforce Go HTTP whitelist guards
* Align Rust metrics admin-port test with Go behavior
* Format pending Rust server updates
* Honor access.ui without per-request JWT checks
* Honor keepLocalDatFile in tier upload shortcut
* Honor Go remote volume write mode
* Load tier backends from master config
* Check master config before loading volumes
* Remove vif files on volume destroy
* Delete remote tier data on volume destroy
* Honor vif version defaults and overrides
* Reject mismatched vif bytes offsets
* Load remote-only tiered volumes
* Report Go tail offsets in sync status
* Stream remote dat in incremental copy
* Honor collection vif for EC shard config
* Persist EC expireAtSec in vif metadata
* Stream remote volume reads through HTTP
* Serve HTTP ranges from backend source
* Match Go ReadAllNeedles scan order
* Match Go CopyFile zero-stop metadata
* Delete EC volumes with collection cleanup
* Drop deleted collection metrics
* Match Go tombstone ReadNeedleMeta
* Match Go TTL parsing: all-digit values default to minutes, two-pass fit algorithm
* Match Go needle ID/cookie formatting and name size computation
* Match Go image ext checks: webp resize only, no crop; empty healthz body
* Match Go Prometheus metric names and add missing handler counter constants
* Match Go ReplicaPlacement short string parsing with zero-padding
* Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS
* Add walk_ecx_stats for accurate EC volume file counts and size
* Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision
* Match Go needle map: unconditional delete counter, fix redb idx walk offset
* Add CompactMapSegment overflow panic guard matching Go
* Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback
* Match Go 304 Not Modified: return bare status with no headers
* Match Go JWT error message: use "wrong jwt" instead of detailed error
* Match Go read handler bare 400, delete error prefix, download throttle timeout
* Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix
* Match Go heartbeat: keep is_heartbeating on error, add EC shard identification
* Match Go needle ReadBytes V2: tolerate EOF on truncated body
* Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard
* Match Go DeleteCollection: propagate destroy errors
* Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code
* Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding
* Match Go Guard isWriteActive: combine whitelist and signing key check
* Match Go DeleteCollectionMetrics: use partial label matching
* Match Go heartbeat: send state-only delta on volume state changes
* Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking
* Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs
* Add read_ec_shard_needle for full needle reconstruction from local EC shards
* Make heartbeat master config helpers pub for VolumeCopy preallocation
* Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording
* Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error
* Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling
* Add rebuild_ecx_file for EC index reconstruction from data shards
* Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation
* Add EC volume read and delete support in HTTP handlers
* Add per-shard EC mount/unmount, location predicate search, idx directory for EC
* Add CheckVolumeDataIntegrity on volume load matching Go
* Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type
* Match Go heartbeat metric accounting
* Match Go duplicate UUID heartbeat retries
* Delete expired EC volumes during heartbeat
* Match Go volume heartbeat pruning
* Honor master preallocate in volume max
* Report remote storage info in heartbeats
* Emit EC heartbeat deltas on shard changes
* Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space
* Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns
* Match Go VolumeUnmount: idempotent success when volume not found
* Match Go TTL Display: return empty string when unit is Empty
Go checks `t.Unit == Empty` separately and returns "" for TTLs
with nonzero count but Empty unit. Rust only checked is_empty()
(count==0 && unit==0), so count>0 with unit=0 would format as
"5 " instead of "".
* Match Go error behavior for truncated needle data in read_body_v2
Go's readNeedleDataVersion2 returns "index out of range %d" errors
(indices 1-7) when needle body or metadata fields are truncated.
Rust was silently tolerating truncation and returning Ok. Now returns
NeedleError::IndexOutOfRange with the matching index for each field.
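A simplified sketch of the bounds-checked field reads; the field layout and index numbering here are illustrative (the real needle body carries more fields than the DataSize prefix and data shown):

```rust
/// Read a length-prefixed data field with explicit truncation errors,
/// mirroring Go's "index out of range %d" behavior instead of silently
/// tolerating a short buffer.
fn read_body_v2(body: &[u8]) -> Result<(u32, Vec<u8>), String> {
    // index 1: the 4-byte DataSize prefix itself
    if body.len() < 4 {
        return Err("index out of range 1".to_string());
    }
    let data_size = u32::from_be_bytes([body[0], body[1], body[2], body[3]]) as usize;
    // index 2: the data bytes promised by DataSize
    if body.len() < 4 + data_size {
        return Err("index out of range 2".to_string());
    }
    let data = body[4..4 + data_size].to_vec();
    Ok((data_size as u32, data))
}
```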
* Match Go download throttle: return JSON error instead of plain text
* Match Go crop params: default x1/y1 to 0 when not provided
* Match Go ScrubEcVolume: accumulate total_files from EC shards
* Match Go ScrubVolume: count total_files even on scrub error
* Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif
* Match Go VolumeTailSender: send needle_header on every chunk
* Match Go read_super_block: apply replication override from .vif
* Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption
* Match Go WriteNeedleBlob: dedup check before writing during replication
* handlers: use meta-only reads for HEAD
* handlers: align range parsing and responses with Go
* handlers: align upload parsing with Go
* deps: enable webp support
* Make the 5bytes feature the default for idx entry compatibility
* Match Go TTL: preserve original unit when count fits in byte
* Fix EC locate_needle: use get_actual_size for full needle size
* Fix raw body POST: only parse multipart when Content-Type contains form-data
* Match Go ReceiveFile: return protocol errors in response body, not gRPC status
* add docs
* Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating
* Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing
* Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection
* Match Go privateStoreHandler: use proper JSON error for unsupported methods
* Match Go Destroy: add only_empty parameter to reject non-empty volume deletion
* Fix compilation: set_read_only_persist and set_writable return ()
These methods fire-and-forget save_vif internally, so gRPC callers
should not try to chain .map_err() on the unit return type.
* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif
* Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth
The gRPC VolumeDelete handler had a pre-check for only_empty but then
passed false to store.delete_volume(), bypassing the store-level check.
Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same
for defense in depth against TOCTOU races (though the store write lock
makes this unlikely).
* Match Go ProcessRangeRequest: return full content for empty/oversized ranges
Go returns nil from ProcessRangeRequest when ranges are empty or total
range size exceeds content length, causing the caller to serve the full
content as a normal 200 response. Rust was returning an empty 200 body.
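The fallback rule can be sketched as follows, with `None` standing in for Go's nil return that makes the caller serve the full content as a plain 200 (the range tuple type is illustrative):

```rust
/// Decide whether to honor parsed byte ranges. Empty or oversized range
/// sets are ignored and the full content is served instead.
fn effective_ranges(
    ranges: Vec<(u64, u64)>, // inclusive (start, end) pairs
    content_length: u64,
) -> Option<Vec<(u64, u64)>> {
    if ranges.is_empty() {
        return None;
    }
    let total: u64 = ranges.iter().map(|(start, end)| end - start + 1).sum();
    if total > content_length {
        // The ranges ask for more than exists; ignore the Range header.
        return None;
    }
    Some(ranges)
}
```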
* Match Go Query: quote JSON keys in output records
Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}.
Rust was producing invalid JSON with unquoted keys like {name:"Alice"}.
* Match Go VolumeCopy: reject when no suitable disk location exists
Go returns ErrVolumeNoSpaceLeft when no location matches the disk type
and has sufficient space. Rust had an unsafe fallback that silently
picked the first location regardless of type or available space.
* Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete
Go checks v.noWriteOrDelete before proceeding with needle deletion,
returning "volume is read only" if true. Rust was skipping this check.
* Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors
Two fixes: (1) Go prefers HardDriveType disk location for EC volumes,
falling back to first location. Returns "no storage location available"
when no locations exist. (2) Write failures are now response-level
errors (in response body) instead of gRPC status errors, matching Go.
* Match Go CopyFile: sync EC volume journal to disk before copying
Go calls ecVolume.Sync() before copying EC volume files to ensure the
.ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and
call it in the CopyFile EC branch.
* Match Go readSuperBlock: propagate replication parse errors
Go returns an error when parsing the replication string from the .vif
file fails. Rust was silently ignoring the parse failure and using the
super block's replication as-is.
* Match Go TTL expiry: remove append_at_ns > 0 guard
Go computes TTL expiry from AppendAtNs without guarding against zero.
When append_at_ns is 0, the expiry is epoch + TTL which is in the past,
correctly returning NotFound. Rust's extra guard skipped the check,
incorrectly returning success for such needles.
* Match Go delete_collection: skip volumes with compaction in progress
Go checks !v.isCompactionInProgress.Load() before destroying a volume
during collection deletion, skipping compacting volumes. Also changed
destroy errors to log instead of aborting the entire collection delete.
* Match Go MarkReadonly/MarkWritable: always notify master even on local error
Go always notifies the master regardless of whether the local
set_read_only_persist or set_writable step fails. The Rust code was
using `?` which short-circuited on error, skipping the final master
notification. Save the result and defer the `?` until after the
notify call.
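The save-then-defer pattern looks roughly like this; the function bodies are stand-ins for the real persist step and gRPC notification:

```rust
/// Stand-in for the local persist step, which may fail.
fn set_read_only_persist(ok: bool) -> Result<(), String> {
    if ok { Ok(()) } else { Err("persist failed".to_string()) }
}

/// Save the local result, always notify the master, and only then
/// propagate any local error.
fn mark_readonly(persist_ok: bool, notified: &mut bool) -> Result<(), String> {
    // Do NOT use `?` here: a local failure must not skip the notify.
    let result = set_read_only_persist(persist_ok);
    *notified = true; // stands in for the gRPC notification to the master
    result // deferred `?` equivalent: propagate after notifying
}
```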
* Match Go PostHandler: return 500 for all write errors
Go returns 500 (InternalServerError) for all write failures. Rust was
returning 404 for volume-not-found and 403 for read-only volumes.
* Match Go makeupDiff: validate .cpd compaction revision is old + 1
Go reads the new .cpd file's super block and verifies the compaction
revision is exactly old + 1. Rust only validated the old revision.
* Match Go VolumeStatus: check data backend before returning status
Go checks v.DataBackend != nil before building the status response,
returning an error if missing. Rust was silently returning size 0.
* Match Go PostHandler: always include mime field in upload response JSON
Go always serializes the mime field even when empty ("mime":""). Rust was
omitting it when empty due to Option<String> with skip_serializing_if.
* Match Go FindFreeLocation: account for EC shards in free slot calculation
Go subtracts EC shard equivalents when computing available volume slots.
Rust was only comparing volume count, potentially over-counting free
slots on locations with many EC shards.
* Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods
Go records the method as INVALID in metrics for unsupported HTTP methods.
Rust was using the actual method name.
* Match Go volume: add commit_compact guard and scrub data size validation
Two fixes: (1) commit_compact now checks/sets is_compacting flag to
prevent concurrent commits, matching Go's CompareAndSwap guard.
(2) scrub now validates total needle sizes against .dat file size.
* Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check
Three fixes: (1) VolumeTailSender now propagates binary search errors
instead of silently falling back to start. (2) VolumeEcShardsInfo
returns entries for all shard slots including unmounted. (3)
VolumeEcShardRead checks .ecx index for deletions instead of .ecj.
* Match Go metrics: add BuildInfo gauge and connection tracking functions
Go exposes a BuildInfo Prometheus metric with version labels, and tracks
open connections via stats.ConnectionOpen/Close. Added both to Rust.
* Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid()
Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles
with size==0 (live but anomalous) can still be deleted. The Rust code
was using is_valid() which returns false for size==0, preventing
deletion of such needles.
* Match Go fitTtlCount: always normalize TTL to coarsest unit
Go's fitTtlCount always converts to seconds first, then finds the
coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an
early return for count<=255 that skipped normalization, producing
different binary encodings for the same duration.
* Match Go BuildInfo metric: correct name and add missing labels
Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build,
Name=info) with labels [version, commit, sizelimit, goos, goarch].
Rust had SeaweedFS_volumeServer_buildInfo with only [version].
* Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag
- UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty)
- UploadResult.contentMd5: only include when request provided Content-MD5 header
- Content-MD5 response header: only set when request provided it
- DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType)
to match Go's protobuf JSON marshaling
- Chunk manifest: preserve needle ETag in expanded response headers
* Match Go volume: fix version(), integrity check, scrub, and commit_compact
- version(): use self.version() instead of self.super_block.version in
read_all_needles, check_volume_data_integrity, scan_raw_needles_from
to respect volumeInfo.version override
- check_volume_data_integrity: initialize healthy_index_size to idx_size
(matching Go) and continue on EOF instead of returning error
- scrub(): count deleted needles in total_read since they still occupy
space in the .dat file (matches Go's totalRead += actualSize for deleted)
- commit_compact: clean up .cpd/.cpx files on makeup_diff failure
(matches Go's error path cleanup)
* Match Go write queue: add 4MB batch byte limit
Go's startWorker breaks the batch at either 128 requests or 4MB of
accumulated write data. Rust only had the 128-request limit, allowing
large writes to accumulate unbounded latency.
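A sketch of the dual limit, breaking the batch at 128 requests or 4 MiB of accumulated bytes, whichever comes first (the `Vec<Vec<u8>>` queue is illustrative; the real queue is a channel-fed worker):

```rust
const MAX_BATCH_REQUESTS: usize = 128;
const MAX_BATCH_BYTES: usize = 4 * 1024 * 1024;

/// Drain the front of the queue into one batch, respecting both limits.
/// The first request is always taken so an oversized single write still
/// makes progress.
fn take_batch(queue: &mut Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut batch = Vec::new();
    let mut bytes = 0usize;
    while let Some(req) = queue.first() {
        let req_len = req.len();
        if !batch.is_empty()
            && (batch.len() >= MAX_BATCH_REQUESTS || bytes + req_len > MAX_BATCH_BYTES)
        {
            break;
        }
        bytes += req_len;
        batch.push(queue.remove(0));
    }
    batch
}
```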
* Add TTL normalization tests for Go parity verification
Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even
when count fits in a byte, matching Go's fitTtlCount behavior.
* Match Go FindFreeLocation: account for EC shards in free slot calculation
Go's free volume count subtracts both regular volumes and EC volumes
from max_volume_count. Rust was only counting regular volumes, which
could over-report available slots when EC shards are mounted.
* Match Go EC volume: mark deletions in .ecx and replay .ecj at startup
Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index
in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition
to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj
entries into .ecx on startup, then removes the .ecj file.
Rust was only appending to .ecj without marking .ecx, which meant
deleted EC needles remained readable via .ecx binary search. This
fix:
- Opens .ecx in read/write mode (was read-only)
- Adds mark_needle_deleted_in_ecx: binary search + in-place write
- Calls it from journal_delete before appending to .ecj
- Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup
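The in-place mark can be sketched over an in-memory buffer; the 16-byte entry layout (8-byte key, 4-byte offset, 4-byte size, sorted by key) mirrors the idx format, while the tombstone value used here is illustrative:

```rust
const ENTRY_SIZE: usize = 16;
const TOMBSTONE_SIZE: u32 = u32::MAX; // illustrative tombstone marker

/// Binary-search the sorted .ecx entries for `needle_id` and overwrite
/// the size field in place, like Go's DeleteNeedleFromEcx. Returns true
/// if the needle was found and marked.
fn mark_needle_deleted_in_ecx(ecx: &mut [u8], needle_id: u64) -> bool {
    let n = ecx.len() / ENTRY_SIZE;
    let (mut lo, mut hi) = (0usize, n);
    while lo < hi {
        let mid = (lo + hi) / 2;
        let off = mid * ENTRY_SIZE;
        let mut key_bytes = [0u8; 8];
        key_bytes.copy_from_slice(&ecx[off..off + 8]);
        let key = u64::from_be_bytes(key_bytes);
        if key == needle_id {
            ecx[off + 12..off + 16].copy_from_slice(&TOMBSTONE_SIZE.to_be_bytes());
            return true;
        } else if key < needle_id {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    false
}
```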
* Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14
Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by
default, but custom EC configs via .vif can have more shards (up to
MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files
are checked regardless of EC configuration.
* Match Go EC locate: subtract 1 from shard size and use datFileSize override
Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to
LocateData (shards are padded, -1 avoids overcounting large block
rows). When datFileSize is known, Go uses datFileSize/DataShards
instead. Rust was passing the raw shard file size without adjustment.
* Fix TTL parsing and DiskStatus field names to match Go exactly
TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d,
not 1w) and errors on count > 255. The previous normalization change
was incorrect — Go only normalizes internally via fitTtlCount, not
during string parsing.
DiskStatus: Go uses encoding/json on protobuf structs, which reads
the json struct tags (snake_case: percent_free, percent_used,
disk_type), not the protobuf JSON names (camelCase). Revert to
snake_case to match Go's actual output.
* Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first
Match Go's volume_grpc_client_to_master.go behavior:
1. Only trigger leader redirect when the leader address differs from the
current master (prevents unnecessary reconnect loops when master confirms
its own address).
2. Process duplicated_uuids before leader redirect check, matching Go's
ordering where duplicate UUID detection takes priority.
* Remove SetState version check to match Go behavior
Go's SetState unconditionally applies the state without any version
mismatch check. The Rust version had an extra optimistic concurrency
check that would reject valid requests from Go clients that don't
track versions.
* Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL
Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes
to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w,
24h->1d). The Rust version was preserving the original unit, producing
different binary encodings on disk and in heartbeat messages.
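A sketch of the normalization, assuming Go's divisibility-based scan from coarsest to finest unit (return type and suffix strings are illustrative):

```rust
/// Convert a (count, unit) TTL to seconds, then pick the coarsest unit
/// that divides it evenly with a count fitting in one byte.
fn fit_ttl(count: u32, unit_seconds: u32) -> (u8, &'static str) {
    let total = count * unit_seconds;
    if total == 0 {
        return (0, "");
    }
    // (seconds per unit, suffix), coarsest first
    let units: [(u32, &'static str); 6] = [
        (365 * 24 * 3600, "y"),
        (30 * 24 * 3600, "M"),
        (7 * 24 * 3600, "w"),
        (24 * 3600, "d"),
        (3600, "h"),
        (60, "m"),
    ];
    for &(secs, suffix) in units.iter() {
        if total % secs == 0 && total / secs <= 255 {
            return ((total / secs) as u8, suffix);
        }
    }
    (0, "")
}
```

This reproduces the examples above: 120 minutes becomes 2h, 7 days becomes 1w, 24 hours becomes 1d.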
* Always return Content-MD5 header and JSON field on successful writes
Go always sets Content-MD5 in the response regardless of whether the
request included it. The Rust version was conditionally including it
only when the request provided Content-MD5.
* Include name and size in UploadResult JSON even when empty/zero
Go's encoding/json always includes empty strings and zero values in
the upload response. The Rust version was using skip_serializing_if
to omit them, causing JSON structure differences.
* Include deleted needles in scan_raw_needles_from to match Go
Go's ScanVolumeFileFrom visits ALL needles including deleted ones.
Skipping deleted entries during incremental copy would cause tombstones
to not be propagated, making deleted files reappear on the receiving side.
* Match Go NeedleMap.Delete: always write tombstone to idx file
Go's NeedleMap.Delete unconditionally writes a tombstone entry to the
idx file and updates metrics, even if the needle doesn't exist or is
already deleted. This is important for replication where every delete
operation must produce an idx write. The Rust version was skipping the
tombstone write for non-existent or already-deleted needles.
* Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest
* Title-case Seaweed-* pair keys to match Go HTTP header canonicalization
* Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType
* Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw()
* Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount
* Include disk-space-low in Volume.is_read_only() matching Go
* Log error on CIDR parse failure in whitelist matching Go's glog.Errorf
* Log cookie mismatch in gRPC Query matching Go's V(0).Infof
* Fix is_expired volume_size comparison to use < matching Go
Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than),
but Rust used `<=`. This meant Rust would fail to expire a volume that
is exactly SUPER_BLOCK_SIZE bytes.
* Apply Go's JWT expiry defaults: 10s write, 60s read
Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and
v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust
defaulted to 0 for both, which meant tokens would never expire when
security.toml has a signing key but omits expires_after_seconds.
* Stop [grpc.volume].ca from overriding [grpc].ca matching Go
Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e.
the [grpc] section. The [grpc.volume] section only provides cert and
key. Rust was also reading ca from [grpc.volume] which would silently
override the [grpc].ca value when both were present.
* Fix free_volume_count to use EC shard count matching Go
Was counting EC volumes instead of EC shards, which underestimates EC
space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1.
Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount.
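The formula in integer arithmetic, with the Go default of 10 data shards (signature is illustrative):

```rust
const DATA_SHARDS_COUNT: i64 = 10;

/// Free volume slots, counting each EC shard as 1/DataShardsCount of a
/// slot so a 14-shard EC volume consumes ~1.4 slots.
fn free_volume_count(max_volume_count: i64, volume_count: i64, ec_shard_count: i64) -> i64 {
    ((max_volume_count - volume_count) * DATA_SHARDS_COUNT - ec_shard_count)
        / DATA_SHARDS_COUNT
}
```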
* Include preallocate in compaction space check matching Go
Go uses max(preallocate, estimatedCompactSize) for the free space check.
Rust was only using the estimated volume size, which could start a
compaction that fails mid-way if preallocate exceeds the volume size.
* Check gzip magic bytes before setting Content-Encoding matching Go
Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent
(data starts with 0x1f 0x8b) before setting Content-Encoding: gzip.
Rust only checked Accept-Encoding, which could incorrectly declare
gzip encoding for non-gzip compressed data.
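The two-part check is straightforward to sketch; the magic bytes 0x1f 0x8b are the standard gzip header, and the function names here are illustrative:

```rust
/// True when the payload starts with the gzip magic bytes.
fn is_gzipped_content(data: &[u8]) -> bool {
    data.len() >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

/// Only declare Content-Encoding: gzip when the client accepts gzip AND
/// the stored bytes are actually gzip.
fn should_set_gzip_encoding(accept_encoding: &str, data: &[u8]) -> bool {
    accept_encoding.contains("gzip") && is_gzipped_content(data)
}
```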
* Only set upload response name when needle HasName matching Go
Go checks reqNeedle.HasName() before setting ret.Name. Rust always set
the name from the filename variable, which could return the fid portion
of the path as the name for raw PUT requests without a filename.
* Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation
Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount
is 0, treating it as unlimited. Rust was computing effective_free as
<= 0 for max==0, rejecting the location. This could fail volume
creation during early startup before the first heartbeat adjusts max.
* Read lastAppendAtNs from deleted V3 entries in integrity check
Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries
(verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity).
Rust was skipping deleted entries, which could result in a stale
last_append_at_ns if the last index entry is a deletion.
* Return empty body for empty/oversized range requests matching Go
Go's ProcessRangeRequest returns nil (empty body, 200 OK) when
parsed ranges are empty or combined range size exceeds total content
size. The Rust buffered path incorrectly returned the full file data
for both cases. The streaming path already handled this correctly.
* Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL
Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex()
(ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls
vs.store.ScrubEcVolume(). Rust was ignoring the mode and always
running verify_ec_shards. Now INDEX mode checks ecx index integrity
(sorted overlap detection + file size validation) without shard I/O,
while LOCAL/FULL modes run the existing shard verification.
* Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount
Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit
that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week}
which displays as "1w". Both Go and Rust normalize identically.
* Add version mismatch check to SetState matching Go's State.Update
Go's State.Update compares the incoming version with the stored
version and returns "version mismatch" error if they differ. This
provides optimistic concurrency control. The Rust implementation
was accepting any version unconditionally.
* Use unquoted keys in Query JSON output matching Go's json.ToJson
Go's json.ToJson produces records with unquoted keys like
{score:12} not {"score":12}. This is a custom format used
internally by SeaweedFS for query results.
* Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w
Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount
which converts 7 days to 1 week.
* Include ETag header in 304 Not Modified responses matching Go behavior
Go sets ETag on the response writer (via SetEtag) before the
If-Modified-Since and If-None-Match conditional checks, so both
304 response paths include the ETag header. The Rust implementation
was only adding ETag to 200 responses.
* Remove needle-name fallback in chunk manifest filename resolution
Go's tryHandleChunkedFile only falls back from URL filename to
manifest name. Rust had an extra fallback to needle.name that
Go does not perform, which could produce different
Content-Disposition filenames for chunk manifests.
* Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5
Go's jwt.ParseWithClaims validates the nbf claim when present,
rejecting tokens whose nbf is in the future. The Rust jsonwebtoken
crate defaults validate_nbf to false, so tokens with future nbf
were incorrectly accepted.
* Set isHeartbeating to true at startup matching Go's VolumeServer init
Go unconditionally sets isHeartbeating: true in the VolumeServer
struct literal. Rust was starting with false when masters are
configured, causing /healthz to return 503 until the first
heartbeat succeeds.
* Call store.close() on shutdown matching Go's Shutdown()
Go's Shutdown() calls vs.store.Close() which closes all volumes
and flushes file handles. The Rust server was relying on process
exit for cleanup, which could leave data unflushed.
* Include server ID in maintenance mode error matching Go's format
Go returns "volume server %s is in maintenance mode" with the
store ID. Rust was returning a generic "maintenance mode" message.
* Fix DiskType test: use HardDrive variant matching Go's HddType=""
Go maps both "" and "hdd" to HardDriveType (empty string). The
Rust enum variant is HardDrive, not Hdd. The test referenced a
nonexistent Hdd variant causing compilation failure.
* Do not include ETag in 304 responses matching Go's GetOrHeadHandler
Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match
304 return paths, so Go's 304 responses do not include the ETag header.
The Rust code was incorrectly including ETag in both 304 response paths.
* Return 400 on malformed query strings in PostHandler matching Go's ParseForm
Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when
the query string is malformed. Rust was silently falling back to empty
query params via unwrap_or_default().
* Load EC volume version from .vif matching Go's NewEcVolume
Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif
file. Rust was always using Version::current() (V3), which would produce
wrong needle actual size calculations for volumes created with V1 or V2.
* Sync .ecx file before close matching Go's EcVolume.Close
Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion
marks are flushed to disk. Without this, deletion marks written via
MarkNeedleDeleted could be lost on crash.
* Validate SuperBlock extra data size matching Go's Bytes() guard
Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent
corrupt super block headers. Rust was silently truncating via u16 cast,
which would write an incorrect extra_size field.
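A sketch of the guard, returning an error instead of Go's glog.Fatalf (the function name is illustrative):

```rust
/// The super block encodes extra-data size in a u16 field, so anything
/// above 256*256 - 2 bytes cannot be represented and must be rejected
/// rather than silently truncated by an `as u16` cast.
const MAX_EXTRA_SIZE: usize = 256 * 256 - 2;

fn encode_extra_size(extra: &[u8]) -> Result<u16, String> {
    if extra.len() > MAX_EXTRA_SIZE {
        return Err(format!(
            "super block extra data size {} exceeds {}",
            extra.len(),
            MAX_EXTRA_SIZE
        ));
    }
    Ok(extra.len() as u16)
}
```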
* Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98
Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable
to unauthenticated remote DoS via panic in QUIC transport parameter
parsing.
* Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server
Go's r.FormValue() cannot read multipart text fields after
r.MultipartReader() consumes the body, so ts/ttl sent as multipart
form fields only work with the Rust volume server. Skip this test
when VOLUME_SERVER_IMPL != "rust" to fix CI failure.
* Flush .ecx in EC volume sync_to_disk matching Go's Sync()
Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index
to disk. The Rust version only flushed .ecj, leaving in-place deletion
marks in .ecx unpersisted until close(). This could cause data
inconsistency if the server crashes after marking a needle deleted in
.ecx but before close().
* Remove .vif file in EC volume destroy matching Go's Destroy()
Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust
version only removed .ecx and .ecj, leaving orphaned .vif files on
disk after EC volume destruction (e.g., after TTL expiry).
* Fix is_expired to use <= for SuperBlockSize check matching Go
Go checks contentSize <= SuperBlockSize to detect empty volumes (no
needles). Rust used < which would incorrectly allow a volume with
exactly SuperBlockSize bytes (header only, no data) to proceed to
the TTL expiry check and potentially be marked as expired.
* Fix read_append_at_ns to read timestamps from tombstone entries
Go reads the full needle body for all entries including tombstones
(deleted needles with size=0) to extract the actual AppendAtNs
timestamp. The Rust version returned 0 early for size <= 0 entries,
which would cause the binary search in incremental copy to produce
incorrect results for positions containing deleted needles.
Now uses get_actual_size to compute the on-disk size (which handles
tombstones correctly) and only returns 0 when the actual size is 0.
* Add X-Request-Id response header matching Go's requestIDMiddleware
Go sets both X-Request-Id and x-amz-request-id response headers.
The Rust server only set x-amz-request-id, missing X-Request-Id.
* Add skip_serializing_if for UploadResult name and size fields
Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty",
omitting these fields from JSON when they are zero values (empty
string / 0). The Rust struct always serialized them, producing
"name":"" and "size":0 where Go would omit them.
* Support JSONP/pretty-print for write success responses
Go's writeJsonQuiet checks for callback (JSONP) and pretty query
parameters on all JSON responses including write success. The Rust
write success path used axum::Json directly, bypassing JSONP and
pretty-print support. Now uses json_result_with_query to match Go.
* Include actual limit in file size limit error message
Go returns "file over the limited %d bytes" with the actual limit
value included. Rust returned a generic "file size limit exceeded"
without the limit value, making it harder to debug.
* Extract extension from 2-segment URL paths for image operations
Go's parseURLPath extracts the file extension from all URL formats
including 2-segment paths like /vid,fid.jpg. The Rust version only
handled 3-segment paths (/vid/fid/filename.ext), so extensions in
2-segment paths were lost. This caused image resize/crop operations
requested via query params to be silently skipped for those paths.
* Add size_hint to TrackedBody so throttled downloads get Content-Length
TrackedBody (used for download throttling) did not implement
size_hint(), causing HTTP/1.1 to fall back to chunked transfer
encoding instead of setting Content-Length. Go always sets
Content-Length explicitly for non-range responses.
* Add Last-Modified, pairs, and S3 headers to chunk manifest responses
Go sets Last-Modified, needle pairs, and S3 pass-through headers on
the response writer BEFORE calling tryHandleChunkedFile. Since the
Rust chunk manifest handler created fresh response headers and
returned early, these headers were missing from chunk manifest
responses. Now passes last_modified_str into the chunk manifest
handler and applies pairs and S3 pass-through query params
(response-cache-control, response-content-encoding, etc.) to the
chunk manifest response headers.
* Fix multipart fallback to use first part data when no filename
Go reads the first part's data unconditionally, then looks for a
part with a filename. If none found, Go uses the first part's data
(with empty filename). Rust only captured parts with filenames, so
when no part had a filename it fell back to the raw multipart body
bytes (including boundary delimiters), producing corrupt needle data.
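A sketch of the described fallback, with a part modeled as an (optional filename, data) tuple; the real handler operates on parsed multipart parts, not this simplified shape:

```rust
/// Prefer the first part that carries a filename; otherwise fall back to the
/// first part's data with an empty name — never to the raw multipart body,
/// which would include boundary delimiters.
fn pick_upload_part(parts: &[(Option<&str>, &[u8])]) -> Option<(String, Vec<u8>)> {
    let first = parts.first()?;
    for (name, data) in parts {
        if let Some(n) = name {
            return Some((n.to_string(), data.to_vec()));
        }
    }
    // No part had a filename: use the first part's data with an empty name.
    Some((String::new(), first.1.to_vec()))
}
```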
* Set HasName and HasMime flags for empty values matching Go
Go's CreateNeedleFromRequest sets HasName and HasMime flags even when
the filename or MIME type is empty (len < 256 is true for len 0).
Rust skipped empty values, causing the on-disk needle format to
differ: Go-written needles include extra bytes for the empty name/mime
size fields, changing the serialized needle size in the idx entry.
This ensures binary format compatibility between Go and Rust servers.
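A minimal sketch of the flag logic (the bit values here are illustrative, not the real needle flag constants):

```rust
const FLAG_HAS_NAME: u8 = 0x01; // illustrative bit values
const FLAG_HAS_MIME: u8 = 0x02;

/// Go's check is `len(value) < 256`, which is also true for the empty string,
/// so the flag is set and a zero-length size byte is still written to disk.
fn needle_flags(name: &str, mime: &str) -> u8 {
    let mut flags = 0u8;
    if name.len() < 256 {
        flags |= FLAG_HAS_NAME; // set even when name is empty
    }
    if mime.len() < 256 {
        flags |= FLAG_HAS_MIME; // set even when mime is empty
    }
    flags
}
```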
* Add is_stopping guard to vacuum_volume_commit matching Go
Go's CommitCompactVolume (store_vacuum.go L53-54) checks
s.isStopping before committing compaction to prevent file
swaps during shutdown. The Rust handler was missing this
check, which could allow compaction commits while the
server is stopping.
* Remove disk_type from required status fields since Go omits it
Go's default DiskType is "" (HardDriveType), and protobuf's omitempty
tag causes empty strings to be dropped from JSON output.
* test: honor rust env in dual volume harness
* grpc: notify master after volume lifecycle changes
* http: proxy to replicas before download-limit timeout
* test: pass readMode to rust volume harnesses
* fix store free-location predicate selection
* fix volume copy disk placement and heartbeat notification
* fix chunk manifest delete replication
* fix write replication to survive client disconnects
* fix download limit proxy and wait flow
* fix crop gating for streamed reads
* fix upload limit wait counter behavior
* fix chunk manifest image transforms
* fix has_resize_ops to check width/height > 0 instead of is_some()
Go's shouldResizeImages condition is `width > 0 || height > 0`, so
`?width=0` correctly evaluates to false. Rust was using `is_some()`
which made `?width=0` evaluate to true, unnecessarily disabling
streaming reads for those requests.
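The corrected gate can be sketched as:

```rust
/// Mirror of Go's shouldResizeImages condition: `width > 0 || height > 0`,
/// so `?width=0` (present but zero) does not count as a resize request.
fn has_resize_ops(width: Option<u32>, height: Option<u32>) -> bool {
    width.unwrap_or(0) > 0 || height.unwrap_or(0) > 0
}
```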
* fix Content-MD5 to only compute and return when provided by client
Go only computes the MD5 of uncompressed data when a Content-MD5
header or multipart field is provided. Rust was always computing and
returning it. Also fix the mismatch error message to include size,
matching Go's format.
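A sketch of the gating and message format; the digest itself comes from a crate, so `computed_md5` is passed in pre-computed here, and the expected/received argument order is illustrative:

```rust
/// Only check (and thus only compute) the MD5 when the client supplied one
/// via a Content-MD5 header or multipart field.
fn check_content_md5(
    client_md5: Option<&str>,
    computed_md5: &str,
    size: usize,
) -> Result<(), String> {
    let expected = match client_md5 {
        Some(m) => m,
        None => return Ok(()), // no header/field: skip the check entirely
    };
    if expected == computed_md5 {
        Ok(())
    } else {
        // Mirrors Go's format, including the size suffix.
        Err(format!(
            "Content-MD5 did not match md5 of file data expected [{}] received [{}] size {}",
            expected, computed_md5, size
        ))
    }
}
```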
* fix save_vif to compute ExpireAtSec from TTL
Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds
when the volume has a TTL. The save_vif path (used by set_read_only
and set_writable) was missing this computation, causing .vif files
to be written without the correct expiration timestamp for TTL volumes.
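The computation itself is small; a std-only sketch of the described behavior:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// ExpireAtSec is only meaningful for TTL volumes: now + ttl seconds,
/// 0 (unset) otherwise.
fn expire_at_sec(ttl_seconds: u64) -> u64 {
    if ttl_seconds == 0 {
        return 0;
    }
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before epoch")
        .as_secs();
    now + ttl_seconds
}
```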
* fix set_writable to not modify no_write_can_delete
Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists.
Rust was additionally setting no_write_can_delete=has_remote_file,
which could incorrectly change the write mode for remote-file volumes
when the master explicitly asks to make the volume writable.
* fix write_needle_blob_and_index to error on too-small V3 blob
Go returns an error when the needle blob is too small for timestamp
patching. Rust was silently skipping the patch and writing the blob
with a stale/zero timestamp, which could cause data integrity issues
during incremental replication that relies on AppendAtNs ordering.
* fix VolumeEcShardsToVolume to validate dataShards range
Go validates that dataShards is > 0 and <= MaxShardCount before
proceeding with EC-to-volume reconstruction. Without this check,
a zero or excessively large data_shards value could cause confusing
downstream failures.
* fix destroy to use VolumeError::NotEmpty instead of generic Io error
The dedicated NotEmpty variant exists in the enum but was not being
used. This makes error matching consistent with Go's ErrVolumeNotEmpty.
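A minimal sketch of why the dedicated variant helps; the enum shape here is a guess, not the crate's actual definition:

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum VolumeError {
    NotEmpty,
    Io(String),
}

impl fmt::Display for VolumeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VolumeError::NotEmpty => write!(f, "volume not empty"),
            VolumeError::Io(e) => write!(f, "io error: {}", e),
        }
    }
}

/// Callers can now match on the variant instead of parsing an Io message.
fn destroy(file_count: u64) -> Result<(), VolumeError> {
    if file_count > 0 {
        return Err(VolumeError::NotEmpty);
    }
    Ok(())
}
```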
* fix SetState to persist state to disk with rollback on failure
Go's State.Update saves VolumeServerState to a state.pb file after
each SetState call, and rolls back the in-memory state if persistence
fails. Rust was only updating in-memory atomics, so maintenance mode
would be lost on server restart. Now saves protobuf-encoded state.pb
and loads it on startup.
* fix VolumeTierMoveDatToRemote to close local dat backend after upload
Go calls v.LoadRemoteFile() after saving volume info, which closes
the local DataBackend before transitioning to remote storage. Without
this, the volume holds a stale file handle to the deleted local .dat
file, causing reads to fail until server restart.
* fix VolumeTierMoveDatFromRemote to close remote dat backend after download
Go calls v.DataBackend.Close() and sets DataBackend=nil after removing
the remote file reference. Without this, the stale remote backend
state lingers and reads may not discover the newly downloaded local
.dat file until server restart.
* fix redirect to use internal url instead of public_url
Go's proxyReqToTargetServer builds the redirect Location header from
loc.Url (the internal URL), not publicUrl. Using public_url could
cause redirect failures when internal and external URLs differ.
* fix redirect test and add state_file_path to integration test
Update redirect unit test to expect internal url (matching the
previous fix). Add missing state_file_path field to the integration
test VolumeServerState constructor.
* fix FetchAndWriteNeedle to await all writes before checking errors
Go uses a WaitGroup to await all writes (local + replicas) before
checking errors. Rust was short-circuiting on local write failure,
which could leave replica writes in-flight without waiting for
completion.
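The WaitGroup-style pattern can be sketched with std threads (the real code is async; writes are stubbed as closures here):

```rust
use std::thread;

/// Join every write (local + replicas) first; only then inspect errors,
/// instead of short-circuiting on the first local failure.
fn write_all(
    writes: Vec<Box<dyn FnOnce() -> Result<(), String> + Send>>,
) -> Vec<Result<(), String>> {
    let handles: Vec<_> = writes.into_iter().map(thread::spawn).collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```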
* fix shutdown to send deregister heartbeat before pre_stop delay
Go's StopHeartbeat() closes stopChan immediately on interrupt, causing
the heartbeat goroutine to send the deregister heartbeat right away,
before the preStopSeconds delay. Rust was only setting is_stopping=true
without waking the heartbeat loop, so the deregister was delayed until
after the pre_stop sleep. Now we call volume_state_notify.notify_one()
to wake the heartbeat immediately.
* fix heartbeat response ordering to check duplicate UUIDs first
Go processes heartbeat responses in this order: DuplicatedUuids first,
then volume options (prealloc/size limit), then leader redirect. Rust
was applying volume options before checking for duplicate UUIDs, which
meant volume option changes would take effect even when the response
contained a duplicate UUID error that should cause an immediate return.
* fix a blocked test thread
* fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories
Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which
fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode
bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check
logic error).
* fix: match Go Content-MD5 mismatch error message format
Go uses "Content-MD5 did not match md5 of file data expected [X]
received [Y] size Z" while Rust had a shorter format. Match the
exact Go error string so clients see identical messages.
* fix: match Go Bearer token length check (> 7, not >= 7)
Go requires len(bearer) > 7 ensuring at least one char after
"Bearer ". Rust used >= 7 which would accept an empty token.
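The corrected check, with the prefix match simplified (the exact case handling in Go is not shown here):

```rust
/// `len > 7` rejects "Bearer " alone (exactly 7 bytes): there must be at
/// least one token character after the prefix.
fn parse_bearer(header: &str) -> Option<&str> {
    if header.len() > 7 && header.starts_with("Bearer ") {
        Some(&header[7..])
    } else {
        None
    }
}
```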
* fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4
aws-sdk-s3's default "rustls" feature enables tls-rustls in
aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21
→ rustls-webpki 0.101.7, moderate CRL advisory). Replace with
explicit default-https-client which uses only rustls 0.23 /
rustls-webpki 0.103.9.
* fix: use uploaded filename for auto-compression extension detection
Go extracts the file extension from pu.FileName (the uploaded
filename) for auto-compression decisions. Rust was using the URL
path, which typically has no extension for SeaweedFS file IDs.
* fix: add CRC legacy Value() backward-compat check on needle read
Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) !=
crc.Value(). The Value() path is a deprecated transform for compat
with seaweed versions prior to commit 056c480eb. Rust had the
legacy_value() method but wasn't using it in validation.
* fix: remove /stats/* endpoints to match Go (commented out since L130)
Go's volume_server.go has the /stats/counter, /stats/memory, and
/stats/disk endpoints commented out (lines 130-134). Remove them
from the Rust router along with the now-unused whitelist_guard
middleware.
* fix: filter application/octet-stream MIME for chunk manifests
Go's tryHandleChunkedFile (L334) filters out application/octet-stream
from chunk manifest MIME types, falling back to extension-based
detection. Rust was returning the stored MIME as-is for manifests.
* fix: VolumeMarkWritable returns error before notifying master
Go returns early at L200 if MarkVolumeWritable fails, before
reaching the master notification at L206. Rust was notifying master
even on failure, creating inconsistent state where master thinks
the volume is writable but local marking failed.
* fix: check volume existence before maintenance in MarkReadonly/Writable
Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255)
look up the volume first, then call makeVolumeReadonly/Writable which
checks maintenance. Rust was checking maintenance first, returning
"maintenance mode" instead of "not found" for missing volumes.
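The ordering can be sketched with simplified types (a set of volume IDs standing in for the real store):

```rust
use std::collections::HashSet;

/// Order of checks matters: look the volume up first so a missing volume
/// reports "not found" rather than "maintenance mode".
fn mark_readonly(volumes: &HashSet<u32>, in_maintenance: bool, vid: u32) -> Result<(), String> {
    if !volumes.contains(&vid) {
        return Err(format!("volume {} not found", vid));
    }
    if in_maintenance {
        return Err("server is in maintenance mode".to_string());
    }
    Ok(()) // proceed with the makeVolumeReadonly equivalent
}
```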
* feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360)
Add the mark_broken_volumes_readonly flag from PR #8360:
- Sync proto field (tag 3) to local volume_server.proto
- After scrubbing, if flag is set, call makeVolumeReadonly on each
broken volume (notify master, mark local readonly, notify again)
- Collect errors via joined error semantics matching Go's errors.Join
- Factor out make_volume_readonly helper reused by both
VolumeMarkReadonly and ScrubVolume
Also refactors VolumeMarkReadonly to use the shared helper.
* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4)
CRL Distribution Point matching logic fix for moderate severity
advisory about CRLs not considered authoritative.
* test: update integration tests for removed /stats/* endpoints
Replace tests that expected /stats/* routes to return 200/401 with
tests confirming they now fall through to the store handler (400),
matching Go's commented-out stats endpoints.
* docs: fix misleading comment about default offset feature
The comment said "4-byte offsets unless explicitly built with 5-byte
support" but the default feature enables 5bytes. This is intentional
for production parity with Go -tags 5BytesOffset builds. Fix the
comment to match reality.
seaweed-volume/src/config.rs (new file, 1697 lines; diff suppressed because it is too large)
seaweed-volume/src/images.rs (new file, 275 lines)
@@ -0,0 +1,275 @@
//! JPEG EXIF orientation auto-fix, matching Go's `FixJpgOrientation`.
//!
//! Reads the EXIF orientation tag from JPEG data and rotates/flips the image
//! to normalize it to orientation 1 (top-left). If EXIF parsing fails or
//! orientation is already normal, returns the original data unchanged.

use std::io::Cursor;

use image::{DynamicImage, GenericImageView, ImageFormat, RgbaImage};

/// EXIF orientation tag values.
/// See: <http://sylvana.net/jpegcrop/exif_orientation.html>
const TOP_LEFT_SIDE: u32 = 1;
const TOP_RIGHT_SIDE: u32 = 2;
const BOTTOM_RIGHT_SIDE: u32 = 3;
const BOTTOM_LEFT_SIDE: u32 = 4;
const LEFT_SIDE_TOP: u32 = 5;
const RIGHT_SIDE_TOP: u32 = 6;
const RIGHT_SIDE_BOTTOM: u32 = 7;
const LEFT_SIDE_BOTTOM: u32 = 8;

/// Fix JPEG orientation based on EXIF data.
///
/// Reads the EXIF orientation tag and applies the appropriate rotation/flip
/// to normalize the image to orientation 1 (top-left). Re-encodes as JPEG.
///
/// Returns the original data unchanged if:
/// - EXIF data cannot be parsed
/// - No orientation tag is present
/// - Orientation is already 1 (normal)
/// - Image decoding or re-encoding fails
pub fn fix_jpg_orientation(data: &[u8]) -> Vec<u8> {
    // Parse EXIF data
    let orientation = match read_exif_orientation(data) {
        Some(o) => o,
        None => return data.to_vec(),
    };

    // Orientation 1 means normal — no transformation needed
    if orientation == TOP_LEFT_SIDE {
        return data.to_vec();
    }

    // Determine rotation angle and flip mode
    let (angle, flip_horizontal) = match orientation {
        TOP_RIGHT_SIDE => (0, true),
        BOTTOM_RIGHT_SIDE => (180, false),
        BOTTOM_LEFT_SIDE => (180, true),
        LEFT_SIDE_TOP => (-90, true),
        RIGHT_SIDE_TOP => (-90, false),
        RIGHT_SIDE_BOTTOM => (90, true),
        LEFT_SIDE_BOTTOM => (90, false),
        _ => return data.to_vec(),
    };

    // Decode the image
    let src_image = match image::load_from_memory_with_format(data, ImageFormat::Jpeg) {
        Ok(img) => img,
        Err(_) => return data.to_vec(),
    };

    // Apply rotation then flip (matching Go's flip(rotate(img, angle), flipMode))
    let transformed = flip_horizontal_if(rotate(src_image, angle), flip_horizontal);

    // Re-encode as JPEG
    let mut buf = Cursor::new(Vec::new());
    match transformed.write_to(&mut buf, ImageFormat::Jpeg) {
        Ok(_) => buf.into_inner(),
        Err(_) => data.to_vec(),
    }
}

/// Read the EXIF orientation tag from JPEG data.
/// Returns None if EXIF cannot be parsed or orientation tag is not present.
fn read_exif_orientation(data: &[u8]) -> Option<u32> {
    let exif_reader = exif::Reader::new();
    let mut cursor = Cursor::new(data);
    let exif_data = exif_reader.read_from_container(&mut cursor).ok()?;

    let orientation_field = exif_data.get_field(exif::Tag::Orientation, exif::In::PRIMARY)?;
    match orientation_field.value {
        exif::Value::Short(ref v) if !v.is_empty() => Some(v[0] as u32),
        _ => orientation_field.value.get_uint(0),
    }
}

/// Rotate an image by the given angle (counter-clockwise, in degrees).
/// Matches Go's rotate function.
fn rotate(img: DynamicImage, angle: i32) -> DynamicImage {
    let (width, height) = img.dimensions();

    match angle {
        90 => {
            // 90 degrees counter-clockwise
            let new_w = height;
            let new_h = width;
            let mut out = RgbaImage::new(new_w, new_h);
            for y in 0..new_h {
                for x in 0..new_w {
                    out.put_pixel(x, y, img.get_pixel(new_h - 1 - y, x));
                }
            }
            DynamicImage::ImageRgba8(out)
        }
        -90 => {
            // 90 degrees clockwise (or 270 counter-clockwise)
            let new_w = height;
            let new_h = width;
            let mut out = RgbaImage::new(new_w, new_h);
            for y in 0..new_h {
                for x in 0..new_w {
                    out.put_pixel(x, y, img.get_pixel(y, new_w - 1 - x));
                }
            }
            DynamicImage::ImageRgba8(out)
        }
        180 | -180 => {
            let mut out = RgbaImage::new(width, height);
            for y in 0..height {
                for x in 0..width {
                    out.put_pixel(x, y, img.get_pixel(width - 1 - x, height - 1 - y));
                }
            }
            DynamicImage::ImageRgba8(out)
        }
        _ => img,
    }
}

/// Flip the image horizontally if requested.
/// In Go, flipMode 2 == FlipHorizontal. We simplify since only horizontal flip is used.
fn flip_horizontal_if(img: DynamicImage, do_flip: bool) -> DynamicImage {
    if !do_flip {
        return img;
    }
    let (width, height) = img.dimensions();
    let mut out = RgbaImage::new(width, height);
    for y in 0..height {
        for x in 0..width {
            out.put_pixel(x, y, img.get_pixel(width - 1 - x, y));
        }
    }
    DynamicImage::ImageRgba8(out)
}

/// Returns true if the given MIME type or file path extension indicates a JPEG file.
pub fn is_jpeg(mime_type: &str, path: &str) -> bool {
    if mime_type == "image/jpeg" {
        return true;
    }
    let lower = path.to_lowercase();
    lower.ends_with(".jpg") || lower.ends_with(".jpeg")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_non_jpeg_data_returned_unchanged() {
        let data = b"not a jpeg file at all";
        let result = fix_jpg_orientation(data);
        assert_eq!(result, data);
    }

    #[test]
    fn test_jpeg_without_exif_returned_unchanged() {
        // Create a minimal JPEG without EXIF data
        let img = DynamicImage::ImageRgba8(RgbaImage::new(2, 2));
        let mut buf = Cursor::new(Vec::new());
        img.write_to(&mut buf, ImageFormat::Jpeg).unwrap();
        let jpeg_data = buf.into_inner();

        let result = fix_jpg_orientation(&jpeg_data);
        // Should return data unchanged (no EXIF orientation tag)
        // Just verify it's still valid JPEG
        assert!(!result.is_empty());
        assert_eq!(&result[0..2], &[0xFF, 0xD8]); // JPEG magic bytes
    }

    #[test]
    fn test_is_jpeg() {
        assert!(is_jpeg("image/jpeg", ""));
        assert!(is_jpeg("", "/3,abc.jpg"));
        assert!(is_jpeg("", "/3,abc.JPEG"));
        assert!(is_jpeg("application/octet-stream", "/3,abc.JPG"));
        assert!(!is_jpeg("image/png", "/3,abc.png"));
        assert!(!is_jpeg("", "/3,abc.png"));
    }

    #[test]
    fn test_rotate_180() {
        // Create a 2x2 image with distinct pixel colors
        let mut img = RgbaImage::new(2, 2);
        img.put_pixel(0, 0, image::Rgba([255, 0, 0, 255])); // red top-left
        img.put_pixel(1, 0, image::Rgba([0, 255, 0, 255])); // green top-right
        img.put_pixel(0, 1, image::Rgba([0, 0, 255, 255])); // blue bottom-left
        img.put_pixel(1, 1, image::Rgba([255, 255, 0, 255])); // yellow bottom-right
        let dynamic = DynamicImage::ImageRgba8(img);

        let rotated = rotate(dynamic, 180);
        let (w, h) = rotated.dimensions();
        assert_eq!((w, h), (2, 2));
        // After 180 rotation: top-left should be yellow, top-right should be blue
        assert_eq!(rotated.get_pixel(0, 0), image::Rgba([255, 255, 0, 255]));
        assert_eq!(rotated.get_pixel(1, 0), image::Rgba([0, 0, 255, 255]));
        assert_eq!(rotated.get_pixel(0, 1), image::Rgba([0, 255, 0, 255]));
        assert_eq!(rotated.get_pixel(1, 1), image::Rgba([255, 0, 0, 255]));
    }

    #[test]
    fn test_rotate_90_ccw() {
        // Create 3x2 image (width=3, height=2)
        let mut img = RgbaImage::new(3, 2);
        img.put_pixel(0, 0, image::Rgba([1, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([2, 0, 0, 255]));
        img.put_pixel(2, 0, image::Rgba([3, 0, 0, 255]));
        img.put_pixel(0, 1, image::Rgba([4, 0, 0, 255]));
        img.put_pixel(1, 1, image::Rgba([5, 0, 0, 255]));
        img.put_pixel(2, 1, image::Rgba([6, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let rotated = rotate(dynamic, 90);
        let (w, h) = rotated.dimensions();
        // 90 CCW: width=3,height=2 -> new_w=2, new_h=3
        assert_eq!((w, h), (2, 3));
        // Top-right (2,0) should move to top-left (0,0) in CCW 90
        assert_eq!(rotated.get_pixel(0, 0)[0], 3);
        assert_eq!(rotated.get_pixel(1, 0)[0], 6);
    }

    #[test]
    fn test_rotate_neg90_cw() {
        // Create 3x2 image
        let mut img = RgbaImage::new(3, 2);
        img.put_pixel(0, 0, image::Rgba([1, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([2, 0, 0, 255]));
        img.put_pixel(2, 0, image::Rgba([3, 0, 0, 255]));
        img.put_pixel(0, 1, image::Rgba([4, 0, 0, 255]));
        img.put_pixel(1, 1, image::Rgba([5, 0, 0, 255]));
        img.put_pixel(2, 1, image::Rgba([6, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let rotated = rotate(dynamic, -90);
        let (w, h) = rotated.dimensions();
        assert_eq!((w, h), (2, 3));
        // -90 (CW 90): top-left (0,0) should go to top-right
        assert_eq!(rotated.get_pixel(0, 0)[0], 4);
        assert_eq!(rotated.get_pixel(1, 0)[0], 1);
    }

    #[test]
    fn test_flip_horizontal() {
        let mut img = RgbaImage::new(2, 1);
        img.put_pixel(0, 0, image::Rgba([10, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([20, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let flipped = flip_horizontal_if(dynamic, true);
        assert_eq!(flipped.get_pixel(0, 0)[0], 20);
        assert_eq!(flipped.get_pixel(1, 0)[0], 10);
    }

    #[test]
    fn test_flip_horizontal_noop() {
        let mut img = RgbaImage::new(2, 1);
        img.put_pixel(0, 0, image::Rgba([10, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([20, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let not_flipped = flip_horizontal_if(dynamic, false);
        assert_eq!(not_flipped.get_pixel(0, 0)[0], 10);
        assert_eq!(not_flipped.get_pixel(1, 0)[0], 20);
    }
}
seaweed-volume/src/lib.rs (new file, 27 lines)
@@ -0,0 +1,27 @@
pub mod config;
pub mod images;
pub mod metrics;
pub mod remote_storage;
pub mod security;
pub mod server;
pub mod storage;
pub mod version;

/// Generated protobuf modules.
pub mod pb {
    pub const FILE_DESCRIPTOR_SET: &[u8] =
        tonic::include_file_descriptor_set!("seaweed_descriptor");

    pub mod remote_pb {
        tonic::include_proto!("remote_pb");
    }
    pub mod volume_server_pb {
        tonic::include_proto!("volume_server_pb");
    }
    pub mod master_pb {
        tonic::include_proto!("master_pb");
    }
    pub mod filer_pb {
        tonic::include_proto!("filer_pb");
    }
}
seaweed-volume/src/main.rs (new file, 1051 lines; diff suppressed because it is too large)
448
seaweed-volume/src/metrics.rs
Normal file
448
seaweed-volume/src/metrics.rs
Normal file
@@ -0,0 +1,448 @@
|
||||
//! Prometheus metrics for the volume server.
|
||||
//!
|
||||
//! Mirrors the Go SeaweedFS volume server metrics.
|
||||
|
||||
use prometheus::{
|
||||
self, Encoder, GaugeVec, HistogramOpts, HistogramVec, IntCounterVec, IntGauge, IntGaugeVec,
|
||||
Opts, Registry, TextEncoder,
|
||||
};
|
||||
use std::sync::Once;
|
||||
|
||||
use crate::version;
|
||||
|
||||
#[derive(Clone, Debug, Default, PartialEq, Eq)]
|
||||
pub struct PushGatewayConfig {
|
||||
pub address: String,
|
||||
pub interval_seconds: u32,
|
||||
}
|
||||
|
||||
lazy_static::lazy_static! {
|
||||
pub static ref REGISTRY: Registry = Registry::new();
|
||||
|
||||
// ---- Request metrics (Go: VolumeServerRequestCounter, VolumeServerRequestHistogram) ----
|
||||
|
||||
/// Request counter with labels `type` (HTTP method) and `code` (HTTP status).
|
||||
pub static ref REQUEST_COUNTER: IntCounterVec = IntCounterVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_request_total", "Volume server requests"),
|
||||
&["type", "code"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Request duration histogram with label `type` (HTTP method).
|
||||
pub static ref REQUEST_DURATION: HistogramVec = HistogramVec::new(
|
||||
HistogramOpts::new(
|
||||
"SeaweedFS_volumeServer_request_seconds",
|
||||
"Volume server request duration in seconds",
|
||||
).buckets(exponential_buckets(0.0001, 2.0, 24)),
|
||||
&["type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- Handler counters (Go: VolumeServerHandlerCounter) ----
|
||||
|
||||
/// Handler-level operation counter with label `type`.
|
||||
pub static ref HANDLER_COUNTER: IntCounterVec = IntCounterVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_handler_total", "Volume server handler counters"),
|
||||
&["type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- Vacuuming metrics (Go: VolumeServerVacuuming*) ----
|
||||
|
||||
/// Vacuuming compact counter with label `success` (true/false).
|
||||
pub static ref VACUUMING_COMPACT_COUNTER: IntCounterVec = IntCounterVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_vacuuming_compact_count", "Counter of volume vacuuming Compact counter"),
|
||||
&["success"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Vacuuming commit counter with label `success` (true/false).
|
||||
pub static ref VACUUMING_COMMIT_COUNTER: IntCounterVec = IntCounterVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_vacuuming_commit_count", "Counter of volume vacuuming commit counter"),
|
||||
&["success"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Vacuuming duration histogram with label `type` (compact/commit).
|
||||
pub static ref VACUUMING_HISTOGRAM: HistogramVec = HistogramVec::new(
|
||||
HistogramOpts::new(
|
||||
"SeaweedFS_volumeServer_vacuuming_seconds",
|
||||
"Volume vacuuming duration in seconds",
|
||||
).buckets(exponential_buckets(0.0001, 2.0, 24)),
|
||||
&["type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- Volume gauges (Go: VolumeServerVolumeGauge, VolumeServerReadOnlyVolumeGauge) ----
|
||||
|
||||
/// Volumes per collection and type (volume/ec_shards).
|
||||
pub static ref VOLUME_GAUGE: GaugeVec = GaugeVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_volumes", "Number of volumes"),
|
||||
&["collection", "type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Read-only volumes per collection and type.
|
||||
pub static ref READ_ONLY_VOLUME_GAUGE: GaugeVec = GaugeVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_read_only_volumes", "Number of read-only volumes."),
|
||||
&["collection", "type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Maximum number of volumes this server can hold.
|
||||
pub static ref MAX_VOLUMES: IntGauge = IntGauge::new(
|
||||
"SeaweedFS_volumeServer_max_volumes",
|
||||
"Maximum number of volumes",
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- Disk size gauges (Go: VolumeServerDiskSizeGauge) ----
|
||||
|
||||
/// Actual disk size used by volumes per collection and type (normal/deleted_bytes/ec).
|
||||
pub static ref DISK_SIZE_GAUGE: GaugeVec = GaugeVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_total_disk_size", "Actual disk size used by volumes"),
|
||||
&["collection", "type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- Resource gauges (Go: VolumeServerResourceGauge) ----
|
||||
|
||||
/// Disk resource usage per directory and type (all/used/free/avail).
|
||||
pub static ref RESOURCE_GAUGE: GaugeVec = GaugeVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_resource", "Server resource usage"),
|
||||
&["name", "type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- In-flight gauges (Go: VolumeServerInFlightRequestsGauge, InFlightDownload/UploadSize) ----
|
||||
|
||||
/// In-flight requests per HTTP method.
|
||||
pub static ref INFLIGHT_REQUESTS_GAUGE: IntGaugeVec = IntGaugeVec::new(
|
||||
Opts::new("SeaweedFS_volumeServer_in_flight_requests", "Current number of in-flight requests being handled by volume server."),
|
||||
&["type"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Concurrent download limit in bytes.
|
||||
pub static ref CONCURRENT_DOWNLOAD_LIMIT: IntGauge = IntGauge::new(
|
||||
"SeaweedFS_volumeServer_concurrent_download_limit",
|
||||
"Limit for total concurrent download size in bytes",
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Concurrent upload limit in bytes.
|
||||
pub static ref CONCURRENT_UPLOAD_LIMIT: IntGauge = IntGauge::new(
|
||||
"SeaweedFS_volumeServer_concurrent_upload_limit",
|
||||
"Limit for total concurrent upload size in bytes",
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Current in-flight download bytes.
|
||||
pub static ref INFLIGHT_DOWNLOAD_SIZE: IntGauge = IntGauge::new(
|
||||
"SeaweedFS_volumeServer_in_flight_download_size",
|
||||
"In flight total download size.",
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Current in-flight upload bytes.
|
||||
pub static ref INFLIGHT_UPLOAD_SIZE: IntGauge = IntGauge::new(
|
||||
"SeaweedFS_volumeServer_in_flight_upload_size",
|
||||
"In flight total upload size.",
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- Legacy aliases for backward compat with existing code ----
|
||||
|
||||
/// Total number of volumes on this server (flat gauge).
|
||||
pub static ref VOLUMES_TOTAL: IntGauge = IntGauge::new(
|
||||
"volume_server_volumes_total",
|
||||
"Total number of volumes",
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Disk size in bytes per directory.
|
||||
pub static ref DISK_SIZE_BYTES: IntGaugeVec = IntGaugeVec::new(
|
||||
Opts::new("volume_server_disk_size_bytes", "Disk size in bytes"),
|
||||
&["dir"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Disk free bytes per directory.
|
||||
pub static ref DISK_FREE_BYTES: IntGaugeVec = IntGaugeVec::new(
|
||||
Opts::new("volume_server_disk_free_bytes", "Disk free space in bytes"),
|
||||
&["dir"],
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Current number of in-flight requests (flat gauge).
|
||||
pub static ref INFLIGHT_REQUESTS: IntGauge = IntGauge::new(
|
||||
"volume_server_inflight_requests",
|
||||
"Current number of in-flight requests",
|
||||
).expect("metric can be created");
|
||||
|
||||
/// Total number of files stored across all volumes.
|
||||
pub static ref VOLUME_FILE_COUNT: IntGauge = IntGauge::new(
|
||||
"volume_server_volume_file_count",
|
||||
"Total number of files stored across all volumes",
|
||||
).expect("metric can be created");
|
||||
|
||||
// ---- Build info (Go: BuildInfo) ----
|
||||
|
||||
/// Build information gauge, always set to 1. Matches Go:
|
||||
/// Namespace="SeaweedFS", Subsystem="build", Name="info",
|
||||
/// labels: version, commit, sizelimit, goos, goarch.
|
||||
pub static ref BUILD_INFO: GaugeVec = GaugeVec::new(
|
||||
Opts::new("SeaweedFS_build_info", "A metric with a constant '1' value labeled by version, commit, sizelimit, goos, and goarch from which SeaweedFS was built."),
|
||||
&["version", "commit", "sizelimit", "goos", "goarch"],
|
||||
).expect("metric can be created");
|
||||
}
|
||||
|
||||
/// Generate exponential bucket boundaries for histograms.
|
||||
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
|
||||
let mut buckets = Vec::with_capacity(count);
|
||||
let mut val = start;
|
||||
for _ in 0..count {
|
||||
buckets.push(val);
|
||||
val *= factor;
|
||||
}
|
||||
buckets
|
||||
}

// Handler counter type constants (matches Go's metrics_names.go).
pub const WRITE_TO_LOCAL_DISK: &str = "writeToLocalDisk";
pub const WRITE_TO_REPLICAS: &str = "writeToReplicas";
pub const DOWNLOAD_LIMIT_COND: &str = "downloadLimitCondition";
pub const UPLOAD_LIMIT_COND: &str = "uploadLimitCondition";
pub const READ_PROXY_REQ: &str = "readProxyRequest";
pub const READ_REDIRECT_REQ: &str = "readRedirectRequest";
pub const EMPTY_READ_PROXY_LOC: &str = "emptyReadProxyLocaction";
pub const FAILED_READ_PROXY_REQ: &str = "failedReadProxyRequest";

// Error metric name constants.
pub const ERROR_SIZE_MISMATCH_OFFSET_SIZE: &str = "errorSizeMismatchOffsetSize";
pub const ERROR_SIZE_MISMATCH: &str = "errorSizeMismatch";
pub const ERROR_CRC: &str = "errorCRC";
pub const ERROR_INDEX_OUT_OF_RANGE: &str = "errorIndexOutOfRange";
pub const ERROR_GET_NOT_FOUND: &str = "errorGetNotFound";
pub const ERROR_GET_INTERNAL: &str = "errorGetInternal";
pub const ERROR_WRITE_TO_LOCAL_DISK: &str = "errorWriteToLocalDisk";
pub const ERROR_UNMARSHAL_PAIRS: &str = "errorUnmarshalPairs";
pub const ERROR_WRITE_TO_REPLICAS: &str = "errorWriteToReplicas";

// Go volume heartbeat metric label values.
pub const READ_ONLY_LABEL_IS_READ_ONLY: &str = "IsReadOnly";
pub const READ_ONLY_LABEL_NO_WRITE_OR_DELETE: &str = "noWriteOrDelete";
pub const READ_ONLY_LABEL_NO_WRITE_CAN_DELETE: &str = "noWriteCanDelete";
pub const READ_ONLY_LABEL_IS_DISK_SPACE_LOW: &str = "isDiskSpaceLow";
pub const DISK_SIZE_LABEL_NORMAL: &str = "normal";
pub const DISK_SIZE_LABEL_DELETED_BYTES: &str = "deleted_bytes";
pub const DISK_SIZE_LABEL_EC: &str = "ec";

static REGISTER_METRICS: Once = Once::new();

/// Register all metrics with the custom registry.
/// Call this once at startup.
pub fn register_metrics() {
    REGISTER_METRICS.call_once(|| {
        let metrics: Vec<Box<dyn prometheus::core::Collector>> = vec![
            // New Go-compatible metrics
            Box::new(REQUEST_COUNTER.clone()),
            Box::new(REQUEST_DURATION.clone()),
            Box::new(HANDLER_COUNTER.clone()),
            Box::new(VACUUMING_COMPACT_COUNTER.clone()),
            Box::new(VACUUMING_COMMIT_COUNTER.clone()),
            Box::new(VACUUMING_HISTOGRAM.clone()),
            Box::new(VOLUME_GAUGE.clone()),
            Box::new(READ_ONLY_VOLUME_GAUGE.clone()),
            Box::new(MAX_VOLUMES.clone()),
            Box::new(DISK_SIZE_GAUGE.clone()),
            Box::new(RESOURCE_GAUGE.clone()),
            Box::new(INFLIGHT_REQUESTS_GAUGE.clone()),
            Box::new(CONCURRENT_DOWNLOAD_LIMIT.clone()),
            Box::new(CONCURRENT_UPLOAD_LIMIT.clone()),
            Box::new(INFLIGHT_DOWNLOAD_SIZE.clone()),
            Box::new(INFLIGHT_UPLOAD_SIZE.clone()),
            // Legacy metrics
            Box::new(VOLUMES_TOTAL.clone()),
            Box::new(DISK_SIZE_BYTES.clone()),
            Box::new(DISK_FREE_BYTES.clone()),
            Box::new(INFLIGHT_REQUESTS.clone()),
            Box::new(VOLUME_FILE_COUNT.clone()),
            // Build info
            Box::new(BUILD_INFO.clone()),
        ];
        for m in metrics {
            REGISTRY.register(m).expect("metric registered");
        }

        // Set build info gauge to 1 with version/commit/sizelimit/os/arch labels (matches Go).
        BUILD_INFO
            .with_label_values(&[
                version::version(),
                version::commit(),
                version::size_limit(),
                std::env::consts::OS,
                std::env::consts::ARCH,
            ])
            .set(1.0);
    });
}

/// Gather all metrics and encode them in Prometheus text exposition format.
pub fn gather_metrics() -> String {
    let encoder = TextEncoder::new();
    let metric_families = REGISTRY.gather();
    let mut buffer = Vec::new();
    encoder
        .encode(&metric_families, &mut buffer)
        .expect("encoding metrics");
    String::from_utf8(buffer).expect("metrics are valid UTF-8")
}

pub fn delete_collection_metrics(collection: &str) {
    // Mirrors Go's DeletePartialMatch(prometheus.Labels{"collection": collection}),
    // which removes ALL metric entries matching the collection label, regardless
    // of other label values (like "type"). We gather the metric families to discover
    // all type values dynamically, matching Go's partial-match behavior.
    delete_partial_match_collection(&VOLUME_GAUGE, collection);
    delete_partial_match_collection(&READ_ONLY_VOLUME_GAUGE, collection);
    delete_partial_match_collection(&DISK_SIZE_GAUGE, collection);
}

/// Remove all metric entries from a GaugeVec where the "collection" label matches.
/// This emulates Go's `DeletePartialMatch(prometheus.Labels{"collection": collection})`.
fn delete_partial_match_collection(gauge: &GaugeVec, collection: &str) {
    use prometheus::core::Collector;
    let families = gauge.collect();
    for family in &families {
        for metric in family.get_metric() {
            let labels = metric.get_label();
            let mut matches_collection = false;
            let mut type_value = None;
            for label in labels {
                if label.get_name() == "collection" && label.get_value() == collection {
                    matches_collection = true;
                }
                if label.get_name() == "type" {
                    type_value = Some(label.get_value().to_string());
                }
            }
            if matches_collection {
                if let Some(ref tv) = type_value {
                    let _ = gauge.remove_label_values(&[collection, tv]);
                }
            }
        }
    }
}
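The Rust prometheus crate has no direct `DeletePartialMatch` equivalent, hence the gather-and-remove loop above. Its semantics can be modeled over plain label maps; a minimal sketch with hypothetical types, for illustration only:

```rust
use std::collections::BTreeMap;

/// A label set, e.g. {"collection": "pics", "type": "volume"}.
type LabelSet = BTreeMap<String, String>;

/// Remove every label set whose "collection" label equals `collection`,
/// regardless of the other labels -- the same semantics as Go's
/// DeletePartialMatch(prometheus.Labels{"collection": collection}).
/// Returns the number of removed series.
fn delete_partial_match(series: &mut Vec<LabelSet>, collection: &str) -> usize {
    let before = series.len();
    series.retain(|labels| labels.get("collection").map(String::as_str) != Some(collection));
    before - series.len()
}

/// Helper to build a label set from pairs.
fn labels(pairs: &[(&str, &str)]) -> LabelSet {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}
```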

pub fn build_pushgateway_url(address: &str, job: &str, instance: &str) -> String {
    let base = if address.starts_with("http://") || address.starts_with("https://") {
        address.to_string()
    } else {
        format!("http://{}", address)
    };
    let base = base.trim_end_matches('/');
    format!("{}/metrics/job/{}/instance/{}", base, job, instance)
}

pub async fn push_metrics_once(
    client: &reqwest::Client,
    address: &str,
    job: &str,
    instance: &str,
) -> Result<(), String> {
    let url = build_pushgateway_url(address, job, instance);
    let response = client
        .put(&url)
        .header(
            reqwest::header::CONTENT_TYPE,
            "text/plain; version=0.0.4; charset=utf-8",
        )
        .body(gather_metrics())
        .send()
        .await
        .map_err(|e| format!("push metrics request failed: {}", e))?;

    if response.status().is_success() {
        Ok(())
    } else {
        Err(format!(
            "push metrics failed with status {}",
            response.status()
        ))
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use axum::{routing::put, Router};
    use std::sync::{Arc, Mutex};

    #[test]
    fn test_gather_metrics_returns_text() {
        register_metrics();
        REQUEST_COUNTER.with_label_values(&["GET", "200"]).inc();
        let output = gather_metrics();
        assert!(output.contains("SeaweedFS_volumeServer_request_total"));
    }

    #[test]
    fn test_build_pushgateway_url() {
        assert_eq!(
            build_pushgateway_url("localhost:9091", "volumeServer", "test-instance"),
            "http://localhost:9091/metrics/job/volumeServer/instance/test-instance"
        );
        assert_eq!(
            build_pushgateway_url("https://push.example", "volumeServer", "node-a"),
            "https://push.example/metrics/job/volumeServer/instance/node-a"
        );
    }

    #[tokio::test]
    async fn test_push_metrics_once() {
        register_metrics();

        let captured = Arc::new(Mutex::new(None::<String>));
        let captured_clone = captured.clone();

        let app = Router::new().route(
            "/metrics/job/volumeServer/instance/test-instance",
            put(move |body: String| {
                let captured = captured_clone.clone();
                async move {
                    *captured.lock().unwrap() = Some(body);
                    "ok"
                }
            }),
        );

        let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
        let addr = listener.local_addr().unwrap();
        let server = tokio::spawn(async move {
            axum::serve(listener, app).await.unwrap();
        });

        let client = reqwest::Client::new();
        push_metrics_once(
            &client,
            &format!("127.0.0.1:{}", addr.port()),
            "volumeServer",
            "test-instance",
        )
        .await
        .unwrap();

        let body = captured.lock().unwrap().clone().unwrap();
        assert!(body.contains("SeaweedFS_volumeServer_request_total"));

        server.abort();
    }

    #[test]
    fn test_delete_collection_metrics_removes_collection_labelsets() {
        register_metrics();

        VOLUME_GAUGE.with_label_values(&["pics", "volume"]).set(2.0);
        VOLUME_GAUGE.with_label_values(&["pics", "ec_shards"]).set(3.0);
        READ_ONLY_VOLUME_GAUGE
            .with_label_values(&["pics", "volume"])
            .set(1.0);
        DISK_SIZE_GAUGE
            .with_label_values(&["pics", "normal"])
            .set(10.0);
        DISK_SIZE_GAUGE
            .with_label_values(&["pics", "deleted_bytes"])
            .set(4.0);

        delete_collection_metrics("pics");

        let output = gather_metrics();
        assert!(!output.contains("collection=\"pics\",type=\"volume\""));
        assert!(!output.contains("collection=\"pics\",type=\"ec_shards\""));
        assert!(!output.contains("collection=\"pics\",type=\"normal\""));
        assert!(!output.contains("collection=\"pics\",type=\"deleted_bytes\""));
    }
}

seaweed-volume/src/remote_storage/mod.rs (new file, +157 lines)
@@ -0,0 +1,157 @@
//! Remote storage backends for tiered storage support.
//!
//! Provides a trait-based abstraction over cloud storage providers (S3, GCS, Azure, etc.)
//! and a registry to create clients from protobuf RemoteConf messages.

pub mod s3;
pub mod s3_tier;

use crate::pb::remote_pb::{RemoteConf, RemoteStorageLocation};

/// Error type for remote storage operations.
#[derive(Debug, thiserror::Error)]
pub enum RemoteStorageError {
    #[error("remote storage type {0} not found")]
    TypeNotFound(String),
    #[error("remote object not found: {0}")]
    ObjectNotFound(String),
    #[error("remote storage error: {0}")]
    Other(String),
    #[error("io error: {0}")]
    Io(#[from] std::io::Error),
}

/// Metadata about a remote file entry.
#[derive(Debug, Clone)]
pub struct RemoteEntry {
    pub size: i64,
    pub last_modified_at: i64, // Unix seconds
    pub e_tag: String,
    pub storage_name: String,
}

/// Trait for remote storage clients. Matches Go's RemoteStorageClient interface.
#[async_trait::async_trait]
pub trait RemoteStorageClient: Send + Sync {
    /// Read (part of) a file from remote storage.
    async fn read_file(
        &self,
        loc: &RemoteStorageLocation,
        offset: i64,
        size: i64,
    ) -> Result<Vec<u8>, RemoteStorageError>;

    /// Write a file to remote storage.
    async fn write_file(
        &self,
        loc: &RemoteStorageLocation,
        data: &[u8],
    ) -> Result<RemoteEntry, RemoteStorageError>;

    /// Get metadata for a file in remote storage.
    async fn stat_file(
        &self,
        loc: &RemoteStorageLocation,
    ) -> Result<RemoteEntry, RemoteStorageError>;

    /// Delete a file from remote storage.
    async fn delete_file(&self, loc: &RemoteStorageLocation) -> Result<(), RemoteStorageError>;

    /// List all buckets.
    async fn list_buckets(&self) -> Result<Vec<String>, RemoteStorageError>;

    /// The RemoteConf used to create this client.
    fn remote_conf(&self) -> &RemoteConf;
}

/// Create a new remote storage client from a RemoteConf.
pub fn make_remote_storage_client(
    conf: &RemoteConf,
) -> Result<Box<dyn RemoteStorageClient>, RemoteStorageError> {
    match conf.r#type.as_str() {
        // All S3-compatible backends use the same client with different credentials
        "s3" | "wasabi" | "backblaze" | "aliyun" | "tencent" | "baidu" | "filebase" | "storj"
        | "contabo" => {
            let (access_key, secret_key, endpoint, region) = extract_s3_credentials(conf);
            Ok(Box::new(s3::S3RemoteStorageClient::new(
                conf.clone(),
                &access_key,
                &secret_key,
                &region,
                &endpoint,
                conf.s3_force_path_style,
            )))
        }
        other => Err(RemoteStorageError::TypeNotFound(other.to_string())),
    }
}

/// Extract S3-compatible credentials from a RemoteConf based on its type.
fn extract_s3_credentials(conf: &RemoteConf) -> (String, String, String, String) {
    match conf.r#type.as_str() {
        "s3" => (
            conf.s3_access_key.clone(),
            conf.s3_secret_key.clone(),
            conf.s3_endpoint.clone(),
            if conf.s3_region.is_empty() {
                "us-east-1".to_string()
            } else {
                conf.s3_region.clone()
            },
        ),
        "wasabi" => (
            conf.wasabi_access_key.clone(),
            conf.wasabi_secret_key.clone(),
            conf.wasabi_endpoint.clone(),
            conf.wasabi_region.clone(),
        ),
        "backblaze" => (
            conf.backblaze_key_id.clone(),
            conf.backblaze_application_key.clone(),
            conf.backblaze_endpoint.clone(),
            conf.backblaze_region.clone(),
        ),
        "aliyun" => (
            conf.aliyun_access_key.clone(),
            conf.aliyun_secret_key.clone(),
            conf.aliyun_endpoint.clone(),
            conf.aliyun_region.clone(),
        ),
        "tencent" => (
            conf.tencent_secret_id.clone(),
            conf.tencent_secret_key.clone(),
            conf.tencent_endpoint.clone(),
            String::new(),
        ),
        "baidu" => (
            conf.baidu_access_key.clone(),
            conf.baidu_secret_key.clone(),
            conf.baidu_endpoint.clone(),
            conf.baidu_region.clone(),
        ),
        "filebase" => (
            conf.filebase_access_key.clone(),
            conf.filebase_secret_key.clone(),
            conf.filebase_endpoint.clone(),
            String::new(),
        ),
        "storj" => (
            conf.storj_access_key.clone(),
            conf.storj_secret_key.clone(),
            conf.storj_endpoint.clone(),
            String::new(),
        ),
        "contabo" => (
            conf.contabo_access_key.clone(),
            conf.contabo_secret_key.clone(),
            conf.contabo_endpoint.clone(),
            conf.contabo_region.clone(),
        ),
        _ => (
            conf.s3_access_key.clone(),
            conf.s3_secret_key.clone(),
            conf.s3_endpoint.clone(),
            conf.s3_region.clone(),
        ),
    }
}
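The factory above returns a `Box<dyn RemoteStorageClient>` keyed on the config's `type` string. A minimal model of that registry pattern with a stub backend (all names here are hypothetical, for illustration only):

```rust
// Model of the trait-object registry: a config "type" string maps to a
// boxed trait object, or an error for unknown types.
trait StorageClient {
    fn kind(&self) -> &'static str;
}

// Stub standing in for S3RemoteStorageClient; all S3-compatible
// providers share this one implementation.
struct StubS3;

impl StorageClient for StubS3 {
    fn kind(&self) -> &'static str {
        "s3"
    }
}

fn make_client(type_name: &str) -> Result<Box<dyn StorageClient>, String> {
    match type_name {
        "s3" | "wasabi" | "backblaze" => Ok(Box::new(StubS3)),
        other => Err(format!("remote storage type {} not found", other)),
    }
}
```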

seaweed-volume/src/remote_storage/s3.rs (new file, +186 lines)
@@ -0,0 +1,186 @@
//! S3-compatible remote storage client.
//!
//! Works with AWS S3, MinIO, SeaweedFS S3, and all S3-compatible providers.

use aws_sdk_s3::config::{BehaviorVersion, Credentials, Region};
use aws_sdk_s3::primitives::ByteStream;
use aws_sdk_s3::Client;

use super::{RemoteEntry, RemoteStorageClient, RemoteStorageError};
use crate::pb::remote_pb::{RemoteConf, RemoteStorageLocation};

/// S3-compatible remote storage client.
pub struct S3RemoteStorageClient {
    client: Client,
    conf: RemoteConf,
}

impl S3RemoteStorageClient {
    /// Create a new S3 client from credentials and endpoint configuration.
    pub fn new(
        conf: RemoteConf,
        access_key: &str,
        secret_key: &str,
        region: &str,
        endpoint: &str,
        force_path_style: bool,
    ) -> Self {
        let region = if region.is_empty() {
            "us-east-1"
        } else {
            region
        };

        let credentials = Credentials::new(
            access_key,
            secret_key,
            None, // session token
            None, // expiry
            "seaweedfs-volume",
        );

        let mut s3_config = aws_sdk_s3::Config::builder()
            .behavior_version(BehaviorVersion::latest())
            .region(Region::new(region.to_string()))
            .credentials_provider(credentials)
            .force_path_style(force_path_style);

        if !endpoint.is_empty() {
            s3_config = s3_config.endpoint_url(endpoint);
        }

        let client = Client::from_conf(s3_config.build());

        S3RemoteStorageClient { client, conf }
    }
}

#[async_trait::async_trait]
impl RemoteStorageClient for S3RemoteStorageClient {
    async fn read_file(
        &self,
        loc: &RemoteStorageLocation,
        offset: i64,
        size: i64,
    ) -> Result<Vec<u8>, RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        let mut req = self.client.get_object().bucket(&loc.bucket).key(key);

        // Set byte range if specified
        if size > 0 {
            let end = offset + size - 1;
            req = req.range(format!("bytes={}-{}", offset, end));
        } else if offset > 0 {
            req = req.range(format!("bytes={}-", offset));
        }

        let resp = req.send().await.map_err(|e| {
            let msg = format!("{}", e);
            if msg.contains("NoSuchKey") || msg.contains("404") {
                RemoteStorageError::ObjectNotFound(format!("{}/{}", loc.bucket, key))
            } else {
                RemoteStorageError::Other(format!("s3 get object: {}", e))
            }
        })?;

        let data = resp
            .body
            .collect()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 read body: {}", e)))?;

        Ok(data.into_bytes().to_vec())
    }

    async fn write_file(
        &self,
        loc: &RemoteStorageLocation,
        data: &[u8],
    ) -> Result<RemoteEntry, RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        let resp = self
            .client
            .put_object()
            .bucket(&loc.bucket)
            .key(key)
            .body(ByteStream::from(data.to_vec()))
            .send()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 put object: {}", e)))?;

        Ok(RemoteEntry {
            size: data.len() as i64,
            last_modified_at: std::time::SystemTime::now()
                .duration_since(std::time::UNIX_EPOCH)
                .unwrap_or_default()
                .as_secs() as i64,
            e_tag: resp.e_tag().unwrap_or_default().to_string(),
            storage_name: loc.name.clone(),
        })
    }

    async fn stat_file(
        &self,
        loc: &RemoteStorageLocation,
    ) -> Result<RemoteEntry, RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        let resp = self
            .client
            .head_object()
            .bucket(&loc.bucket)
            .key(key)
            .send()
            .await
            .map_err(|e| {
                let msg = format!("{}", e);
                if msg.contains("404") || msg.contains("NotFound") {
                    RemoteStorageError::ObjectNotFound(format!("{}/{}", loc.bucket, key))
                } else {
                    RemoteStorageError::Other(format!("s3 head object: {}", e))
                }
            })?;

        Ok(RemoteEntry {
            size: resp.content_length().unwrap_or(0),
            last_modified_at: resp.last_modified().map(|t| t.secs()).unwrap_or(0),
            e_tag: resp.e_tag().unwrap_or_default().to_string(),
            storage_name: loc.name.clone(),
        })
    }

    async fn delete_file(&self, loc: &RemoteStorageLocation) -> Result<(), RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        self.client
            .delete_object()
            .bucket(&loc.bucket)
            .key(key)
            .send()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 delete object: {}", e)))?;

        Ok(())
    }

    async fn list_buckets(&self) -> Result<Vec<String>, RemoteStorageError> {
        let resp = self
            .client
            .list_buckets()
            .send()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 list buckets: {}", e)))?;

        Ok(resp
            .buckets()
            .iter()
            .filter_map(|b| b.name().map(String::from))
            .collect())
    }

    fn remote_conf(&self) -> &RemoteConf {
        &self.conf
    }
}
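`read_file` above maps `(offset, size)` onto an HTTP Range header: a closed range when the size is known, an open-ended range when only an offset is given, and no header for a full read. A standalone sketch of that mapping (illustrative only):

```rust
/// Compute the HTTP Range header value for a read at `offset` of `size`
/// bytes; None means "read the whole object".
fn range_header(offset: i64, size: i64) -> Option<String> {
    if size > 0 {
        // Closed range: byte positions are inclusive, hence the -1.
        Some(format!("bytes={}-{}", offset, offset + size - 1))
    } else if offset > 0 {
        // Open-ended range: from offset to end of object.
        Some(format!("bytes={}-", offset))
    } else {
        None
    }
}
```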

seaweed-volume/src/remote_storage/s3_tier.rs (new file, +514 lines)
@@ -0,0 +1,514 @@
//! S3-compatible tiered storage backend for volume .dat file upload/download.
//!
//! Provides multipart upload and concurrent download with progress callbacks,
//! matching the Go SeaweedFS S3 backend behavior.

use std::collections::HashMap;
use std::future::Future;
use std::sync::{Arc, OnceLock, RwLock};

use aws_sdk_s3::config::{BehaviorVersion, Credentials, Region};
use aws_sdk_s3::types::{CompletedMultipartUpload, CompletedPart};
use aws_sdk_s3::Client;
use tokio::io::{AsyncReadExt, AsyncSeekExt, AsyncWriteExt};
use tokio::sync::Semaphore;

/// Concurrency limit for multipart upload/download (matches Go's s3manager).
const CONCURRENCY: usize = 5;

/// Configuration for an S3 tier backend.
#[derive(Debug, Clone)]
pub struct S3TierConfig {
    pub access_key: String,
    pub secret_key: String,
    pub region: String,
    pub bucket: String,
    pub endpoint: String,
    pub storage_class: String,
    pub force_path_style: bool,
}

/// S3 tier backend for uploading/downloading volume .dat files.
pub struct S3TierBackend {
    client: Client,
    pub bucket: String,
    pub storage_class: String,
}

impl S3TierBackend {
    /// Create a new S3 tier backend from configuration.
    pub fn new(config: &S3TierConfig) -> Self {
        let region = if config.region.is_empty() {
            "us-east-1"
        } else {
            &config.region
        };

        let credentials = Credentials::new(
            &config.access_key,
            &config.secret_key,
            None,
            None,
            "seaweedfs-volume-tier",
        );

        let mut s3_config = aws_sdk_s3::Config::builder()
            .behavior_version(BehaviorVersion::latest())
            .region(Region::new(region.to_string()))
            .credentials_provider(credentials)
            .force_path_style(config.force_path_style);

        if !config.endpoint.is_empty() {
            s3_config = s3_config.endpoint_url(&config.endpoint);
        }

        let client = Client::from_conf(s3_config.build());

        S3TierBackend {
            client,
            bucket: config.bucket.clone(),
            storage_class: if config.storage_class.is_empty() {
                "STANDARD_IA".to_string()
            } else {
                config.storage_class.clone()
            },
        }
    }

    /// Upload a local file to S3 using multipart upload with concurrent parts
    /// and progress reporting.
    ///
    /// Returns (s3_key, file_size) on success.
    /// The progress callback receives (bytes_uploaded, percentage).
    /// Uses 64MB part size and 5 concurrent uploads (matches Go s3manager).
    pub async fn upload_file<F>(
        &self,
        file_path: &str,
        progress_fn: F,
    ) -> Result<(String, u64), String>
    where
        F: FnMut(i64, f32) + Send + Sync + 'static,
    {
        let key = uuid::Uuid::new_v4().to_string();

        let metadata = tokio::fs::metadata(file_path)
            .await
            .map_err(|e| format!("failed to stat file {}: {}", file_path, e))?;
        let file_size = metadata.len();

        // Calculate part size: start at 64MB, scale up for very large files (matches Go)
        let mut part_size: u64 = 64 * 1024 * 1024;
        while part_size * 1000 < file_size {
            part_size *= 4;
        }

        // Initiate multipart upload
        let create_resp = self
            .client
            .create_multipart_upload()
            .bucket(&self.bucket)
            .key(&key)
            .storage_class(
                self.storage_class
                    .parse()
                    .unwrap_or(aws_sdk_s3::types::StorageClass::StandardIa),
            )
            .send()
            .await
            .map_err(|e| format!("failed to create multipart upload: {}", e))?;

        let upload_id = create_resp
            .upload_id()
            .ok_or_else(|| "no upload_id in multipart upload response".to_string())?
            .to_string();

        // Build list of (part_number, offset, size) for all parts
        let mut parts_plan: Vec<(i32, u64, usize)> = Vec::new();
        let mut offset: u64 = 0;
        let mut part_number: i32 = 1;
        while offset < file_size {
            let remaining = file_size - offset;
            let this_part_size = std::cmp::min(part_size, remaining) as usize;
            parts_plan.push((part_number, offset, this_part_size));
            offset += this_part_size as u64;
            part_number += 1;
        }

        // Upload parts concurrently with a semaphore limiting to CONCURRENCY
        let semaphore = Arc::new(Semaphore::new(CONCURRENCY));
        let client = &self.client;
        let bucket = &self.bucket;
        let file_path_owned = file_path.to_string();
        let progress = Arc::new(std::sync::Mutex::new((0u64, progress_fn)));

        let mut handles = Vec::with_capacity(parts_plan.len());
        for (pn, off, size) in parts_plan {
            let sem = semaphore.clone();
            let client = client.clone();
            let bucket = bucket.clone();
            let key = key.clone();
            let upload_id = upload_id.clone();
            let fp = file_path_owned.clone();
            let progress = progress.clone();

            handles.push(tokio::spawn(async move {
                let _permit = sem
                    .acquire()
                    .await
                    .map_err(|e| format!("semaphore error: {}", e))?;

                // Read this part's data from the file at the correct offset
                let mut file = tokio::fs::File::open(&fp)
                    .await
                    .map_err(|e| format!("failed to open file {}: {}", fp, e))?;
                file.seek(std::io::SeekFrom::Start(off))
                    .await
                    .map_err(|e| format!("failed to seek to offset {}: {}", off, e))?;
                let mut buf = vec![0u8; size];
                file.read_exact(&mut buf)
                    .await
                    .map_err(|e| format!("failed to read file at offset {}: {}", off, e))?;

                let upload_part_resp = client
                    .upload_part()
                    .bucket(&bucket)
                    .key(&key)
                    .upload_id(&upload_id)
                    .part_number(pn)
                    .body(buf.into())
                    .send()
                    .await
                    .map_err(|e| {
                        format!("failed to upload part {} at offset {}: {}", pn, off, e)
                    })?;

                let e_tag = upload_part_resp.e_tag().unwrap_or_default().to_string();

                // Report progress
                {
                    let mut guard = progress.lock().unwrap();
                    guard.0 += size as u64;
                    let uploaded = guard.0;
                    let pct = if file_size > 0 {
                        (uploaded as f32 * 100.0) / file_size as f32
                    } else {
                        100.0
                    };
                    (guard.1)(uploaded as i64, pct);
                }

                Ok::<_, String>(
                    CompletedPart::builder()
                        .e_tag(e_tag)
                        .part_number(pn)
                        .build(),
                )
            }));
        }

        // Collect results, preserving part order
        let mut completed_parts = Vec::with_capacity(handles.len());
        for handle in handles {
            let part = handle
                .await
                .map_err(|e| format!("upload task panicked: {}", e))??;
            completed_parts.push(part);
        }

        // Complete multipart upload
        let completed_upload = CompletedMultipartUpload::builder()
            .set_parts(Some(completed_parts))
            .build();

        self.client
            .complete_multipart_upload()
            .bucket(&self.bucket)
            .key(&key)
            .upload_id(&upload_id)
            .multipart_upload(completed_upload)
            .send()
            .await
            .map_err(|e| format!("failed to complete multipart upload: {}", e))?;

        Ok((key, file_size))
    }
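The part-size rule above starts at 64MB and quadruples until the file fits in at most 1000 parts (comfortably under S3's 10,000-part cap). A standalone sketch of that rule (illustrative only):

```rust
/// Starting part size for multipart uploads (64MB).
const MIN_PART_SIZE: u64 = 64 * 1024 * 1024;

/// Scale the multipart part size so the file needs at most 1000 parts,
/// mirroring the loop in upload_file.
fn choose_part_size(file_size: u64) -> u64 {
    let mut part_size = MIN_PART_SIZE;
    while part_size * 1000 < file_size {
        part_size *= 4;
    }
    part_size
}
```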

    /// Download a file from S3 to a local path with concurrent range requests
    /// and progress reporting.
    ///
    /// Returns the file size on success.
    /// Uses 64MB part size and 5 concurrent downloads (matches Go s3manager).
    pub async fn download_file<F>(
        &self,
        dest_path: &str,
        key: &str,
        progress_fn: F,
    ) -> Result<u64, String>
    where
        F: FnMut(i64, f32) + Send + Sync + 'static,
    {
        // Get file size first
        let head_resp = self
            .client
            .head_object()
            .bucket(&self.bucket)
            .key(key)
            .send()
            .await
            .map_err(|e| format!("failed to head object {}: {}", key, e))?;

        let file_size = head_resp.content_length().unwrap_or(0) as u64;

        // Pre-allocate file to full size so concurrent WriteAt-style writes work
        {
            let file = tokio::fs::OpenOptions::new()
                .write(true)
                .create(true)
                .truncate(true)
                .open(dest_path)
                .await
                .map_err(|e| format!("failed to open dest file {}: {}", dest_path, e))?;
            file.set_len(file_size)
                .await
                .map_err(|e| format!("failed to set file length: {}", e))?;
        }

        let part_size: u64 = 64 * 1024 * 1024;

        // Build list of (offset, size) for all parts
        let mut parts_plan: Vec<(u64, u64)> = Vec::new();
        let mut offset: u64 = 0;
        while offset < file_size {
            let remaining = file_size - offset;
            let this_part_size = std::cmp::min(part_size, remaining);
            parts_plan.push((offset, this_part_size));
            offset += this_part_size;
        }

        // Download parts concurrently with a semaphore limiting to CONCURRENCY
        let semaphore = Arc::new(Semaphore::new(CONCURRENCY));
        let client = &self.client;
        let bucket = &self.bucket;
        let dest_path_owned = dest_path.to_string();
        let key_owned = key.to_string();
        let progress = Arc::new(std::sync::Mutex::new((0u64, progress_fn)));

        let mut handles = Vec::with_capacity(parts_plan.len());
        for (off, size) in parts_plan {
            let sem = semaphore.clone();
            let client = client.clone();
            let bucket = bucket.clone();
            let key = key_owned.clone();
            let dp = dest_path_owned.clone();
            let progress = progress.clone();

            handles.push(tokio::spawn(async move {
                let _permit = sem
                    .acquire()
                    .await
                    .map_err(|e| format!("semaphore error: {}", e))?;

                let end = off + size - 1;
                let range = format!("bytes={}-{}", off, end);

                let get_resp = client
                    .get_object()
                    .bucket(&bucket)
                    .key(&key)
                    .range(&range)
                    .send()
                    .await
                    .map_err(|e| format!("failed to get object {} range {}: {}", key, range, e))?;

                let body = get_resp
                    .body
                    .collect()
                    .await
                    .map_err(|e| format!("failed to read body: {}", e))?;
                let bytes = body.into_bytes();

                // Write at the correct offset (like Go's WriteAt)
                let mut file = tokio::fs::OpenOptions::new()
                    .write(true)
                    .open(&dp)
                    .await
                    .map_err(|e| format!("failed to open dest file {}: {}", dp, e))?;
                file.seek(std::io::SeekFrom::Start(off))
                    .await
                    .map_err(|e| format!("failed to seek to offset {}: {}", off, e))?;
                file.write_all(&bytes)
                    .await
                    .map_err(|e| format!("failed to write to {}: {}", dp, e))?;

                // Report progress
                {
                    let mut guard = progress.lock().unwrap();
                    guard.0 += bytes.len() as u64;
                    let downloaded = guard.0;
                    let pct = if file_size > 0 {
                        (downloaded as f32 * 100.0) / file_size as f32
                    } else {
                        100.0
                    };
                    (guard.1)(downloaded as i64, pct);
                }

                Ok::<_, String>(())
            }));
        }

        // Wait for all download tasks
        for handle in handles {
            handle
                .await
                .map_err(|e| format!("download task panicked: {}", e))??;
        }

        Ok(file_size)
    }
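Both upload and download build the same kind of (offset, size) plan by walking the file in fixed-size parts, with a short final part. A standalone sketch of the split (illustrative only):

```rust
/// Split `file_size` bytes into (offset, size) parts of at most `part_size`,
/// matching the planning loops in upload_file and download_file.
fn plan_parts(file_size: u64, part_size: u64) -> Vec<(u64, u64)> {
    let mut parts = Vec::new();
    let mut offset = 0;
    while offset < file_size {
        // The last part may be shorter than part_size.
        let size = part_size.min(file_size - offset);
        parts.push((offset, size));
        offset += size;
    }
    parts
}
```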

    pub async fn read_range(&self, key: &str, offset: u64, size: usize) -> Result<Vec<u8>, String> {
        let end = offset + (size as u64).saturating_sub(1);
        let range = format!("bytes={}-{}", offset, end);
        let resp = self
            .client
            .get_object()
            .bucket(&self.bucket)
            .key(key)
            .range(&range)
            .send()
            .await
            .map_err(|e| format!("failed to get object {} range {}: {}", key, range, e))?;

        let body = resp
            .body
            .collect()
            .await
            .map_err(|e| format!("failed to read object {} body: {}", key, e))?;
        Ok(body.into_bytes().to_vec())
    }

    /// Delete a file from S3.
    pub async fn delete_file(&self, key: &str) -> Result<(), String> {
        self.client
            .delete_object()
            .bucket(&self.bucket)
            .key(key)
            .send()
            .await
            .map_err(|e| format!("failed to delete object {}: {}", key, e))?;
        Ok(())
    }

    pub fn delete_file_blocking(&self, key: &str) -> Result<(), String> {
        let client = self.client.clone();
        let bucket = self.bucket.clone();
        let key = key.to_string();
        block_on_tier_future(async move {
            client
                .delete_object()
                .bucket(&bucket)
                .key(&key)
                .send()
                .await
                .map_err(|e| format!("failed to delete object {}: {}", key, e))?;
            Ok(())
        })
    }

    pub fn read_range_blocking(
        &self,
        key: &str,
        offset: u64,
        size: usize,
    ) -> Result<Vec<u8>, String> {
        let client = self.client.clone();
        let bucket = self.bucket.clone();
        let key = key.to_string();
        block_on_tier_future(async move {
            let end = offset + (size as u64).saturating_sub(1);
            let range = format!("bytes={}-{}", offset, end);
            let resp = client
                .get_object()
                .bucket(&bucket)
                .key(&key)
                .range(&range)
                .send()
                .await
                .map_err(|e| format!("failed to get object {} range {}: {}", key, range, e))?;
let body = resp
|
||||
.body
|
||||
.collect()
|
||||
.await
|
||||
.map_err(|e| format!("failed to read object {} body: {}", key, e))?;
|
||||
Ok(body.into_bytes().to_vec())
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
/// Parse a backend name like "s3" or "s3.default" into (backend_type, backend_id).
|
||||
/// Matches Go's `BackendNameToTypeId`.
|
||||
pub fn backend_name_to_type_id(backend_name: &str) -> (String, String) {
|
||||
let parts: Vec<&str> = backend_name.split('.').collect();
|
||||
match parts.len() {
|
||||
1 => (backend_name.to_string(), "default".to_string()),
|
||||
2 => (parts[0].to_string(), parts[1].to_string()),
|
||||
_ => (String::new(), String::new()),
|
||||
}
|
||||
}
|
||||
|
||||
/// A registry of configured S3 tier backends, keyed by backend name (e.g., "s3.default").
|
||||
#[derive(Default)]
|
||||
pub struct S3TierRegistry {
|
||||
backends: HashMap<String, Arc<S3TierBackend>>,
|
||||
}
|
||||
|
||||
impl S3TierRegistry {
|
||||
pub fn new() -> Self {
|
||||
Self {
|
||||
backends: HashMap::new(),
|
||||
}
|
||||
}
|
||||
|
||||
/// Register a backend with the given name.
|
||||
pub fn register(&mut self, name: String, backend: S3TierBackend) {
|
||||
self.backends.insert(name, Arc::new(backend));
|
||||
}
|
||||
|
||||
/// Look up a backend by name.
|
||||
pub fn get(&self, name: &str) -> Option<Arc<S3TierBackend>> {
|
||||
self.backends.get(name).cloned()
|
||||
}
|
||||
|
||||
/// List all registered backend names.
|
||||
pub fn names(&self) -> Vec<String> {
|
||||
self.backends.keys().cloned().collect()
|
||||
}
|
||||
|
||||
pub fn clear(&mut self) {
|
||||
self.backends.clear();
|
||||
}
|
||||
}
|
||||
|
||||
static GLOBAL_S3_TIER_REGISTRY: OnceLock<RwLock<S3TierRegistry>> = OnceLock::new();
|
||||
|
||||
pub fn global_s3_tier_registry() -> &'static RwLock<S3TierRegistry> {
|
||||
GLOBAL_S3_TIER_REGISTRY.get_or_init(|| RwLock::new(S3TierRegistry::new()))
|
||||
}
|
||||
|
||||
fn block_on_tier_future<F, T>(future: F) -> Result<T, String>
|
||||
where
|
||||
F: Future<Output = Result<T, String>> + Send + 'static,
|
||||
T: Send + 'static,
|
||||
{
|
||||
std::thread::spawn(move || {
|
||||
let runtime = tokio::runtime::Builder::new_current_thread()
|
||||
.enable_all()
|
||||
.build()
|
||||
.map_err(|e| format!("failed to build tokio runtime: {}", e))?;
|
||||
runtime.block_on(future)
|
||||
})
|
||||
.join()
|
||||
.map_err(|_| "tier runtime thread panicked".to_string())?
|
||||
}
|
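The `backend_name_to_type_id` naming rule above (a bare type gets the `"default"` id, `"type.id"` splits into its two parts, anything with more dots is rejected) can be exercised as a standalone sketch; the function body is reproduced here only for illustration:

```rust
// Standalone sketch of the backend-name parsing rule: one segment maps to
// (name, "default"), two segments map to (type, id), anything else to ("", "").
fn backend_name_to_type_id(backend_name: &str) -> (String, String) {
    let parts: Vec<&str> = backend_name.split('.').collect();
    match parts.len() {
        1 => (backend_name.to_string(), "default".to_string()),
        2 => (parts[0].to_string(), parts[1].to_string()),
        _ => (String::new(), String::new()),
    }
}

fn main() {
    assert_eq!(
        backend_name_to_type_id("s3"),
        ("s3".to_string(), "default".to_string())
    );
    assert_eq!(
        backend_name_to_type_id("s3.default"),
        ("s3".to_string(), "default".to_string())
    );
    // More than one dot is treated as malformed.
    assert_eq!(
        backend_name_to_type_id("s3.a.b"),
        (String::new(), String::new())
    );
}
```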
481
seaweed-volume/src/security.rs
Normal file
@@ -0,0 +1,481 @@
//! Security: JWT validation and IP whitelist checking.
//!
//! Matches Go's security/guard.go and security/jwt.go.
//! - Guard: combines whitelist IP checking with JWT token validation
//! - JWT: HS256 HMAC signing with file-id claims

pub mod tls;

use std::collections::HashSet;
use std::net::IpAddr;
use std::time::{SystemTime, UNIX_EPOCH};

use jsonwebtoken::{decode, encode, Algorithm, DecodingKey, EncodingKey, Header, Validation};
use serde::{Deserialize, Serialize};

// ============================================================================
// JWT Claims
// ============================================================================

/// Claims for volume server file access tokens.
/// Matches Go's `SeaweedFileIdClaims`.
#[derive(Debug, Serialize, Deserialize)]
pub struct FileIdClaims {
    /// File ID this token grants access to (e.g., "3,01637037d6").
    #[serde(skip_serializing_if = "Option::is_none")]
    pub fid: Option<String>,

    /// Expiration time (Unix timestamp).
    #[serde(skip_serializing_if = "Option::is_none")]
    pub exp: Option<u64>,

    /// Not before (Unix timestamp).
    #[serde(skip_serializing_if = "Option::is_none")]
    pub nbf: Option<u64>,
}

/// Signing key wrapper (empty = security disabled).
#[derive(Clone)]
pub struct SigningKey(pub Vec<u8>);

impl SigningKey {
    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }

    pub fn from_string(s: &str) -> Self {
        SigningKey(s.as_bytes().to_vec())
    }
}

/// Generate a JWT token for file access.
pub fn gen_jwt(
    signing_key: &SigningKey,
    expires_after_sec: i64,
    file_id: &str,
) -> Result<String, JwtError> {
    if signing_key.is_empty() {
        return Err(JwtError::NoSigningKey);
    }

    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs();

    let claims = FileIdClaims {
        fid: Some(file_id.to_string()),
        exp: if expires_after_sec > 0 {
            Some(now + expires_after_sec as u64)
        } else {
            None
        },
        nbf: None,
    };

    let token = encode(
        &Header::new(Algorithm::HS256),
        &claims,
        &EncodingKey::from_secret(&signing_key.0),
    )?;

    Ok(token)
}

/// Decode and validate a JWT token.
pub fn decode_jwt(signing_key: &SigningKey, token: &str) -> Result<FileIdClaims, JwtError> {
    if signing_key.is_empty() {
        return Err(JwtError::NoSigningKey);
    }

    let mut validation = Validation::new(Algorithm::HS256);
    // Match Go behavior: tokens without exp are accepted (Go's jwt-go does not require exp),
    // but if exp IS present, it must be valid (not expired).
    validation.required_spec_claims.clear();
    validation.validate_exp = true;
    // Go's jwt-go/v5 validates nbf when present
    validation.validate_nbf = true;
    validation.leeway = 0;

    let data = decode::<FileIdClaims>(
        token,
        &DecodingKey::from_secret(&signing_key.0),
        &validation,
    )?;

    Ok(data.claims)
}

// ============================================================================
// Guard
// ============================================================================

/// Security guard: IP whitelist + JWT token validation.
pub struct Guard {
    whitelist_ips: HashSet<String>,
    whitelist_cidrs: Vec<(IpAddr, u8)>, // (network, prefix_len)
    pub signing_key: SigningKey,
    pub expires_after_sec: i64,
    pub read_signing_key: SigningKey,
    pub read_expires_after_sec: i64,
    /// Combined flag: true when whitelist is non-empty OR signing key is present.
    /// Matches Go's `isWriteActive = !isEmptyWhiteList || len(SigningKey) != 0`.
    is_write_active: bool,
}

impl Guard {
    pub fn new(
        whitelist: &[String],
        signing_key: SigningKey,
        expires_after_sec: i64,
        read_signing_key: SigningKey,
        read_expires_after_sec: i64,
    ) -> Self {
        let mut guard = Guard {
            whitelist_ips: HashSet::new(),
            whitelist_cidrs: Vec::new(),
            signing_key,
            expires_after_sec,
            read_signing_key,
            read_expires_after_sec,
            is_write_active: false,
        };
        guard.update_whitelist(whitelist);
        guard
    }

    /// Update the IP whitelist.
    pub fn update_whitelist(&mut self, entries: &[String]) {
        self.whitelist_ips.clear();
        self.whitelist_cidrs.clear();

        for entry in entries {
            let entry = entry.trim();
            if entry.is_empty() {
                continue;
            }
            if entry.contains('/') {
                // CIDR range
                if let Some((ip, prefix)) = parse_cidr(entry) {
                    self.whitelist_cidrs.push((ip, prefix));
                } else {
                    tracing::error!("Parse CIDR {} in whitelist failed", entry);
                }
            } else {
                // Exact IP/hostname
                self.whitelist_ips.insert(entry.to_string());
            }
        }

        // Match Go: isWriteActive = !isEmptyWhiteList || len(SigningKey) != 0
        let is_empty_whitelist = self.whitelist_ips.is_empty() && self.whitelist_cidrs.is_empty();
        self.is_write_active = !is_empty_whitelist || !self.signing_key.is_empty();
    }

    /// Check if a remote IP is in the whitelist.
    /// Returns true if write security is inactive (no whitelist and no signing key),
    /// if the whitelist is empty, or if the IP matches.
    pub fn check_whitelist(&self, remote_addr: &str) -> bool {
        if !self.is_write_active {
            return true;
        }
        if self.whitelist_ips.is_empty() && self.whitelist_cidrs.is_empty() {
            return true;
        }

        let host = extract_host(remote_addr);

        // Check exact match
        if self.whitelist_ips.contains(&host) {
            return true;
        }

        // Check CIDR ranges
        if let Ok(ip) = host.parse::<IpAddr>() {
            for &(ref network, prefix_len) in &self.whitelist_cidrs {
                if ip_in_cidr(&ip, network, prefix_len) {
                    return true;
                }
            }
        }

        false
    }

    /// Check if a read signing key is configured.
    pub fn has_read_signing_key(&self) -> bool {
        !self.read_signing_key.is_empty()
    }

    /// Validate a request's JWT token.
    /// `is_write` determines which signing key to use.
    /// Returns Ok(()) if valid, or if security is disabled.
    pub fn check_jwt(&self, token: Option<&str>, is_write: bool) -> Result<(), JwtError> {
        let key = if is_write {
            &self.signing_key
        } else {
            &self.read_signing_key
        };

        if key.is_empty() {
            return Ok(()); // Security disabled for this operation type
        }

        let token = token.ok_or(JwtError::MissingToken)?;
        decode_jwt(key, token)?;
        Ok(())
    }

    /// Check JWT and validate the file ID claim matches.
    pub fn check_jwt_for_file(
        &self,
        token: Option<&str>,
        expected_fid: &str,
        is_write: bool,
    ) -> Result<(), JwtError> {
        let key = if is_write {
            &self.signing_key
        } else {
            &self.read_signing_key
        };

        if key.is_empty() {
            return Ok(());
        }

        let token = token.ok_or(JwtError::MissingToken)?;
        let claims = decode_jwt(key, token)?;

        match claims.fid {
            None => {
                return Err(JwtError::MissingFileIdClaim);
            }
            Some(ref fid) if fid != expected_fid => {
                return Err(JwtError::FileIdMismatch {
                    expected: expected_fid.to_string(),
                    got: fid.to_string(),
                });
            }
            _ => {}
        }

        Ok(())
    }
}

// ============================================================================
// Helpers
// ============================================================================

/// Extract host from "host:port" or "[::1]:port" format.
fn extract_host(addr: &str) -> String {
    // Handle IPv6 with brackets
    if addr.starts_with('[') {
        if let Some(end) = addr.find(']') {
            return addr[1..end].to_string();
        }
    }
    // Handle host:port
    if let Some(pos) = addr.rfind(':') {
        return addr[..pos].to_string();
    }
    addr.to_string()
}

/// Parse CIDR notation "192.168.1.0/24" into (IpAddr, prefix_len).
fn parse_cidr(cidr: &str) -> Option<(IpAddr, u8)> {
    let parts: Vec<&str> = cidr.split('/').collect();
    if parts.len() != 2 {
        return None;
    }
    let ip: IpAddr = parts[0].parse().ok()?;
    let prefix: u8 = parts[1].parse().ok()?;
    Some((ip, prefix))
}

/// Check if an IP is within a CIDR range.
fn ip_in_cidr(ip: &IpAddr, network: &IpAddr, prefix_len: u8) -> bool {
    match (ip, network) {
        (IpAddr::V4(ip), IpAddr::V4(net)) => {
            let ip_bits = u32::from(*ip);
            let net_bits = u32::from(*net);
            let mask = if prefix_len == 0 {
                0
            } else if prefix_len >= 32 {
                u32::MAX
            } else {
                u32::MAX << (32 - prefix_len)
            };
            (ip_bits & mask) == (net_bits & mask)
        }
        (IpAddr::V6(ip), IpAddr::V6(net)) => {
            let ip_bits = u128::from(*ip);
            let net_bits = u128::from(*net);
            let mask = if prefix_len == 0 {
                0
            } else if prefix_len >= 128 {
                u128::MAX
            } else {
                u128::MAX << (128 - prefix_len)
            };
            (ip_bits & mask) == (net_bits & mask)
        }
        _ => false, // V4/V6 mismatch
    }
}

// ============================================================================
// Errors
// ============================================================================

#[derive(Debug, thiserror::Error)]
pub enum JwtError {
    #[error("no signing key configured")]
    NoSigningKey,

    #[error("missing JWT token")]
    MissingToken,

    #[error("JWT error: {0}")]
    Jwt(#[from] jsonwebtoken::errors::Error),

    #[error("JWT token missing required fid claim")]
    MissingFileIdClaim,

    #[error("file ID mismatch: expected {expected}, got {got}")]
    FileIdMismatch { expected: String, got: String },
}

// ============================================================================
// Tests
// ============================================================================

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_jwt_round_trip() {
        let key = SigningKey::from_string("test-secret-key");
        let token = gen_jwt(&key, 3600, "3,01637037d6").unwrap();
        let claims = decode_jwt(&key, &token).unwrap();
        assert_eq!(claims.fid, Some("3,01637037d6".to_string()));
    }

    #[test]
    fn test_jwt_no_signing_key() {
        let key = SigningKey(vec![]);
        assert!(gen_jwt(&key, 3600, "1,abc").is_err());
    }

    #[test]
    fn test_jwt_invalid_token() {
        let key = SigningKey::from_string("secret");
        let result = decode_jwt(&key, "invalid.token.here");
        assert!(result.is_err());
    }

    #[test]
    fn test_jwt_wrong_key() {
        let key1 = SigningKey::from_string("secret1");
        let key2 = SigningKey::from_string("secret2");
        let token = gen_jwt(&key1, 3600, "1,abc").unwrap();
        assert!(decode_jwt(&key2, &token).is_err());
    }

    #[test]
    fn test_guard_empty_whitelist() {
        let guard = Guard::new(&[], SigningKey(vec![]), 0, SigningKey(vec![]), 0);
        assert!(guard.check_whitelist("192.168.1.1:8080"));
    }

    #[test]
    fn test_guard_whitelist_exact() {
        let guard = Guard::new(
            &["192.168.1.1".to_string(), "10.0.0.1".to_string()],
            SigningKey(vec![]),
            0,
            SigningKey(vec![]),
            0,
        );
        assert!(guard.check_whitelist("192.168.1.1:8080"));
        assert!(guard.check_whitelist("10.0.0.1:1234"));
        assert!(!guard.check_whitelist("172.16.0.1:8080"));
    }

    #[test]
    fn test_guard_whitelist_cidr() {
        let guard = Guard::new(
            &["10.0.0.0/8".to_string()],
            SigningKey(vec![]),
            0,
            SigningKey(vec![]),
            0,
        );
        assert!(guard.check_whitelist("10.1.2.3:8080"));
        assert!(guard.check_whitelist("10.255.255.255:80"));
        assert!(!guard.check_whitelist("11.0.0.1:80"));
    }

    #[test]
    fn test_guard_check_jwt_disabled() {
        let guard = Guard::new(&[], SigningKey(vec![]), 0, SigningKey(vec![]), 0);
        // No signing key = security disabled
        assert!(guard.check_jwt(None, true).is_ok());
        assert!(guard.check_jwt(None, false).is_ok());
    }

    #[test]
    fn test_guard_check_jwt_enabled() {
        let key = SigningKey::from_string("write-secret");
        let read_key = SigningKey::from_string("read-secret");
        let guard = Guard::new(&[], key.clone(), 3600, read_key.clone(), 3600);

        // Missing token
        assert!(guard.check_jwt(None, true).is_err());

        // Valid write token
        let token = gen_jwt(&key, 3600, "1,abc").unwrap();
        assert!(guard.check_jwt(Some(&token), true).is_ok());

        // Write token for read should fail (different key)
        assert!(guard.check_jwt(Some(&token), false).is_err());

        // Valid read token
        let read_token = gen_jwt(&read_key, 3600, "1,abc").unwrap();
        assert!(guard.check_jwt(Some(&read_token), false).is_ok());
    }

    #[test]
    fn test_guard_check_jwt_file_id() {
        let key = SigningKey::from_string("secret");
        let guard = Guard::new(&[], key.clone(), 3600, SigningKey(vec![]), 0);

        let token = gen_jwt(&key, 3600, "3,01637037d6").unwrap();

        // Correct file ID
        assert!(guard
            .check_jwt_for_file(Some(&token), "3,01637037d6", true)
            .is_ok());

        // Wrong file ID
        let err = guard.check_jwt_for_file(Some(&token), "4,deadbeef", true);
        assert!(matches!(err, Err(JwtError::FileIdMismatch { .. })));
    }

    #[test]
    fn test_extract_host() {
        assert_eq!(extract_host("192.168.1.1:8080"), "192.168.1.1");
        assert_eq!(extract_host("[::1]:8080"), "::1");
        assert_eq!(extract_host("localhost"), "localhost");
    }

    #[test]
    fn test_ip_in_cidr() {
        let net: IpAddr = "10.0.0.0".parse().unwrap();
        let ip1: IpAddr = "10.1.2.3".parse().unwrap();
        let ip2: IpAddr = "11.0.0.1".parse().unwrap();
        assert!(ip_in_cidr(&ip1, &net, 8));
        assert!(!ip_in_cidr(&ip2, &net, 8));
    }
}
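The masking logic inside `ip_in_cidr` above can be illustrated with a self-contained IPv4-only sketch (`ipv4_in_cidr` is a hypothetical standalone helper, not part of the crate):

```rust
use std::net::Ipv4Addr;

// Minimal sketch of the IPv4 branch of ip_in_cidr: build a prefix mask,
// then compare the masked address bits. prefix_len 0 matches everything;
// prefix_len >= 32 requires an exact address match. The explicit branches
// avoid the undefined-behavior shift `u32::MAX << 32`.
fn ipv4_in_cidr(ip: Ipv4Addr, net: Ipv4Addr, prefix_len: u8) -> bool {
    let mask = match prefix_len {
        0 => 0,
        p if p >= 32 => u32::MAX,
        p => u32::MAX << (32 - p),
    };
    (u32::from(ip) & mask) == (u32::from(net) & mask)
}

fn main() {
    let net: Ipv4Addr = "10.0.0.0".parse().unwrap();
    assert!(ipv4_in_cidr("10.1.2.3".parse().unwrap(), net, 8));
    assert!(!ipv4_in_cidr("11.0.0.1".parse().unwrap(), net, 8));
    // A /0 network matches any address.
    assert!(ipv4_in_cidr(
        "192.168.5.5".parse().unwrap(),
        "0.0.0.0".parse().unwrap(),
        0
    ));
}
```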
437
seaweed-volume/src/security/tls.rs
Normal file
@@ -0,0 +1,437 @@
use std::collections::HashSet;
use std::fmt;
use std::sync::Arc;

use rustls::client::danger::HandshakeSignatureValid;
use rustls::crypto::aws_lc_rs;
use rustls::crypto::CryptoProvider;
use rustls::pki_types::UnixTime;
use rustls::pki_types::{CertificateDer, PrivateKeyDer};
use rustls::server::danger::{ClientCertVerified, ClientCertVerifier};
use rustls::server::WebPkiClientVerifier;
use rustls::{
    CipherSuite, DigitallySignedStruct, DistinguishedName, RootCertStore, ServerConfig,
    SignatureScheme, SupportedCipherSuite, SupportedProtocolVersion,
};
use x509_parser::prelude::{FromDer, X509Certificate};

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct TlsPolicy {
    pub min_version: String,
    pub max_version: String,
    pub cipher_suites: String,
}

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct GrpcClientAuthPolicy {
    pub allowed_common_names: Vec<String>,
    pub allowed_wildcard_domain: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TlsPolicyError(String);

impl fmt::Display for TlsPolicyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl std::error::Error for TlsPolicyError {}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum GoTlsVersion {
    Ssl3,
    Tls10,
    Tls11,
    Tls12,
    Tls13,
}

#[derive(Debug)]
struct CommonNameVerifier {
    inner: Arc<dyn ClientCertVerifier>,
    allowed_common_names: HashSet<String>,
    allowed_wildcard_domain: String,
}

impl ClientCertVerifier for CommonNameVerifier {
    fn offer_client_auth(&self) -> bool {
        self.inner.offer_client_auth()
    }

    fn client_auth_mandatory(&self) -> bool {
        self.inner.client_auth_mandatory()
    }

    fn root_hint_subjects(&self) -> &[DistinguishedName] {
        self.inner.root_hint_subjects()
    }

    fn verify_client_cert(
        &self,
        end_entity: &CertificateDer<'_>,
        intermediates: &[CertificateDer<'_>],
        now: UnixTime,
    ) -> Result<ClientCertVerified, rustls::Error> {
        self.inner
            .verify_client_cert(end_entity, intermediates, now)?;
        let common_name = parse_common_name(end_entity).map_err(|e| {
            rustls::Error::General(format!(
                "parse client certificate common name failed: {}",
                e
            ))
        })?;
        if common_name_is_allowed(
            &common_name,
            &self.allowed_common_names,
            &self.allowed_wildcard_domain,
        ) {
            return Ok(ClientCertVerified::assertion());
        }
        Err(rustls::Error::General(format!(
            "Authenticate: invalid subject client common name: {}",
            common_name
        )))
    }

    fn verify_tls12_signature(
        &self,
        message: &[u8],
        cert: &CertificateDer<'_>,
        dss: &DigitallySignedStruct,
    ) -> Result<HandshakeSignatureValid, rustls::Error> {
        self.inner.verify_tls12_signature(message, cert, dss)
    }

    fn verify_tls13_signature(
        &self,
        message: &[u8],
        cert: &CertificateDer<'_>,
        dss: &DigitallySignedStruct,
    ) -> Result<HandshakeSignatureValid, rustls::Error> {
        self.inner.verify_tls13_signature(message, cert, dss)
    }

    fn supported_verify_schemes(&self) -> Vec<SignatureScheme> {
        self.inner.supported_verify_schemes()
    }
}

pub fn build_rustls_server_config(
    cert_path: &str,
    key_path: &str,
    ca_path: &str,
    policy: &TlsPolicy,
) -> Result<ServerConfig, TlsPolicyError> {
    build_rustls_server_config_with_client_auth(cert_path, key_path, ca_path, policy, None)
}

pub fn build_rustls_server_config_with_grpc_client_auth(
    cert_path: &str,
    key_path: &str,
    ca_path: &str,
    policy: &TlsPolicy,
    client_auth_policy: &GrpcClientAuthPolicy,
) -> Result<ServerConfig, TlsPolicyError> {
    build_rustls_server_config_with_client_auth(
        cert_path,
        key_path,
        ca_path,
        policy,
        Some(client_auth_policy),
    )
}

fn build_rustls_server_config_with_client_auth(
    cert_path: &str,
    key_path: &str,
    ca_path: &str,
    policy: &TlsPolicy,
    client_auth_policy: Option<&GrpcClientAuthPolicy>,
) -> Result<ServerConfig, TlsPolicyError> {
    let cert_chain = read_cert_chain(cert_path)?;
    let private_key = read_private_key(key_path)?;
    let provider = build_crypto_provider(policy)?;
    let versions = build_supported_versions(policy)?;

    let builder = ServerConfig::builder_with_provider(provider.clone())
        .with_protocol_versions(&versions)
        .map_err(|e| TlsPolicyError(format!("invalid TLS version policy: {}", e)))?;

    let builder = if ca_path.is_empty() {
        builder.with_no_client_auth()
    } else {
        let roots = read_root_store(ca_path)?;
        let verifier =
            WebPkiClientVerifier::builder_with_provider(Arc::new(roots), provider.clone())
                .build()
                .map_err(|e| TlsPolicyError(format!("build client verifier failed: {}", e)))?;
        let verifier: Arc<dyn ClientCertVerifier> = if let Some(client_auth_policy) =
            client_auth_policy.filter(|policy| {
                !policy.allowed_common_names.is_empty()
                    || !policy.allowed_wildcard_domain.is_empty()
            }) {
            Arc::new(CommonNameVerifier {
                inner: verifier,
                allowed_common_names: client_auth_policy
                    .allowed_common_names
                    .iter()
                    .cloned()
                    .collect(),
                allowed_wildcard_domain: client_auth_policy.allowed_wildcard_domain.clone(),
            })
        } else {
            verifier
        };
        builder.with_client_cert_verifier(verifier)
    };

    builder
        .with_single_cert(cert_chain, private_key)
        .map_err(|e| TlsPolicyError(format!("build rustls server config failed: {}", e)))
}

fn read_cert_chain(cert_path: &str) -> Result<Vec<CertificateDer<'static>>, TlsPolicyError> {
    let cert_pem = std::fs::read(cert_path).map_err(|e| {
        TlsPolicyError(format!(
            "Failed to read TLS cert file '{}': {}",
            cert_path, e
        ))
    })?;
    rustls_pemfile::certs(&mut &cert_pem[..])
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| {
            TlsPolicyError(format!(
                "Failed to parse TLS cert PEM '{}': {}",
                cert_path, e
            ))
        })
}

fn read_private_key(key_path: &str) -> Result<PrivateKeyDer<'static>, TlsPolicyError> {
    let key_pem = std::fs::read(key_path).map_err(|e| {
        TlsPolicyError(format!("Failed to read TLS key file '{}': {}", key_path, e))
    })?;
    rustls_pemfile::private_key(&mut &key_pem[..])
        .map_err(|e| TlsPolicyError(format!("Failed to parse TLS key PEM '{}': {}", key_path, e)))?
        .ok_or_else(|| TlsPolicyError(format!("No private key found in '{}'", key_path)))
}

fn read_root_store(ca_path: &str) -> Result<RootCertStore, TlsPolicyError> {
    let ca_pem = std::fs::read(ca_path)
        .map_err(|e| TlsPolicyError(format!("Failed to read TLS CA file '{}': {}", ca_path, e)))?;
    let ca_certs = rustls_pemfile::certs(&mut &ca_pem[..])
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| TlsPolicyError(format!("Failed to parse TLS CA PEM '{}': {}", ca_path, e)))?;
    let mut roots = RootCertStore::empty();
    for cert in ca_certs {
        roots
            .add(cert)
            .map_err(|e| TlsPolicyError(format!("Failed to add CA cert '{}': {}", ca_path, e)))?;
    }
    Ok(roots)
}

fn build_crypto_provider(policy: &TlsPolicy) -> Result<Arc<CryptoProvider>, TlsPolicyError> {
    let mut provider = aws_lc_rs::default_provider();
    let cipher_suites = parse_cipher_suites(&provider.cipher_suites, &policy.cipher_suites)?;
    if !cipher_suites.is_empty() {
        provider.cipher_suites = cipher_suites;
    }
    Ok(Arc::new(provider))
}

pub fn build_supported_versions(
    policy: &TlsPolicy,
) -> Result<Vec<&'static SupportedProtocolVersion>, TlsPolicyError> {
    let min_version = parse_go_tls_version(&policy.min_version)?;
    let max_version = parse_go_tls_version(&policy.max_version)?;
    let versions = [&rustls::version::TLS13, &rustls::version::TLS12]
        .into_iter()
        .filter(|version| {
            let current = go_tls_version_for_supported(version);
            min_version.map(|min| current >= min).unwrap_or(true)
                && max_version.map(|max| current <= max).unwrap_or(true)
        })
        .collect::<Vec<_>>();

    if versions.is_empty() {
        return Err(TlsPolicyError(format!(
            "TLS version range min='{}' max='{}' is unsupported by rustls",
            policy.min_version, policy.max_version
        )));
    }

    Ok(versions)
}
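The version-range filter in `build_supported_versions` can be sketched with standalone types (the `Ver` enum and `filter_versions` helper here are hypothetical, standing in for `GoTlsVersion` and the rustls version constants): candidates are kept when they fall inside [min, max], and an unset bound is treated as unbounded.

```rust
// Hypothetical stand-in for GoTlsVersion; derive(Ord) gives Tls12 < Tls13.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Ver {
    Tls12,
    Tls13,
}

// Keep each candidate version iff it satisfies both optional bounds.
fn filter_versions(min: Option<Ver>, max: Option<Ver>) -> Vec<Ver> {
    [Ver::Tls13, Ver::Tls12]
        .into_iter()
        .filter(|&v| {
            min.map(|m| v >= m).unwrap_or(true) && max.map(|m| v <= m).unwrap_or(true)
        })
        .collect()
}

fn main() {
    // No bounds: both versions survive, TLS 1.3 listed first.
    assert_eq!(filter_versions(None, None), vec![Ver::Tls13, Ver::Tls12]);
    // min = max = TLS 1.3 narrows the list to TLS 1.3 only.
    assert_eq!(
        filter_versions(Some(Ver::Tls13), Some(Ver::Tls13)),
        vec![Ver::Tls13]
    );
    // max = TLS 1.2 drops TLS 1.3; an empty result would become a
    // TlsPolicyError in the real function.
    assert_eq!(filter_versions(None, Some(Ver::Tls12)), vec![Ver::Tls12]);
}
```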
fn parse_go_tls_version(value: &str) -> Result<Option<GoTlsVersion>, TlsPolicyError> {
    match value.trim() {
        "" => Ok(None),
        "SSLv3" => Ok(Some(GoTlsVersion::Ssl3)),
        "TLS 1.0" => Ok(Some(GoTlsVersion::Tls10)),
        "TLS 1.1" => Ok(Some(GoTlsVersion::Tls11)),
        "TLS 1.2" => Ok(Some(GoTlsVersion::Tls12)),
        "TLS 1.3" => Ok(Some(GoTlsVersion::Tls13)),
        other => Err(TlsPolicyError(format!("invalid TLS version {}", other))),
    }
}

fn parse_cipher_suites(
    available: &[SupportedCipherSuite],
    value: &str,
) -> Result<Vec<SupportedCipherSuite>, TlsPolicyError> {
    let trimmed = value.trim();
    if trimmed.is_empty() {
        return Ok(Vec::new());
    }

    trimmed
        .split(',')
        .map(|name| {
            let suite = parse_cipher_suite_name(name.trim())?;
            available
                .iter()
                .copied()
                .find(|candidate| candidate.suite() == suite)
                .ok_or_else(|| {
                    TlsPolicyError(format!(
                        "TLS cipher suite '{}' is unsupported by the Rust implementation",
                        name.trim()
                    ))
                })
        })
        .collect()
}

fn parse_cipher_suite_name(value: &str) -> Result<CipherSuite, TlsPolicyError> {
    match value {
        "TLS_AES_128_GCM_SHA256" => Ok(CipherSuite::TLS13_AES_128_GCM_SHA256),
        "TLS_AES_256_GCM_SHA384" => Ok(CipherSuite::TLS13_AES_256_GCM_SHA384),
        "TLS_CHACHA20_POLY1305_SHA256" => Ok(CipherSuite::TLS13_CHACHA20_POLY1305_SHA256),
        "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256)
        }
        "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384" => {
            Ok(CipherSuite::TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384)
        }
        "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256)
        }
        "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256)
        }
        "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384" => {
            Ok(CipherSuite::TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
        }
        "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256)
        }
        other => Err(TlsPolicyError(format!(
            "TLS cipher suite '{}' is unsupported by the Rust implementation",
            other
        ))),
    }
}

fn parse_common_name(cert: &CertificateDer<'_>) -> Result<String, TlsPolicyError> {
    let (_, certificate) = X509Certificate::from_der(cert.as_ref())
        .map_err(|e| TlsPolicyError(format!("parse X.509 certificate failed: {}", e)))?;
    let common_name = certificate
        .subject()
        .iter_common_name()
        .next()
        .and_then(|common_name| common_name.as_str().ok())
        .map(str::to_string);
    match common_name {
        Some(common_name) => Ok(common_name),
        None => Ok(String::new()),
    }
}

fn common_name_is_allowed(
    common_name: &str,
    allowed_common_names: &HashSet<String>,
    allowed_wildcard_domain: &str,
) -> bool {
    (!allowed_wildcard_domain.is_empty() && common_name.ends_with(allowed_wildcard_domain))
        || allowed_common_names.contains(common_name)
}

fn go_tls_version_for_supported(version: &SupportedProtocolVersion) -> GoTlsVersion {
    match version.version {
        rustls::ProtocolVersion::TLSv1_2 => GoTlsVersion::Tls12,
        rustls::ProtocolVersion::TLSv1_3 => GoTlsVersion::Tls13,
        _ => unreachable!("rustls only exposes TLS 1.2 and 1.3"),
    }
}

#[cfg(test)]
mod tests {
    use super::{build_supported_versions, common_name_is_allowed, parse_cipher_suites, TlsPolicy};
    use rustls::crypto::aws_lc_rs;
    use std::collections::HashSet;

    #[test]
    fn test_build_supported_versions_defaults_to_tls12_and_tls13() {
        let versions = build_supported_versions(&TlsPolicy::default()).unwrap();
        assert_eq!(
            versions,
            vec![&rustls::version::TLS13, &rustls::version::TLS12]
        );
    }

    #[test]
    fn test_build_supported_versions_filters_to_tls13() {
        let versions = build_supported_versions(&TlsPolicy {
            min_version: "TLS 1.3".to_string(),
            max_version: "TLS 1.3".to_string(),
            cipher_suites: String::new(),
        })
        .unwrap();
        assert_eq!(versions, vec![&rustls::version::TLS13]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_build_supported_versions_rejects_unsupported_legacy_range() {
|
||||
let err = build_supported_versions(&TlsPolicy {
|
||||
min_version: "TLS 1.0".to_string(),
|
||||
max_version: "TLS 1.1".to_string(),
|
||||
cipher_suites: String::new(),
|
||||
})
|
||||
.unwrap_err();
|
||||
assert!(err.to_string().contains("unsupported by rustls"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_cipher_suites_accepts_go_names() {
|
||||
let cipher_suites = parse_cipher_suites(
|
||||
&aws_lc_rs::default_provider().cipher_suites,
|
||||
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256",
|
||||
)
|
||||
.unwrap();
|
||||
assert_eq!(cipher_suites.len(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_common_name_is_allowed_matches_exact_and_wildcard() {
|
||||
let allowed_common_names =
|
||||
HashSet::from([String::from("volume-a.internal"), String::from("worker-7")]);
|
||||
assert!(common_name_is_allowed(
|
||||
"volume-a.internal",
|
||||
&allowed_common_names,
|
||||
"",
|
||||
));
|
||||
assert!(common_name_is_allowed(
|
||||
"node.prod.example.com",
|
||||
&allowed_common_names,
|
||||
".example.com",
|
||||
));
|
||||
assert!(!common_name_is_allowed(
|
||||
"node.prod.other.net",
|
||||
&allowed_common_names,
|
||||
".example.com",
|
||||
));
|
||||
}
|
||||
}
|
||||
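The wildcard rule in `common_name_is_allowed` is a plain suffix match, so the configured domain is expected to carry its leading dot; without it, a name like `evilexample.com` would also pass for `example.com`. A standalone sketch of that edge (the helper name `cn_allowed` is ours, mirroring the function above):

```rust
use std::collections::HashSet;

// Suffix-based wildcard check, mirroring common_name_is_allowed above.
fn cn_allowed(common_name: &str, exact: &HashSet<String>, wildcard_suffix: &str) -> bool {
    (!wildcard_suffix.is_empty() && common_name.ends_with(wildcard_suffix))
        || exact.contains(common_name)
}

fn main() {
    let exact = HashSet::from([String::from("volume-a.internal")]);
    // The leading dot in the suffix keeps "evilexample.com" out.
    assert!(cn_allowed("node.example.com", &exact, ".example.com"));
    assert!(!cn_allowed("evilexample.com", &exact, ".example.com"));
    // An empty suffix disables wildcard matching entirely.
    assert!(!cn_allowed("node.example.com", &exact, ""));
    println!("ok");
}
```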
seaweed-volume/src/server/debug.rs (new file, +159 lines)
use axum::body::Body;
use axum::extract::Query;
use axum::http::{header, StatusCode};
use axum::response::{IntoResponse, Response};
use axum::routing::{any, get};
use axum::Router;
use pprof::protos::Message;
use serde::Deserialize;

#[derive(Deserialize, Default)]
struct ProfileQuery {
    seconds: Option<u64>,
}

pub fn build_debug_router() -> Router {
    Router::new()
        .route("/debug/pprof/", get(pprof_index_handler))
        .route("/debug/pprof/cmdline", get(pprof_cmdline_handler))
        .route("/debug/pprof/profile", get(pprof_profile_handler))
        .route("/debug/pprof/symbol", any(pprof_symbol_handler))
        .route("/debug/pprof/trace", get(pprof_trace_handler))
}

async fn pprof_index_handler() -> Response {
    let body = concat!(
        "<html><head><title>/debug/pprof/</title></head><body>",
        "<a href=\"cmdline\">cmdline</a><br>",
        "<a href=\"profile\">profile</a><br>",
        "<a href=\"symbol\">symbol</a><br>",
        "<a href=\"trace\">trace</a><br>",
        "</body></html>",
    );
    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "text/html; charset=utf-8")],
        body,
    )
        .into_response()
}

async fn pprof_cmdline_handler() -> Response {
    let body = std::env::args().collect::<Vec<_>>().join("\0");
    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "text/plain; charset=utf-8")],
        body,
    )
        .into_response()
}

async fn pprof_profile_handler(Query(query): Query<ProfileQuery>) -> Response {
    let seconds = query.seconds.unwrap_or(30).clamp(1, 300);
    let guard = match pprof::ProfilerGuard::new(100) {
        Ok(guard) => guard,
        Err(e) => {
            return (
                StatusCode::INTERNAL_SERVER_ERROR,
                format!("failed to start profiler: {}", e),
            )
                .into_response();
        }
    };

    tokio::time::sleep(std::time::Duration::from_secs(seconds)).await;

    let report = match guard.report().build() {
        Ok(report) => report,
        Err(e) => {
            return (
                StatusCode::INTERNAL_SERVER_ERROR,
                format!("failed to build profile report: {}", e),
            )
                .into_response();
        }
    };

    let profile = match report.pprof() {
        Ok(profile) => profile,
        Err(e) => {
            return (
                StatusCode::INTERNAL_SERVER_ERROR,
                format!("failed to encode profile: {}", e),
            )
                .into_response();
        }
    };

    let mut bytes = Vec::new();
    if let Err(e) = profile.encode(&mut bytes) {
        return (
            StatusCode::INTERNAL_SERVER_ERROR,
            format!("failed to serialize profile: {}", e),
        )
            .into_response();
    }

    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "application/octet-stream")],
        bytes,
    )
        .into_response()
}

async fn pprof_symbol_handler() -> Response {
    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "text/plain; charset=utf-8")],
        "num_symbols: 0\n",
    )
        .into_response()
}

async fn pprof_trace_handler(Query(query): Query<ProfileQuery>) -> Response {
    let seconds = query.seconds.unwrap_or(1).clamp(1, 30);
    tokio::time::sleep(std::time::Duration::from_secs(seconds)).await;
    Response::builder()
        .status(StatusCode::OK)
        .header(header::CONTENT_TYPE, "application/octet-stream")
        .body(Body::from(Vec::<u8>::new()))
        .unwrap()
}

#[cfg(test)]
mod tests {
    use super::*;
    use axum::http::Request;
    use tower::ServiceExt;

    #[tokio::test]
    async fn test_debug_index_route() {
        let app = build_debug_router();
        let response = app
            .oneshot(
                Request::builder()
                    .uri("/debug/pprof/")
                    .body(Body::empty())
                    .unwrap(),
            )
            .await
            .unwrap();
        assert_eq!(response.status(), StatusCode::OK);
    }

    #[tokio::test]
    async fn test_debug_cmdline_route() {
        let app = build_debug_router();
        let response = app
            .oneshot(
                Request::builder()
                    .uri("/debug/pprof/cmdline")
                    .body(Body::empty())
                    .unwrap(),
            )
            .await
            .unwrap();
        assert_eq!(response.status(), StatusCode::OK);
    }
}
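The profile handler's duration handling is a default-then-clamp on the `seconds` query parameter. A minimal standalone sketch (the helper name `profile_seconds` is ours):

```rust
// Missing `seconds` defaults to 30; out-of-range values clamp to 1..=300,
// matching the expression in pprof_profile_handler above.
fn profile_seconds(query_seconds: Option<u64>) -> u64 {
    query_seconds.unwrap_or(30).clamp(1, 300)
}

fn main() {
    assert_eq!(profile_seconds(None), 30);
    assert_eq!(profile_seconds(Some(0)), 1);
    assert_eq!(profile_seconds(Some(10_000)), 300);
    println!("ok");
}
```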
seaweed-volume/src/server/favicon.ico (new binary file, 70 B; binary file not shown)
seaweed-volume/src/server/grpc_client.rs (new file, +206 lines)
use std::error::Error;
use std::fmt;
use std::time::Duration;

use hyper::http::Uri;
use tonic::transport::{Certificate, Channel, ClientTlsConfig, Endpoint, Identity};

use crate::config::VolumeServerConfig;

pub const GRPC_MAX_MESSAGE_SIZE: usize = 1 << 30;
const GRPC_KEEPALIVE_INTERVAL: Duration = Duration::from_secs(60);
const GRPC_KEEPALIVE_TIMEOUT: Duration = Duration::from_secs(20);
const GRPC_INITIAL_WINDOW_SIZE: u32 = 16 * 1024 * 1024;

#[derive(Clone, Debug)]
pub struct OutgoingGrpcTlsConfig {
    cert_pem: String,
    key_pem: String,
    ca_pem: String,
}

#[derive(Debug)]
pub struct GrpcClientError(String);

impl fmt::Display for GrpcClientError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl Error for GrpcClientError {}

pub fn load_outgoing_grpc_tls(
    config: &VolumeServerConfig,
) -> Result<Option<OutgoingGrpcTlsConfig>, GrpcClientError> {
    if config.grpc_cert_file.is_empty()
        || config.grpc_key_file.is_empty()
        || config.grpc_ca_file.is_empty()
    {
        return Ok(None);
    }

    let cert_pem = std::fs::read_to_string(&config.grpc_cert_file).map_err(|e| {
        GrpcClientError(format!(
            "Failed to read outgoing gRPC cert '{}': {}",
            config.grpc_cert_file, e
        ))
    })?;
    let key_pem = std::fs::read_to_string(&config.grpc_key_file).map_err(|e| {
        GrpcClientError(format!(
            "Failed to read outgoing gRPC key '{}': {}",
            config.grpc_key_file, e
        ))
    })?;
    let ca_pem = std::fs::read_to_string(&config.grpc_ca_file).map_err(|e| {
        GrpcClientError(format!(
            "Failed to read outgoing gRPC CA '{}': {}",
            config.grpc_ca_file, e
        ))
    })?;

    Ok(Some(OutgoingGrpcTlsConfig {
        cert_pem,
        key_pem,
        ca_pem,
    }))
}

pub fn grpc_endpoint_uri(grpc_host_port: &str, tls: Option<&OutgoingGrpcTlsConfig>) -> String {
    let scheme = if tls.is_some() { "https" } else { "http" };
    format!("{}://{}", scheme, grpc_host_port)
}

pub fn build_grpc_endpoint(
    grpc_host_port: &str,
    tls: Option<&OutgoingGrpcTlsConfig>,
) -> Result<Endpoint, GrpcClientError> {
    let uri = grpc_endpoint_uri(grpc_host_port, tls);
    let mut endpoint = Channel::from_shared(uri.clone())
        .map_err(|e| GrpcClientError(format!("invalid gRPC endpoint {}: {}", uri, e)))?
        .http2_keep_alive_interval(GRPC_KEEPALIVE_INTERVAL)
        .keep_alive_timeout(GRPC_KEEPALIVE_TIMEOUT)
        .keep_alive_while_idle(false)
        .initial_stream_window_size(Some(GRPC_INITIAL_WINDOW_SIZE))
        .initial_connection_window_size(Some(GRPC_INITIAL_WINDOW_SIZE))
        .http2_adaptive_window(false);

    if let Some(tls) = tls {
        let parsed = uri
            .parse::<Uri>()
            .map_err(|e| GrpcClientError(format!("invalid gRPC endpoint {}: {}", uri, e)))?;
        let host = parsed
            .host()
            .ok_or_else(|| GrpcClientError(format!("missing host in gRPC endpoint {}", uri)))?;
        let tls_config = ClientTlsConfig::new()
            .identity(Identity::from_pem(
                tls.cert_pem.clone(),
                tls.key_pem.clone(),
            ))
            .ca_certificate(Certificate::from_pem(tls.ca_pem.clone()))
            .domain_name(host.to_string());
        endpoint = endpoint.tls_config(tls_config).map_err(|e| {
            GrpcClientError(format!("configure gRPC TLS for {} failed: {}", uri, e))
        })?;
    }

    Ok(endpoint)
}

#[cfg(test)]
mod tests {
    use super::{build_grpc_endpoint, grpc_endpoint_uri, load_outgoing_grpc_tls};
    use crate::config::{NeedleMapKind, ReadMode, VolumeServerConfig};
    use crate::security::tls::TlsPolicy;

    fn sample_config() -> VolumeServerConfig {
        VolumeServerConfig {
            port: 8080,
            grpc_port: 18080,
            public_port: 8080,
            ip: "127.0.0.1".to_string(),
            bind_ip: String::new(),
            public_url: "127.0.0.1:8080".to_string(),
            id: String::new(),
            masters: vec![],
            pre_stop_seconds: 0,
            idle_timeout: 0,
            data_center: String::new(),
            rack: String::new(),
            index_type: NeedleMapKind::InMemory,
            disk_type: String::new(),
            folders: vec![],
            folder_max_limits: vec![],
            folder_tags: vec![],
            min_free_spaces: vec![],
            disk_types: vec![],
            idx_folder: String::new(),
            white_list: vec![],
            fix_jpg_orientation: false,
            read_mode: ReadMode::Local,
            cpu_profile: String::new(),
            mem_profile: String::new(),
            compaction_byte_per_second: 0,
            maintenance_byte_per_second: 0,
            file_size_limit_bytes: 0,
            concurrent_upload_limit: 0,
            concurrent_download_limit: 0,
            inflight_upload_data_timeout: std::time::Duration::from_secs(0),
            inflight_download_data_timeout: std::time::Duration::from_secs(0),
            has_slow_read: false,
            read_buffer_size_mb: 0,
            ldb_timeout: 0,
            pprof: false,
            metrics_port: 0,
            metrics_ip: String::new(),
            debug: false,
            debug_port: 0,
            ui_enabled: false,
            jwt_signing_key: vec![],
            jwt_signing_expires_seconds: 0,
            jwt_read_signing_key: vec![],
            jwt_read_signing_expires_seconds: 0,
            https_cert_file: String::new(),
            https_key_file: String::new(),
            https_ca_file: String::new(),
            https_client_enabled: false,
            https_client_cert_file: String::new(),
            https_client_key_file: String::new(),
            https_client_ca_file: String::new(),
            grpc_cert_file: String::new(),
            grpc_key_file: String::new(),
            grpc_ca_file: String::new(),
            grpc_allowed_wildcard_domain: String::new(),
            grpc_volume_allowed_common_names: vec![],
            tls_policy: TlsPolicy::default(),
            enable_write_queue: false,
            security_file: String::new(),
        }
    }

    #[test]
    fn test_grpc_endpoint_uri_uses_https_when_tls_enabled() {
        let tls = super::OutgoingGrpcTlsConfig {
            cert_pem: "cert".to_string(),
            key_pem: "key".to_string(),
            ca_pem: "ca".to_string(),
        };
        assert_eq!(
            grpc_endpoint_uri("master.example.com:19333", Some(&tls)),
            "https://master.example.com:19333"
        );
    }

    #[test]
    fn test_load_outgoing_grpc_tls_requires_cert_key_and_ca() {
        let mut config = sample_config();
        config.grpc_cert_file = "/tmp/client.pem".to_string();
        assert!(load_outgoing_grpc_tls(&config).unwrap().is_none());
    }

    #[test]
    fn test_build_grpc_endpoint_without_tls_uses_http_scheme() {
        let endpoint = build_grpc_endpoint("127.0.0.1:19333", None).unwrap();
        assert_eq!(endpoint.uri().scheme_str(), Some("http"));
    }
}
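The scheme decision in `grpc_endpoint_uri` is the only place the TLS bundle influences the dial string: its presence flips `http` to `https`. A dependency-free sketch of that rule (the helper name `endpoint_uri` is ours):

```rust
// Scheme selection as in grpc_endpoint_uri above: a configured client TLS
// bundle means the endpoint must be dialed over https.
fn endpoint_uri(host_port: &str, tls_configured: bool) -> String {
    let scheme = if tls_configured { "https" } else { "http" };
    format!("{}://{}", scheme, host_port)
}

fn main() {
    assert_eq!(endpoint_uri("127.0.0.1:19333", false), "http://127.0.0.1:19333");
    assert_eq!(
        endpoint_uri("master.example.com:19333", true),
        "https://master.example.com:19333"
    );
    println!("ok");
}
```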
seaweed-volume/src/server/grpc_server.rs (new file, +4536 lines; file diff suppressed because it is too large)
seaweed-volume/src/server/handlers.rs (new file, +3913 lines; file diff suppressed because it is too large)
seaweed-volume/src/server/heartbeat.rs (new file, +1576 lines; file diff suppressed because it is too large)

seaweed-volume/src/server/memory_status.rs (new file, +102 lines)
use crate::pb::volume_server_pb;

pub fn collect_mem_status() -> volume_server_pb::MemStatus {
    #[allow(unused_mut)]
    let mut mem = volume_server_pb::MemStatus {
        goroutines: 1,
        ..Default::default()
    };

    #[cfg(target_os = "linux")]
    {
        if let Some((all, free)) = get_system_memory_linux() {
            mem.all = all;
            mem.free = free;
            mem.used = all.saturating_sub(free);
        }

        if let Some(status) = read_process_status_linux() {
            if status.threads > 0 {
                mem.goroutines = status.threads as i32;
            }
            if let Some(rss) = status.rss {
                mem.self_ = rss;
            }
            if let Some(heap) = status.data.or(status.rss) {
                mem.heap = heap;
            }
            if let Some(stack) = status.stack {
                mem.stack = stack;
            }
        }
    }

    mem
}

#[cfg(target_os = "linux")]
fn get_system_memory_linux() -> Option<(u64, u64)> {
    unsafe {
        let mut info: libc::sysinfo = std::mem::zeroed();
        if libc::sysinfo(&mut info) == 0 {
            let unit = info.mem_unit as u64;
            let total = info.totalram as u64 * unit;
            let free = info.freeram as u64 * unit;
            return Some((total, free));
        }
    }
    None
}

#[cfg(target_os = "linux")]
#[derive(Default)]
struct ProcessStatus {
    threads: u64,
    rss: Option<u64>,
    data: Option<u64>,
    stack: Option<u64>,
}

#[cfg(target_os = "linux")]
fn read_process_status_linux() -> Option<ProcessStatus> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    let mut out = ProcessStatus::default();

    for line in status.lines() {
        if let Some(value) = line.strip_prefix("Threads:") {
            out.threads = value.trim().parse().ok()?;
            continue;
        }
        if let Some(value) = parse_proc_status_kib_field(line, "VmRSS:") {
            out.rss = Some(value);
            continue;
        }
        if let Some(value) = parse_proc_status_kib_field(line, "VmData:") {
            out.data = Some(value);
            continue;
        }
        if let Some(value) = parse_proc_status_kib_field(line, "VmStk:") {
            out.stack = Some(value);
        }
    }

    Some(out)
}

#[cfg(target_os = "linux")]
fn parse_proc_status_kib_field(line: &str, prefix: &str) -> Option<u64> {
    let raw = line.strip_prefix(prefix)?.trim();
    let value = raw.strip_suffix(" kB").unwrap_or(raw).trim();
    value.parse::<u64>().ok().map(|kib| kib * 1024)
}

#[cfg(test)]
mod tests {
    use super::collect_mem_status;

    #[test]
    fn test_collect_mem_status_reports_live_process_state() {
        let mem = collect_mem_status();
        assert!(mem.goroutines > 0);
    }
}
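`/proc/self/status` reports memory fields as `Name:\t  <value> kB`, and `parse_proc_status_kib_field` converts that to bytes. A standalone sketch without the `cfg(target_os = "linux")` gate (the helper name `parse_kib_field` is ours; the body mirrors the function above):

```rust
// Parse a "VmRSS:   1234 kB"-style /proc/self/status line into bytes.
fn parse_kib_field(line: &str, prefix: &str) -> Option<u64> {
    let raw = line.strip_prefix(prefix)?.trim();
    let value = raw.strip_suffix(" kB").unwrap_or(raw).trim();
    value.parse::<u64>().ok().map(|kib| kib * 1024)
}

fn main() {
    // 1234 kB -> 1234 * 1024 bytes.
    assert_eq!(parse_kib_field("VmRSS:\t  1234 kB", "VmRSS:"), Some(1_263_616));
    // A line with a different prefix does not match.
    assert_eq!(parse_kib_field("Threads:\t8", "VmRSS:"), None);
    println!("ok");
}
```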
seaweed-volume/src/server/mod.rs (new file, +12 lines)
pub mod debug;
pub mod grpc_client;
pub mod grpc_server;
pub mod handlers;
pub mod heartbeat;
pub mod memory_status;
pub mod profiling;
pub mod request_id;
pub mod server_stats;
pub mod ui;
pub mod volume_server;
pub mod write_queue;
seaweed-volume/src/server/profiling.rs (new file, +187 lines)
use std::fs::File;
use std::io::Write;
use std::path::PathBuf;

use pprof::protos::Message;

use crate::config::VolumeServerConfig;

const GO_CPU_PROFILE_FREQUENCY: i32 = 100;
const GO_PPROF_BLOCKLIST: [&str; 4] = ["libc", "libgcc", "pthread", "vdso"];

pub struct CpuProfileSession {
    output_path: PathBuf,
    guard: pprof::ProfilerGuard<'static>,
}

impl CpuProfileSession {
    pub fn start(config: &VolumeServerConfig) -> Result<Option<Self>, String> {
        if config.cpu_profile.is_empty() {
            if !config.mem_profile.is_empty() && !config.pprof {
                tracing::warn!(
                    "--memprofile is not yet supported in the Rust volume server; ignoring '{}'",
                    config.mem_profile
                );
            }
            return Ok(None);
        }

        if config.pprof {
            tracing::info!(
                "--pprof is enabled; ignoring --cpuprofile '{}' and --memprofile '{}'",
                config.cpu_profile,
                config.mem_profile
            );
            return Ok(None);
        }

        if !config.mem_profile.is_empty() {
            tracing::warn!(
                "--memprofile is not yet supported in the Rust volume server; only --cpuprofile '{}' will be written",
                config.cpu_profile
            );
        }

        let guard = pprof::ProfilerGuardBuilder::default()
            .frequency(GO_CPU_PROFILE_FREQUENCY)
            .blocklist(&GO_PPROF_BLOCKLIST)
            .build()
            .map_err(|e| {
                format!(
                    "Failed to start CPU profiler '{}': {}",
                    config.cpu_profile, e
                )
            })?;

        Ok(Some(Self {
            output_path: PathBuf::from(&config.cpu_profile),
            guard,
        }))
    }

    pub fn finish(self) -> Result<(), String> {
        let report = self
            .guard
            .report()
            .build()
            .map_err(|e| format!("Failed to build CPU profile report: {}", e))?;
        let profile = report
            .pprof()
            .map_err(|e| format!("Failed to encode CPU profile report: {}", e))?;

        let mut bytes = Vec::new();
        profile
            .encode(&mut bytes)
            .map_err(|e| format!("Failed to serialize CPU profile report: {}", e))?;

        let mut file = File::create(&self.output_path).map_err(|e| {
            format!(
                "Failed to create CPU profile '{}': {}",
                self.output_path.display(),
                e
            )
        })?;
        file.write_all(&bytes).map_err(|e| {
            format!(
                "Failed to write CPU profile '{}': {}",
                self.output_path.display(),
                e
            )
        })?;
        file.flush().map_err(|e| {
            format!(
                "Failed to flush CPU profile '{}': {}",
                self.output_path.display(),
                e
            )
        })?;

        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use super::CpuProfileSession;
    use crate::config::{NeedleMapKind, ReadMode, VolumeServerConfig};
    use crate::security::tls::TlsPolicy;

    fn sample_config() -> VolumeServerConfig {
        VolumeServerConfig {
            port: 8080,
            grpc_port: 18080,
            public_port: 8080,
            ip: "127.0.0.1".to_string(),
            bind_ip: "127.0.0.1".to_string(),
            public_url: "127.0.0.1:8080".to_string(),
            id: "127.0.0.1:8080".to_string(),
            masters: vec![],
            pre_stop_seconds: 0,
            idle_timeout: 0,
            data_center: String::new(),
            rack: String::new(),
            index_type: NeedleMapKind::InMemory,
            disk_type: String::new(),
            folders: vec!["/tmp".to_string()],
            folder_max_limits: vec![8],
            folder_tags: vec![vec![]],
            min_free_spaces: vec![],
            disk_types: vec![String::new()],
            idx_folder: String::new(),
            white_list: vec![],
            fix_jpg_orientation: false,
            read_mode: ReadMode::Local,
            cpu_profile: String::new(),
            mem_profile: String::new(),
            compaction_byte_per_second: 0,
            maintenance_byte_per_second: 0,
            file_size_limit_bytes: 0,
            concurrent_upload_limit: 0,
            concurrent_download_limit: 0,
            inflight_upload_data_timeout: std::time::Duration::from_secs(0),
            inflight_download_data_timeout: std::time::Duration::from_secs(0),
            has_slow_read: false,
            read_buffer_size_mb: 4,
            ldb_timeout: 0,
            pprof: false,
            metrics_port: 0,
            metrics_ip: String::new(),
            debug: false,
            debug_port: 0,
            ui_enabled: false,
            jwt_signing_key: vec![],
            jwt_signing_expires_seconds: 0,
            jwt_read_signing_key: vec![],
            jwt_read_signing_expires_seconds: 0,
            https_cert_file: String::new(),
            https_key_file: String::new(),
            https_ca_file: String::new(),
            https_client_enabled: false,
            https_client_cert_file: String::new(),
            https_client_key_file: String::new(),
            https_client_ca_file: String::new(),
            grpc_cert_file: String::new(),
            grpc_key_file: String::new(),
            grpc_ca_file: String::new(),
            grpc_allowed_wildcard_domain: String::new(),
            grpc_volume_allowed_common_names: vec![],
            tls_policy: TlsPolicy::default(),
            enable_write_queue: false,
            security_file: String::new(),
        }
    }

    #[test]
    fn test_cpu_profile_session_skips_when_disabled() {
        let config = sample_config();
        assert!(CpuProfileSession::start(&config).unwrap().is_none());
    }

    #[test]
    fn test_cpu_profile_session_skips_when_pprof_enabled() {
        let mut config = sample_config();
        config.cpu_profile = "/tmp/cpu.pb".to_string();
        config.pprof = true;
        assert!(CpuProfileSession::start(&config).unwrap().is_none());
    }
}
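`CpuProfileSession::start` encodes a precedence rule: the `--pprof` HTTP endpoint takes priority over file-based `--cpuprofile`, and an empty `--cpuprofile` means no session at all. A pure-function sketch of that decision (the helper name `should_start_cpu_profile` is ours):

```rust
// Decision logic mirroring CpuProfileSession::start above: a file-based CPU
// profile runs only when a path is configured and --pprof is not enabled.
fn should_start_cpu_profile(cpu_profile: &str, pprof_enabled: bool) -> bool {
    !cpu_profile.is_empty() && !pprof_enabled
}

fn main() {
    assert!(!should_start_cpu_profile("", false));
    assert!(!should_start_cpu_profile("/tmp/cpu.pb", true));
    assert!(should_start_cpu_profile("/tmp/cpu.pb", false));
    println!("ok");
}
```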
seaweed-volume/src/server/request_id.rs (new file, +137 lines)
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use hyper::http::{self, HeaderValue};
use tonic::metadata::MetadataValue;
use tonic::{Request, Status};
use tower::{Layer, Service};

tokio::task_local! {
    static CURRENT_REQUEST_ID: String;
}

#[derive(Clone, Debug, Default)]
pub struct GrpcRequestIdLayer;

#[derive(Clone, Debug)]
pub struct GrpcRequestIdService<S> {
    inner: S,
}

impl<S> Layer<S> for GrpcRequestIdLayer {
    type Service = GrpcRequestIdService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        GrpcRequestIdService { inner }
    }
}

impl<S, B> Service<http::Request<B>> for GrpcRequestIdService<S>
where
    S: Service<http::Request<B>, Response = http::Response<tonic::body::BoxBody>> + Send + 'static,
    S::Future: Send + 'static,
    B: Send + 'static,
{
    type Response = http::Response<tonic::body::BoxBody>;
    type Error = S::Error;
    type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>> + Send>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, mut request: http::Request<B>) -> Self::Future {
        let request_id = match request.headers().get("x-amz-request-id") {
            Some(value) => match value.to_str() {
                Ok(value) if !value.is_empty() => value.to_owned(),
                _ => generate_grpc_request_id(),
            },
            None => generate_grpc_request_id(),
        };

        if let Ok(value) = HeaderValue::from_str(&request_id) {
            request.headers_mut().insert("x-amz-request-id", value);
        }

        let future = self.inner.call(request);

        Box::pin(async move {
            let mut response: http::Response<tonic::body::BoxBody> =
                scope_request_id(request_id.clone(), future).await?;
            if let Ok(value) = HeaderValue::from_str(&request_id) {
                response.headers_mut().insert("x-amz-request-id", value);
            }
            Ok(response)
        })
    }
}

pub async fn scope_request_id<F, T>(request_id: String, future: F) -> T
where
    F: Future<Output = T>,
{
    CURRENT_REQUEST_ID.scope(request_id, future).await
}

pub fn current_request_id() -> Option<String> {
    CURRENT_REQUEST_ID.try_with(Clone::clone).ok()
}

pub fn outgoing_request_id_interceptor(mut request: Request<()>) -> Result<Request<()>, Status> {
    if let Some(request_id) = current_request_id() {
        let value = MetadataValue::try_from(request_id.as_str())
            .map_err(|_| Status::internal("invalid scoped request id"))?;
        request.metadata_mut().insert("x-amz-request-id", value);
    }
    Ok(request)
}

pub fn generate_http_request_id() -> String {
    use rand::Rng;

    let nanos = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap_or_default()
        .as_nanos() as u64;
    let rand_val: u32 = rand::thread_rng().gen();
    format!("{:X}{:08X}", nanos, rand_val)
}

fn generate_grpc_request_id() -> String {
    uuid::Uuid::new_v4().to_string()
}

#[cfg(test)]
mod tests {
    use super::{current_request_id, outgoing_request_id_interceptor, scope_request_id};
    use tonic::Request;

    #[tokio::test]
    async fn test_scope_request_id_exposes_current_value() {
        let request_id = "req-123".to_string();
        let current = scope_request_id(request_id.clone(), async move {
            current_request_id().unwrap()
        })
        .await;
        assert_eq!(current, request_id);
    }

    #[tokio::test]
    async fn test_outgoing_request_id_interceptor_propagates_scope() {
        let request = scope_request_id("req-456".to_string(), async move {
            outgoing_request_id_interceptor(Request::new(())).unwrap()
        })
        .await;
        assert_eq!(
            request
                .metadata()
                .get("x-amz-request-id")
                .unwrap()
                .to_str()
                .unwrap(),
            "req-456"
        );
    }
}
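`generate_http_request_id` concatenates the nanosecond timestamp in upper-case hex with a zero-padded 8-digit hex random value. A deterministic sketch of just the formatting step, with the random component passed in so the output is fixed (the helper name `format_request_id` is ours):

```rust
// Same format string as generate_http_request_id above:
// {:X} for the timestamp, {:08X} for the zero-padded random suffix.
fn format_request_id(nanos: u64, rand_val: u32) -> String {
    format!("{:X}{:08X}", nanos, rand_val)
}

fn main() {
    // 255 -> "FF"; 0xAB zero-padded to eight hex digits -> "000000AB".
    assert_eq!(format_request_id(255, 0xAB), "FF000000AB");
    println!("ok");
}
```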
seaweed-volume/src/server/server_stats.rs (new file, +248 lines)
use chrono::{Datelike, Local, Timelike};
use serde::Serialize;
use std::sync::{LazyLock, Mutex};
use std::time::Instant;

static START_TIME: LazyLock<Instant> = LazyLock::new(Instant::now);
static SERVER_STATS: LazyLock<ServerStats> = LazyLock::new(ServerStats::default);

#[derive(Default)]
pub struct ServerStats {
    inner: Mutex<ServerStatsInner>,
}

#[derive(Default)]
struct ServerStatsInner {
    requests: DurationCounter,
    connections: DurationCounter,
    assign_requests: DurationCounter,
    read_requests: DurationCounter,
    write_requests: DurationCounter,
    delete_requests: DurationCounter,
    bytes_in: DurationCounter,
    bytes_out: DurationCounter,
}

#[derive(Clone, Serialize)]
#[serde(rename_all = "PascalCase")]
pub struct ServerStatsSnapshot {
    pub requests: DurationCounterSnapshot,
    pub connections: DurationCounterSnapshot,
    pub assign_requests: DurationCounterSnapshot,
    pub read_requests: DurationCounterSnapshot,
    pub write_requests: DurationCounterSnapshot,
    pub delete_requests: DurationCounterSnapshot,
    pub bytes_in: DurationCounterSnapshot,
    pub bytes_out: DurationCounterSnapshot,
}

#[derive(Clone, Serialize)]
#[serde(rename_all = "PascalCase")]
pub struct DurationCounterSnapshot {
    pub minute_counter: RoundRobinCounterSnapshot,
    pub hour_counter: RoundRobinCounterSnapshot,
    pub day_counter: RoundRobinCounterSnapshot,
    pub week_counter: RoundRobinCounterSnapshot,
}

#[derive(Clone, Serialize)]
#[serde(rename_all = "PascalCase")]
pub struct RoundRobinCounterSnapshot {
    pub last_index: i32,
    pub values: Vec<i64>,
    pub counts: Vec<i64>,
}

#[derive(Clone)]
struct DurationCounter {
    minute_counter: RoundRobinCounter,
    hour_counter: RoundRobinCounter,
    day_counter: RoundRobinCounter,
    week_counter: RoundRobinCounter,
}

#[derive(Clone)]
struct RoundRobinCounter {
    last_index: i32,
    values: Vec<i64>,
    counts: Vec<i64>,
}

impl Default for DurationCounter {
    fn default() -> Self {
        Self {
            minute_counter: RoundRobinCounter::new(60),
            hour_counter: RoundRobinCounter::new(60),
            day_counter: RoundRobinCounter::new(24),
            week_counter: RoundRobinCounter::new(7),
        }
    }
}

impl RoundRobinCounter {
    fn new(slots: usize) -> Self {
        Self {
            last_index: -1,
            values: vec![0; slots],
            counts: vec![0; slots],
        }
    }

    fn add(&mut self, index: usize, val: i64) {
        if index >= self.values.len() {
            return;
        }
        while self.last_index != index as i32 {
            self.last_index = (self.last_index + 1).rem_euclid(self.values.len() as i32);
            self.values[self.last_index as usize] = 0;
            self.counts[self.last_index as usize] = 0;
        }
        self.values[index] += val;
        self.counts[index] += 1;
    }

    fn snapshot(&self) -> RoundRobinCounterSnapshot {
        RoundRobinCounterSnapshot {
            last_index: self.last_index,
            values: self.values.clone(),
            counts: self.counts.clone(),
        }
    }
}

impl DurationCounter {
    fn add_now(&mut self, val: i64) {
        let now = Local::now();
        self.minute_counter.add(now.second() as usize, val);
        self.hour_counter.add(now.minute() as usize, val);
        self.day_counter.add(now.hour() as usize, val);
        self.week_counter
            .add(now.weekday().num_days_from_sunday() as usize, val);
    }

    fn snapshot(&self) -> DurationCounterSnapshot {
        DurationCounterSnapshot {
            minute_counter: self.minute_counter.snapshot(),
            hour_counter: self.hour_counter.snapshot(),
|
||||
day_counter: self.day_counter.snapshot(),
|
||||
week_counter: self.week_counter.snapshot(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl ServerStatsInner {
|
||||
fn snapshot(&self) -> ServerStatsSnapshot {
|
||||
ServerStatsSnapshot {
|
||||
requests: self.requests.snapshot(),
|
||||
connections: self.connections.snapshot(),
|
||||
assign_requests: self.assign_requests.snapshot(),
|
||||
read_requests: self.read_requests.snapshot(),
|
||||
write_requests: self.write_requests.snapshot(),
|
||||
delete_requests: self.delete_requests.snapshot(),
|
||||
bytes_in: self.bytes_in.snapshot(),
|
||||
bytes_out: self.bytes_out.snapshot(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl ServerStats {
|
||||
fn update<F>(&self, update: F)
|
||||
where
|
||||
F: FnOnce(&mut ServerStatsInner),
|
||||
{
|
||||
let mut inner = self.inner.lock().unwrap();
|
||||
update(&mut inner);
|
||||
}
|
||||
|
||||
fn snapshot(&self) -> ServerStatsSnapshot {
|
||||
self.inner.lock().unwrap().snapshot()
|
||||
}
|
||||
}
|
||||
|
||||
impl RoundRobinCounterSnapshot {
|
||||
pub fn to_list(&self) -> Vec<i64> {
|
||||
if self.values.is_empty() {
|
||||
return Vec::new();
|
||||
}
|
||||
let mut ret = Vec::with_capacity(self.values.len());
|
||||
let mut index = self.last_index;
|
||||
let mut step = self.values.len();
|
||||
while step > 0 {
|
||||
step -= 1;
|
||||
index += 1;
|
||||
if index >= self.values.len() as i32 {
|
||||
index = 0;
|
||||
}
|
||||
ret.push(self.values[index as usize]);
|
||||
}
|
||||
ret
|
||||
}
|
||||
}
|
||||
|
||||
pub fn init_process_start() {
|
||||
LazyLock::force(&START_TIME);
|
||||
LazyLock::force(&SERVER_STATS);
|
||||
}
|
||||
|
||||
pub fn uptime_string() -> String {
|
||||
let secs = START_TIME.elapsed().as_secs();
|
||||
let hours = secs / 3600;
|
||||
let minutes = (secs % 3600) / 60;
|
||||
let seconds = secs % 60;
|
||||
let mut out = String::new();
|
||||
if hours > 0 {
|
||||
out.push_str(&format!("{}h", hours));
|
||||
}
|
||||
if hours > 0 || minutes > 0 {
|
||||
out.push_str(&format!("{}m", minutes));
|
||||
}
|
||||
out.push_str(&format!("{}s", seconds));
|
||||
out
|
||||
}
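As a quick check of the formatting rules in `uptime_string` (hours omitted when zero, minutes omitted only when hours and minutes are both zero), here is a standalone sketch that takes the second count as a parameter instead of reading `START_TIME`; the free function `uptime` is illustrative only, not part of the module:

```rust
// Standalone sketch of the uptime formatting logic. Mirrors
// uptime_string(), but takes the elapsed seconds as an argument.
fn uptime(secs: u64) -> String {
    let hours = secs / 3600;
    let minutes = (secs % 3600) / 60;
    let seconds = secs % 60;
    let mut out = String::new();
    if hours > 0 {
        out.push_str(&format!("{}h", hours));
    }
    // Minutes are printed whenever hours are shown, even if zero.
    if hours > 0 || minutes > 0 {
        out.push_str(&format!("{}m", minutes));
    }
    out.push_str(&format!("{}s", seconds));
    out
}

fn main() {
    assert_eq!(uptime(59), "59s");
    assert_eq!(uptime(61), "1m1s");
    assert_eq!(uptime(3661), "1h1m1s");
    println!("{}", uptime(3661));
}
```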

pub fn snapshot() -> ServerStatsSnapshot {
    SERVER_STATS.snapshot()
}

pub fn record_request_open() {
    SERVER_STATS.update(|inner| inner.requests.add_now(1));
}

pub fn record_request_close() {
    SERVER_STATS.update(|inner| inner.requests.add_now(-1));
}

pub fn record_connection_open() {
    SERVER_STATS.update(|inner| inner.connections.add_now(1));
}

pub fn record_connection_close() {
    SERVER_STATS.update(|inner| inner.connections.add_now(-1));
}

pub fn record_read_request() {
    SERVER_STATS.update(|inner| inner.read_requests.add_now(1));
}

pub fn record_write_request() {
    SERVER_STATS.update(|inner| inner.write_requests.add_now(1));
}

pub fn record_delete_request() {
    SERVER_STATS.update(|inner| inner.delete_requests.add_now(1));
}

pub fn record_bytes_in(bytes: i64) {
    SERVER_STATS.update(|inner| inner.bytes_in.add_now(bytes));
}

pub fn record_bytes_out(bytes: i64) {
    SERVER_STATS.update(|inner| inner.bytes_out.add_now(bytes));
}

#[cfg(test)]
pub fn reset_for_tests() {
    LazyLock::force(&START_TIME);
    let mut inner = SERVER_STATS.inner.lock().unwrap();
    *inner = ServerStatsInner::default();
}
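The round-robin slot logic above has two subtleties worth pinning down: `add` zeroes every slot it skips while catching `last_index` up to the target slot, and `to_list` walks the ring starting just after `last_index` so the result is oldest-first. This standalone sketch re-implements both on a minimal `Counter` type (illustrative only; the module's `RoundRobinCounter` is private):

```rust
// Minimal standalone sketch of RoundRobinCounter's wrap-around
// semantics: skipped slots are zeroed, and listing starts just
// after the last written slot (oldest-first).
struct Counter {
    last_index: i32,
    values: Vec<i64>,
}

impl Counter {
    fn new(slots: usize) -> Self {
        Self { last_index: -1, values: vec![0; slots] }
    }

    fn add(&mut self, index: usize, val: i64) {
        if index >= self.values.len() {
            return;
        }
        // Zero out every slot between the previous position and `index`.
        while self.last_index != index as i32 {
            self.last_index = (self.last_index + 1).rem_euclid(self.values.len() as i32);
            self.values[self.last_index as usize] = 0;
        }
        self.values[index] += val;
    }

    // Oldest slot first, most recently written slot last.
    fn to_list(&self) -> Vec<i64> {
        let n = self.values.len();
        (1..=n)
            .map(|step| self.values[(self.last_index as usize + step) % n])
            .collect()
    }
}

fn main() {
    let mut c = Counter::new(4);
    c.add(0, 5); // slot 0 holds 5
    c.add(2, 7); // slot 1 is zeroed in passing, slot 2 holds 7
    assert_eq!(c.values, vec![5, 0, 7, 0]);
    // Listing starts after slot 2, so slot 3 (oldest) comes first.
    assert_eq!(c.to_list(), vec![0, 5, 0, 7]);
    println!("{:?}", c.to_list());
}
```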
seaweed-volume/src/server/ui.rs (new file, 507 lines)
@@ -0,0 +1,507 @@
use std::fmt::Write as _;

use crate::server::server_stats;
use crate::server::volume_server::VolumeServerState;
use crate::storage::store::Store;

pub struct EmbeddedAsset {
    pub content_type: &'static str,
    pub bytes: &'static [u8],
}

struct UiDiskRow {
    dir: String,
    disk_type: String,
    all: u64,
    free: u64,
    used: u64,
}

struct UiVolumeRow {
    id: u32,
    collection: String,
    disk_type: String,
    size: u64,
    file_count: i64,
    delete_count: i64,
    deleted_byte_count: u64,
    ttl: String,
    read_only: bool,
    version: u32,
    remote_storage_name: String,
    remote_storage_key: String,
}

struct UiEcShardRow {
    shard_id: u8,
    size: u64,
}

struct UiEcVolumeRow {
    volume_id: u32,
    collection: String,
    size: u64,
    shards: Vec<UiEcShardRow>,
    created_at: String,
}

pub fn favicon_asset() -> EmbeddedAsset {
    EmbeddedAsset {
        content_type: "image/x-icon",
        bytes: include_bytes!(concat!(
            env!("CARGO_MANIFEST_DIR"),
            "/../weed/static/favicon.ico"
        )),
    }
}

pub fn lookup_static_asset(path: &str) -> Option<EmbeddedAsset> {
    let path = path.trim_start_matches('/');
    let asset = match path {
        "bootstrap/3.3.1/css/bootstrap.min.css" => EmbeddedAsset {
            content_type: "text/css; charset=utf-8",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/css/bootstrap.min.css"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.eot" => EmbeddedAsset {
            content_type: "application/vnd.ms-fontobject",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.eot"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.svg" => EmbeddedAsset {
            content_type: "image/svg+xml",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.svg"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.ttf" => EmbeddedAsset {
            content_type: "font/ttf",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.ttf"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.woff" => EmbeddedAsset {
            content_type: "font/woff",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.woff"
            )),
        },
        "images/folder.gif" => EmbeddedAsset {
            content_type: "image/gif",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/images/folder.gif"
            )),
        },
        "javascript/jquery-3.6.0.min.js" => EmbeddedAsset {
            content_type: "application/javascript; charset=utf-8",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/javascript/jquery-3.6.0.min.js"
            )),
        },
        "javascript/jquery-sparklines/2.1.2/jquery.sparkline.min.js" => EmbeddedAsset {
            content_type: "application/javascript; charset=utf-8",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/javascript/jquery-sparklines/2.1.2/jquery.sparkline.min.js"
            )),
        },
        "seaweed50x50.png" => EmbeddedAsset {
            content_type: "image/png",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/seaweed50x50.png"
            )),
        },
        _ => return None,
    };
    Some(asset)
}

pub fn render_volume_server_html(state: &VolumeServerState) -> String {
    let counters = server_stats::snapshot();
    let (disk_rows, volume_rows, remote_volume_rows, ec_volume_rows) = {
        let store = state.store.read().unwrap();
        collect_ui_data(&store)
    };

    let masters = if state.master_urls.is_empty() {
        "[]".to_string()
    } else {
        format!("[{}]", state.master_urls.join(" "))
    };
    let uptime = server_stats::uptime_string();
    let read_week = join_i64(&counters.read_requests.week_counter.to_list());
    let read_day = join_i64(&counters.read_requests.day_counter.to_list());
    let read_hour = join_i64(&counters.read_requests.hour_counter.to_list());
    let read_minute = join_i64(&counters.read_requests.minute_counter.to_list());

    let mut disk_rows_html = String::new();
    for disk in &disk_rows {
        let _ = write!(
            disk_rows_html,
            "<tr><td>{}</td><td>{}</td><td>{}</td><td>{}</td><td>{:.2}%</td></tr>",
            escape_html(&disk.dir),
            escape_html(&disk.disk_type),
            bytes_to_human_readable(disk.all),
            bytes_to_human_readable(disk.free),
            percent_from(disk.all, disk.used),
        );
    }

    let mut volume_rows_html = String::new();
    for volume in &volume_rows {
        let _ = write!(
            volume_rows_html,
            "<tr><td><code>{}</code></td><td>{}</td><td>{}</td><td>{}</td><td>{}</td><td>{} / {}</td><td>{}</td><td>{}</td><td>{}</td></tr>",
            volume.id,
            escape_html(&volume.collection),
            escape_html(&volume.disk_type),
            bytes_to_human_readable(volume.size),
            volume.file_count,
            volume.delete_count,
            bytes_to_human_readable(volume.deleted_byte_count),
            escape_html(&volume.ttl),
            volume.read_only,
            volume.version,
        );
    }

    let remote_section = if remote_volume_rows.is_empty() {
        String::new()
    } else {
        let mut remote_rows_html = String::new();
        for volume in &remote_volume_rows {
            let _ = write!(
                remote_rows_html,
                "<tr><td><code>{}</code></td><td>{}</td><td>{}</td><td>{}</td><td>{} / {}</td><td>{}</td><td>{}</td></tr>",
                volume.id,
                escape_html(&volume.collection),
                bytes_to_human_readable(volume.size),
                volume.file_count,
                volume.delete_count,
                bytes_to_human_readable(volume.deleted_byte_count),
                escape_html(&volume.remote_storage_name),
                escape_html(&volume.remote_storage_key),
            );
        }
        format!(
            r#"<div class="row">
<h2>Remote Volumes</h2>
<table class="table table-striped">
<thead>
<tr>
<th>Id</th>
<th>Collection</th>
<th>Size</th>
<th>Files</th>
<th>Trash</th>
<th>Remote</th>
<th>Key</th>
</tr>
</thead>
<tbody>{}</tbody>
</table>
</div>"#,
            remote_rows_html
        )
    };

    let ec_section = if ec_volume_rows.is_empty() {
        String::new()
    } else {
        let mut ec_rows_html = String::new();
        for ec in &ec_volume_rows {
            let mut shard_labels = String::new();
            for shard in &ec.shards {
                let _ = write!(
                    shard_labels,
                    "<span class=\"label label-info\" style=\"margin-right: 5px;\">{}: {}</span>",
                    shard.shard_id,
                    bytes_to_human_readable(shard.size)
                );
            }
            let _ = write!(
                ec_rows_html,
                "<tr><td><code>{}</code></td><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>",
                ec.volume_id,
                escape_html(&ec.collection),
                bytes_to_human_readable(ec.size),
                shard_labels,
                escape_html(&ec.created_at),
            );
        }
        format!(
            r#"<div class="row">
<h2>Erasure Coding Shards</h2>
<table class="table table-striped">
<thead>
<tr>
<th>Id</th>
<th>Collection</th>
<th>Total Size</th>
<th>Shard Details</th>
<th>CreatedAt</th>
</tr>
</thead>
<tbody>{}</tbody>
</table>
</div>"#,
            ec_rows_html
        )
    };

    format!(
        r#"<!DOCTYPE html>
<html>
<head>
<title>SeaweedFS {version}</title>
<link rel="stylesheet" href="/seaweedfsstatic/bootstrap/3.3.1/css/bootstrap.min.css">
<script type="text/javascript" src="/seaweedfsstatic/javascript/jquery-3.6.0.min.js"></script>
<script type="text/javascript" src="/seaweedfsstatic/javascript/jquery-sparklines/2.1.2/jquery.sparkline.min.js"></script>
<script type="text/javascript">
$(function () {{
var periods = ['second', 'minute', 'hour', 'day'];
for (var i = 0; i < periods.length; i++) {{
var period = periods[i];
$('.inlinesparkline-' + period).sparkline('html', {{
type: 'line',
barColor: 'red',
tooltipSuffix: ' request per ' + period
}});
}}
}});
</script>
<style>
#jqstooltip {{
height: 28px !important;
width: 150px !important;
}}
</style>
</head>
<body>
<div class="container">
<div class="page-header">
<h1>
<a href="https://github.com/seaweedfs/seaweedfs"><img src="/seaweedfsstatic/seaweed50x50.png" alt="SeaweedFS"></a>
SeaweedFS <small>{version}</small>
</h1>
</div>

<div class="row">
<div class="col-sm-6">
<h2>Disk Stats</h2>
<table class="table table-striped">
<thead>
<tr>
<th>Path</th>
<th>Disk</th>
<th>Total</th>
<th>Free</th>
<th>Usage</th>
</tr>
</thead>
<tbody>{disk_rows_html}</tbody>
</table>
</div>

<div class="col-sm-6">
<h2>System Stats</h2>
<table class="table table-condensed table-striped">
<tr><th>Masters</th><td>{masters}</td></tr>
<tr><th>Weekly # ReadRequests</th><td><span class="inlinesparkline-day">{read_week}</span></td></tr>
<tr><th>Daily # ReadRequests</th><td><span class="inlinesparkline-hour">{read_day}</span></td></tr>
<tr><th>Hourly # ReadRequests</th><td><span class="inlinesparkline-minute">{read_hour}</span></td></tr>
<tr><th>Last Minute # ReadRequests</th><td><span class="inlinesparkline-second">{read_minute}</span></td></tr>
<tr><th>Up Time</th><td>{uptime}</td></tr>
</table>
</div>
</div>

<div class="row">
<h2>Volumes</h2>
<table class="table table-striped">
<thead>
<tr>
<th>Id</th>
<th>Collection</th>
<th>Disk</th>
<th>Data Size</th>
<th>Files</th>
<th>Trash</th>
<th>TTL</th>
<th>ReadOnly</th>
<th>Version</th>
</tr>
</thead>
<tbody>{volume_rows_html}</tbody>
</table>
</div>

{remote_section}
{ec_section}
</div>
</body>
</html>"#,
        version = escape_html(crate::version::version()),
        disk_rows_html = disk_rows_html,
        masters = escape_html(&masters),
        read_week = read_week,
        read_day = read_day,
        read_hour = read_hour,
        read_minute = read_minute,
        uptime = escape_html(&uptime),
        volume_rows_html = volume_rows_html,
        remote_section = remote_section,
        ec_section = ec_section,
    )
}

fn collect_ui_data(
    store: &Store,
) -> (
    Vec<UiDiskRow>,
    Vec<UiVolumeRow>,
    Vec<UiVolumeRow>,
    Vec<UiEcVolumeRow>,
) {
    let mut disk_rows = Vec::new();
    let mut volumes = Vec::new();
    let mut remote_volumes = Vec::new();
    let mut ec_volumes = Vec::new();

    for loc in &store.locations {
        let dir = absolute_display_path(&loc.directory);
        let (all, free) = crate::storage::disk_location::get_disk_stats(&dir);
        disk_rows.push(UiDiskRow {
            dir,
            disk_type: loc.disk_type.to_string(),
            all,
            free,
            used: all.saturating_sub(free),
        });

        for (_, volume) in loc.volumes() {
            let (remote_storage_name, remote_storage_key) = volume.remote_storage_name_key();
            let row = UiVolumeRow {
                id: volume.id.0,
                collection: volume.collection.clone(),
                disk_type: loc.disk_type.to_string(),
                size: volume.content_size(),
                file_count: volume.file_count(),
                delete_count: volume.deleted_count(),
                deleted_byte_count: volume.deleted_size(),
                ttl: volume.super_block.ttl.to_string(),
                read_only: volume.is_read_only(),
                version: volume.version().0 as u32,
                remote_storage_name,
                remote_storage_key,
            };
            if row.remote_storage_name.is_empty() {
                volumes.push(row);
            } else {
                remote_volumes.push(row);
            }
        }

        for (_, ec_volume) in loc.ec_volumes() {
            let mut shards = Vec::new();
            let mut total_size = 0u64;
            let mut created_at = String::from("-");
            for shard in ec_volume.shards.iter().flatten() {
                let shard_size = shard.file_size().max(0) as u64;
                total_size = total_size.saturating_add(shard_size);
                shards.push(UiEcShardRow {
                    shard_id: shard.shard_id,
                    size: shard_size,
                });
                if created_at == "-" {
                    if let Ok(metadata) = std::fs::metadata(shard.file_name()) {
                        if let Ok(modified) = metadata.modified() {
                            let ts: chrono::DateTime<chrono::Local> = modified.into();
                            created_at = ts.format("%Y-%m-%d %H:%M").to_string();
                        }
                    }
                }
            }
            let preferred_size = ec_volume.dat_file_size.max(0) as u64;
            ec_volumes.push(UiEcVolumeRow {
                volume_id: ec_volume.volume_id.0,
                collection: ec_volume.collection.clone(),
                size: preferred_size.max(total_size),
                shards,
                created_at,
            });
        }
    }

    disk_rows.sort_by(|left, right| left.dir.cmp(&right.dir));
    volumes.sort_by_key(|row| row.id);
    remote_volumes.sort_by_key(|row| row.id);
    ec_volumes.sort_by_key(|row| row.volume_id);

    (disk_rows, volumes, remote_volumes, ec_volumes)
}

fn absolute_display_path(path: &str) -> String {
    let p = std::path::Path::new(path);
    if p.is_absolute() {
        return path.to_string();
    }
    std::env::current_dir()
        .map(|cwd| cwd.join(p).to_string_lossy().to_string())
        .unwrap_or_else(|_| path.to_string())
}

fn join_i64(values: &[i64]) -> String {
    values
        .iter()
        .map(std::string::ToString::to_string)
        .collect::<Vec<_>>()
        .join(",")
}

fn percent_from(total: u64, part: u64) -> f64 {
    if total == 0 {
        return 0.0;
    }
    (part as f64 / total as f64) * 100.0
}

fn bytes_to_human_readable(bytes: u64) -> String {
    const UNIT: u64 = 1024;
    if bytes < UNIT {
        return format!("{} B", bytes);
    }

    let mut div = UNIT;
    let mut exp = 0usize;
    let mut n = bytes / UNIT;
    while n >= UNIT {
        div *= UNIT;
        n /= UNIT;
        exp += 1;
    }

    format!(
        "{:.2} {}iB",
        bytes as f64 / div as f64,
        ["K", "M", "G", "T", "P", "E"][exp]
    )
}
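The size formatter above is 1024-based: values below 1 KiB print as a plain byte count, larger values print with two decimals and a KiB/MiB/... suffix chosen by repeatedly dividing by 1024. This standalone sketch duplicates the logic (the free function `human` is illustrative only) so the edge cases can be checked:

```rust
// Standalone sketch of bytes_to_human_readable's 1024-based formatting.
fn human(bytes: u64) -> String {
    const UNIT: u64 = 1024;
    if bytes < UNIT {
        return format!("{} B", bytes);
    }
    // Find the largest power of 1024 not exceeding `bytes`.
    let mut div = UNIT;
    let mut exp = 0usize;
    let mut n = bytes / UNIT;
    while n >= UNIT {
        div *= UNIT;
        n /= UNIT;
        exp += 1;
    }
    format!(
        "{:.2} {}iB",
        bytes as f64 / div as f64,
        ["K", "M", "G", "T", "P", "E"][exp]
    )
}

fn main() {
    assert_eq!(human(512), "512 B");
    assert_eq!(human(1536), "1.50 KiB");
    assert_eq!(human(3 * 1024 * 1024), "3.00 MiB");
    println!("{}", human(1536));
}
```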

fn escape_html(input: &str) -> String {
    input
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}
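One ordering detail in `escape_html` is worth noting: `&` must be replaced first, otherwise the `&lt;` produced for `<` would itself be corrupted into `&amp;lt;` by a later pass. A standalone sketch (the free function `escape` is illustrative only):

```rust
// Hedged sketch of the escaping done by escape_html. Replacing '&'
// first is essential: doing it last would double-escape the entities
// produced by the earlier replacements.
fn escape(input: &str) -> String {
    input
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

fn main() {
    assert_eq!(
        escape(r#"<a href="x">&"#),
        "&lt;a href=&quot;x&quot;&gt;&amp;"
    );
    println!("{}", escape("a<b & c"));
}
```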

seaweed-volume/src/server/volume_server.rs (new file, 394 lines)
@@ -0,0 +1,394 @@
//! VolumeServer: the main HTTP server for volume operations.
//!
//! Routes:
//! GET/HEAD /{vid},{fid} — read a file
//! POST/PUT /{vid},{fid} — write a file
//! DELETE /{vid},{fid} — delete a file
//! GET /status — server status
//! GET /healthz — health check
//!
//! Matches Go's server/volume_server.go.

use std::net::SocketAddr;
use std::sync::atomic::{AtomicBool, AtomicI64, AtomicU32, Ordering};
use std::sync::{Arc, RwLock};

use axum::{
    extract::{connect_info::ConnectInfo, Request, State},
    http::{header, HeaderValue, Method, StatusCode},
    middleware::{self, Next},
    response::{IntoResponse, Response},
    routing::{any, get},
    Router,
};

use crate::config::ReadMode;
use crate::security::Guard;
use crate::storage::store::Store;

use super::grpc_client::OutgoingGrpcTlsConfig;
use super::handlers;
use super::write_queue::WriteQueue;

#[derive(Clone, Debug, Default)]
pub struct RuntimeMetricsConfig {
    pub push_gateway: crate::metrics::PushGatewayConfig,
}

/// Shared state for the volume server.
pub struct VolumeServerState {
    pub store: RwLock<Store>,
    pub guard: RwLock<Guard>,
    pub is_stopping: RwLock<bool>,
    /// Maintenance mode flag.
    pub maintenance: AtomicBool,
    /// State version — incremented on each SetState call.
    pub state_version: AtomicU32,
    /// Throttling: concurrent upload/download limits (in bytes, 0 = disabled).
    pub concurrent_upload_limit: i64,
    pub concurrent_download_limit: i64,
    pub inflight_upload_data_timeout: std::time::Duration,
    pub inflight_download_data_timeout: std::time::Duration,
    /// Current in-flight upload/download bytes.
    pub inflight_upload_bytes: AtomicI64,
    pub inflight_download_bytes: AtomicI64,
    /// Notify waiters when inflight bytes decrease.
    pub upload_notify: tokio::sync::Notify,
    pub download_notify: tokio::sync::Notify,
    /// Data center name from config.
    pub data_center: String,
    /// Rack name from config.
    pub rack: String,
    /// File size limit in bytes (0 = no limit).
    pub file_size_limit_bytes: i64,
    /// Default IO rate limit for maintenance copy/replication work.
    pub maintenance_byte_per_second: i64,
    /// Whether the server is connected to master (heartbeat active).
    pub is_heartbeating: AtomicBool,
    /// Whether master addresses are configured.
    pub has_master: bool,
    /// Seconds to wait before shutting down servers (graceful drain).
    pub pre_stop_seconds: u32,
    /// Notify heartbeat to send an immediate update when volume state changes.
    pub volume_state_notify: tokio::sync::Notify,
    /// Optional batched write queue for improved throughput under load.
    pub write_queue: std::sync::OnceLock<WriteQueue>,
    /// Registry of S3 tier backends for tiered storage operations.
    pub s3_tier_registry: std::sync::RwLock<crate::remote_storage::s3_tier::S3TierRegistry>,
    /// Read mode: local, proxy, or redirect for non-local volumes.
    pub read_mode: ReadMode,
    /// First master address for volume lookups (e.g., "localhost:9333").
    pub master_url: String,
    /// Seed master addresses for UI rendering.
    pub master_urls: Vec<String>,
    /// This server's own address (ip:port) for filtering self from lookup results.
    pub self_url: String,
    /// HTTP client for proxy requests and master lookups.
    pub http_client: reqwest::Client,
    /// Scheme used for outgoing master and peer HTTP requests ("http" or "https").
    pub outgoing_http_scheme: String,
    /// Optional client TLS material for outgoing gRPC connections.
    pub outgoing_grpc_tls: Option<OutgoingGrpcTlsConfig>,
    /// Metrics push settings learned from master heartbeat responses.
    pub metrics_runtime: std::sync::RwLock<RuntimeMetricsConfig>,
    pub metrics_notify: tokio::sync::Notify,
    /// Whether JPEG uploads should be normalized using EXIF orientation.
    pub fix_jpg_orientation: bool,
    /// Read tuning flags for large-file streaming.
    pub has_slow_read: bool,
    pub read_buffer_size_bytes: usize,
    /// Path to security.toml — stored for SIGHUP reload.
    pub security_file: String,
    /// Original CLI whitelist entries — stored for SIGHUP reload.
    pub cli_white_list: Vec<String>,
    /// Path to state.pb file for persisting VolumeServerState across restarts.
    pub state_file_path: String,
}

impl VolumeServerState {
    /// Check if the server is in maintenance mode; return gRPC error if so.
    pub fn check_maintenance(&self) -> Result<(), tonic::Status> {
        if self.maintenance.load(Ordering::Relaxed) {
            let id = self.store.read().unwrap().id.clone();
            return Err(tonic::Status::unavailable(format!(
                "volume server {} is in maintenance mode",
                id
            )));
        }
        Ok(())
    }
}

pub fn build_metrics_router() -> Router {
    Router::new().route("/metrics", get(handlers::metrics_handler))
}

pub fn normalize_outgoing_http_url(scheme: &str, raw_target: &str) -> Result<String, String> {
    if raw_target.starts_with("http://") || raw_target.starts_with("https://") {
        let mut url = reqwest::Url::parse(raw_target)
            .map_err(|e| format!("invalid url {}: {}", raw_target, e))?;
        url.set_scheme(scheme)
            .map_err(|_| format!("invalid scheme {}", scheme))?;
        return Ok(url.to_string());
    }
    Ok(format!("{}://{}", scheme, raw_target))
}
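`normalize_outgoing_http_url` has two branches: a target that already carries an `http://` or `https://` scheme is re-parsed and has its scheme rewritten, while a bare `host:port` simply gets the configured scheme prepended. This dependency-free sketch exercises only the second branch (the URL-parsing branch is stubbed out here since it relies on `reqwest::Url`):

```rust
// Hedged sketch of the fallback branch of normalize_outgoing_http_url:
// a target without an explicit scheme gets "{scheme}://" prepended.
// The scheme-rewriting branch is stubbed to keep this dependency-free.
fn normalize(scheme: &str, raw_target: &str) -> String {
    if raw_target.starts_with("http://") || raw_target.starts_with("https://") {
        // The real implementation re-parses the URL and swaps its scheme.
        return raw_target.to_string();
    }
    format!("{}://{}", scheme, raw_target)
}

fn main() {
    assert_eq!(normalize("https", "localhost:9333"), "https://localhost:9333");
    assert_eq!(normalize("http", "10.0.0.1:8080"), "http://10.0.0.1:8080");
    println!("{}", normalize("http", "10.0.0.1:8080"));
}
```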

fn request_remote_addr(request: &Request) -> Option<SocketAddr> {
    request
        .extensions()
        .get::<ConnectInfo<SocketAddr>>()
        .map(|info| info.0)
}

fn request_is_whitelisted(state: &VolumeServerState, request: &Request) -> bool {
    request_remote_addr(request)
        .map(|remote_addr| {
            state
                .guard
                .read()
                .unwrap()
                .check_whitelist(&remote_addr.to_string())
        })
        .unwrap_or(true)
}

/// Middleware: set Server header, echo x-amz-request-id, set CORS if Origin present.
async fn common_headers_middleware(request: Request, next: Next) -> Response {
    let origin = request.headers().get("origin").cloned();
    let request_id = super::request_id::generate_http_request_id();

    let mut response = super::request_id::scope_request_id(
        request_id.clone(),
        async move { next.run(request).await },
    )
    .await;

    let headers = response.headers_mut();
    if let Ok(val) = HeaderValue::from_str(crate::version::server_header()) {
        headers.insert("Server", val);
    }

    if let Ok(val) = HeaderValue::from_str(&request_id) {
        headers.insert("X-Request-Id", val.clone());
        headers.insert("x-amz-request-id", val);
    }

    if origin.is_some() {
        headers.insert("Access-Control-Allow-Origin", HeaderValue::from_static("*"));
        headers.insert(
            "Access-Control-Allow-Credentials",
            HeaderValue::from_static("true"),
        );
    }

    response
}

/// Admin store handler — dispatches based on HTTP method.
/// Matches Go's privateStoreHandler: GET/HEAD → read, POST/PUT → write,
/// DELETE → delete, OPTIONS → CORS headers, anything else → 400.
async fn admin_store_handler(state: State<Arc<VolumeServerState>>, request: Request) -> Response {
    let start = std::time::Instant::now();
    let method = request.method().clone();
    let mut method_str = method.as_str().to_string();
    let request_bytes = request
        .headers()
        .get(header::CONTENT_LENGTH)
        .and_then(|value| value.to_str().ok())
        .and_then(|value| value.parse::<i64>().ok())
        .filter(|value| *value > 0)
        .unwrap_or(0);
    super::server_stats::record_request_open();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .inc();
    let whitelist_rejected = matches!(method, Method::POST | Method::PUT | Method::DELETE)
        && !request_is_whitelisted(&state, &request);
    let response = match method.clone() {
        _ if whitelist_rejected => StatusCode::UNAUTHORIZED.into_response(),
        Method::GET | Method::HEAD => {
            super::server_stats::record_read_request();
            handlers::get_or_head_handler_from_request(state, request).await
        }
        Method::POST | Method::PUT => {
            super::server_stats::record_write_request();
            if request_bytes > 0 {
                super::server_stats::record_bytes_in(request_bytes);
            }
            handlers::post_handler(state, request).await
        }
        Method::DELETE => {
            super::server_stats::record_delete_request();
            handlers::delete_handler(state, request).await
        }
        Method::OPTIONS => {
            super::server_stats::record_read_request();
            admin_options_response()
        }
        _ => {
            let method_name = request.method().to_string();
            let query = request.uri().query().map(|q| q.to_string());
            method_str = "INVALID".to_string();
            handlers::json_error_with_query(
                StatusCode::BAD_REQUEST,
                format!("unsupported method {}", method_name),
                query.as_deref(),
            )
        }
    };
    if method == Method::GET {
        if let Some(response_bytes) = response
            .headers()
            .get(header::CONTENT_LENGTH)
            .and_then(|value| value.to_str().ok())
            .and_then(|value| value.parse::<i64>().ok())
            .filter(|value| *value > 0)
        {
            super::server_stats::record_bytes_out(response_bytes);
        }
    }
    super::server_stats::record_request_close();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .dec();
    crate::metrics::REQUEST_COUNTER
        .with_label_values(&[&method_str, response.status().as_str()])
        .inc();
    crate::metrics::REQUEST_DURATION
        .with_label_values(&[&method_str])
        .observe(start.elapsed().as_secs_f64());
    response
}

/// Public store handler — dispatches based on HTTP method.
/// Matches Go's publicReadOnlyHandler: GET/HEAD → read, OPTIONS → CORS,
/// anything else → 200 (passthrough no-op).
async fn public_store_handler(state: State<Arc<VolumeServerState>>, request: Request) -> Response {
    let start = std::time::Instant::now();
    let method = request.method().clone();
    let method_str = method.as_str().to_string();
    super::server_stats::record_request_open();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .inc();
    let response = match method.clone() {
        Method::GET | Method::HEAD => {
            super::server_stats::record_read_request();
            handlers::get_or_head_handler_from_request(state, request).await
        }
        Method::OPTIONS => {
            super::server_stats::record_read_request();
            public_options_response()
        }
        _ => StatusCode::OK.into_response(),
    };
    if method == Method::GET {
        if let Some(response_bytes) = response
            .headers()
            .get(header::CONTENT_LENGTH)
            .and_then(|value| value.to_str().ok())
            .and_then(|value| value.parse::<i64>().ok())
            .filter(|value| *value > 0)
        {
            super::server_stats::record_bytes_out(response_bytes);
        }
    }
    super::server_stats::record_request_close();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .dec();
    crate::metrics::REQUEST_COUNTER
        .with_label_values(&[&method_str, response.status().as_str()])
        .inc();
    crate::metrics::REQUEST_DURATION
        .with_label_values(&[&method_str])
        .observe(start.elapsed().as_secs_f64());
    response
}

/// Build OPTIONS response for admin port.
fn admin_options_response() -> Response {
    let mut response = StatusCode::OK.into_response();
|
||||
let headers = response.headers_mut();
|
||||
headers.insert(
|
||||
"Access-Control-Allow-Methods",
|
||||
HeaderValue::from_static("PUT, POST, GET, DELETE, OPTIONS"),
|
||||
);
|
||||
headers.insert(
|
||||
"Access-Control-Allow-Headers",
|
||||
HeaderValue::from_static("*"),
|
||||
);
|
||||
response
|
||||
}
|
||||
|
||||
/// Build OPTIONS response for public port.
|
||||
fn public_options_response() -> Response {
|
||||
let mut response = StatusCode::OK.into_response();
|
||||
let headers = response.headers_mut();
|
||||
headers.insert(
|
||||
"Access-Control-Allow-Methods",
|
||||
HeaderValue::from_static("GET, OPTIONS"),
|
||||
);
|
||||
headers.insert(
|
||||
"Access-Control-Allow-Headers",
|
||||
HeaderValue::from_static("*"),
|
||||
);
|
||||
response
|
||||
}
|
||||
|
||||
/// Build the admin (private) HTTP router — supports all operations.
|
||||
/// UI route is only registered when no signing keys are configured,
|
||||
/// matching Go's `if signingKey == "" || enableUiAccess` check.
|
||||
pub fn build_admin_router(state: Arc<VolumeServerState>) -> Router {
|
||||
let guard = state.guard.read().unwrap();
|
||||
// This helper can only derive the default Go behavior from the guard state:
|
||||
// UI stays enabled when the write signing key is empty. The explicit
|
||||
// `access.ui` override is handled by `build_admin_router_with_ui(...)`.
|
||||
let ui_enabled = guard.signing_key.0.is_empty();
|
||||
drop(guard);
|
||||
build_admin_router_with_ui(state, ui_enabled)
|
||||
}
|
||||
|
||||
/// Build the admin router with an explicit UI exposure flag.
|
||||
pub fn build_admin_router_with_ui(state: Arc<VolumeServerState>, ui_enabled: bool) -> Router {
|
||||
let mut router = Router::new()
|
||||
.route("/status", get(handlers::status_handler))
|
||||
.route("/healthz", get(handlers::healthz_handler))
|
||||
.route("/favicon.ico", get(handlers::favicon_handler))
|
||||
.route(
|
||||
"/seaweedfsstatic/*path",
|
||||
get(handlers::static_asset_handler),
|
||||
)
|
||||
.route("/", any(admin_store_handler))
|
||||
.route("/:path", any(admin_store_handler))
|
||||
.route("/:vid/:fid", any(admin_store_handler))
|
||||
.route("/:vid/:fid/:filename", any(admin_store_handler))
|
||||
.fallback(admin_store_handler);
|
||||
if ui_enabled {
|
||||
// Note: /stats/* endpoints are commented out in Go's volume_server.go (L130-134).
|
||||
// Only the UI endpoint is registered when UI access is enabled.
|
||||
router = router.route("/ui/index.html", get(handlers::ui_handler));
|
||||
}
|
||||
router
|
||||
.layer(middleware::from_fn(common_headers_middleware))
|
||||
.with_state(state)
|
||||
}
|
||||
|
||||
/// Build the public (read-only) HTTP router — only GET/HEAD.
|
||||
pub fn build_public_router(state: Arc<VolumeServerState>) -> Router {
|
||||
Router::new()
|
||||
.route("/favicon.ico", get(handlers::favicon_handler))
|
||||
.route(
|
||||
"/seaweedfsstatic/*path",
|
||||
get(handlers::static_asset_handler),
|
||||
)
|
||||
.route("/", any(public_store_handler))
|
||||
.route("/:path", any(public_store_handler))
|
||||
.route("/:vid/:fid", any(public_store_handler))
|
||||
.route("/:vid/:fid/:filename", any(public_store_handler))
|
||||
.fallback(public_store_handler)
|
||||
.layer(middleware::from_fn(common_headers_middleware))
|
||||
.with_state(state)
|
||||
}
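Both store handlers above account `bytes_out` only for GET responses with a positive, parseable Content-Length. A standalone sketch of just that parsing step (a hypothetical helper for illustration; the real code reads the header from an axum `Response` and feeds the value to `record_bytes_out`):

```rust
// Sketch of the Content-Length accounting filter used by the store handlers.
// Unparsable or non-positive values are treated as "nothing to record".
fn parse_bytes_out(content_length: Option<&str>) -> Option<i64> {
    content_length
        .and_then(|value| value.parse::<i64>().ok())
        .filter(|value| *value > 0)
}
```

Treating `0` and garbage the same way means zero-length and chunked responses simply skip the counter rather than erroring.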

seaweed-volume/src/server/write_queue.rs (new file, 330 lines)
@@ -0,0 +1,330 @@
//! Async batched write processing for the volume server.
//!
//! Instead of each upload handler directly calling `write_needle` and syncing,
//! writes are submitted to a queue. A background worker drains the queue in
//! batches (up to 128 entries), groups them by volume ID, processes them
//! together, and syncs once per volume for the entire batch.

use std::sync::Arc;

use tokio::sync::{mpsc, oneshot};
use tracing::debug;

use crate::storage::needle::needle::Needle;
use crate::storage::types::{Size, VolumeId};
use crate::storage::volume::VolumeError;

use super::volume_server::VolumeServerState;

/// Result of a single write operation: (offset, size, is_unchanged).
pub type WriteResult = Result<(u64, Size, bool), VolumeError>;

/// A request to write a needle, submitted to the write queue.
pub struct WriteRequest {
    pub volume_id: VolumeId,
    pub needle: Needle,
    pub response_tx: oneshot::Sender<WriteResult>,
}

/// Maximum number of write requests to batch together.
const MAX_BATCH_SIZE: usize = 128;

/// Maximum bytes to accumulate per batch before breaking (matches Go's 4MB limit).
/// This prevents large writes from accumulating unbounded latency.
const MAX_BATCH_BYTES: usize = 4 * 1024 * 1024;

/// Handle for submitting write requests to the background worker.
#[derive(Clone)]
pub struct WriteQueue {
    tx: mpsc::Sender<WriteRequest>,
}

impl WriteQueue {
    /// Create a new write queue and spawn the background worker.
    ///
    /// `capacity` controls the channel buffer size (backpressure kicks in when full).
    /// The worker holds a reference to `state` for accessing the store.
    pub fn new(state: Arc<VolumeServerState>, capacity: usize) -> Self {
        let (tx, rx) = mpsc::channel(capacity);
        let worker = WriteQueueWorker { rx, state };
        tokio::spawn(worker.run());
        WriteQueue { tx }
    }

    /// Submit a write request and wait for the result.
    ///
    /// Returns `Err` if the worker has shut down or the response channel was dropped.
    pub async fn submit(&self, volume_id: VolumeId, needle: Needle) -> WriteResult {
        let (response_tx, response_rx) = oneshot::channel();
        let request = WriteRequest {
            volume_id,
            needle,
            response_tx,
        };

        // Send to queue; this awaits if the channel is full (backpressure).
        if self.tx.send(request).await.is_err() {
            return Err(VolumeError::Io(std::io::Error::new(
                std::io::ErrorKind::BrokenPipe,
                "write queue worker has shut down",
            )));
        }

        // Wait for the worker to process our request.
        match response_rx.await {
            Ok(result) => result,
            Err(_) => Err(VolumeError::Io(std::io::Error::new(
                std::io::ErrorKind::BrokenPipe,
                "write queue worker dropped response channel",
            ))),
        }
    }
}

/// Background worker that drains write requests and processes them in batches.
struct WriteQueueWorker {
    rx: mpsc::Receiver<WriteRequest>,
    state: Arc<VolumeServerState>,
}

impl WriteQueueWorker {
    async fn run(mut self) {
        debug!("write queue worker started");

        loop {
            // Wait for the first request (blocks until one arrives or the channel closes).
            let first = match self.rx.recv().await {
                Some(req) => req,
                None => {
                    debug!("write queue channel closed, worker exiting");
                    return;
                }
            };

            // Drain as many additional requests as available, up to MAX_BATCH_SIZE
            // or MAX_BATCH_BYTES (matches Go: 128 requests or 4MB, whichever comes first).
            let mut batch = Vec::with_capacity(MAX_BATCH_SIZE);
            let mut batch_bytes: usize = first.needle.data.len();
            batch.push(first);

            while batch.len() < MAX_BATCH_SIZE && batch_bytes < MAX_BATCH_BYTES {
                match self.rx.try_recv() {
                    Ok(req) => {
                        batch_bytes += req.needle.data.len();
                        batch.push(req);
                    }
                    Err(_) => break,
                }
            }

            let batch_size = batch.len();
            debug!("processing write batch of {} requests", batch_size);

            // Process the batch in spawn_blocking since write_needle does file I/O.
            let state = self.state.clone();
            let _ = tokio::task::spawn_blocking(move || {
                process_batch(state, batch);
            })
            .await;
        }
    }
}

/// Process a batch of write requests, grouped by volume ID.
///
/// Groups writes by volume to minimize the number of store lock acquisitions,
/// then sends results back via each request's oneshot channel.
fn process_batch(state: Arc<VolumeServerState>, batch: Vec<WriteRequest>) {
    // Group requests by volume ID for efficient processing.
    // We use a Vec of (VolumeId, Vec<(Needle, Sender)>) to preserve order
    // and avoid requiring Hash on VolumeId.
    let mut groups: Vec<(VolumeId, Vec<(Needle, oneshot::Sender<WriteResult>)>)> = Vec::new();

    for req in batch {
        let vid = req.volume_id;
        if let Some(group) = groups.iter_mut().find(|(v, _)| *v == vid) {
            group.1.push((req.needle, req.response_tx));
        } else {
            groups.push((vid, vec![(req.needle, req.response_tx)]));
        }
    }

    // Process all volume groups under a single store lock acquisition.
    let mut store = state.store.write().unwrap();

    for (vid, entries) in groups {
        for (mut needle, response_tx) in entries {
            let result = store.write_volume_needle(vid, &mut needle);
            // Send the result back; ignore the error if the receiver was dropped.
            let _ = response_tx.send(result);
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::storage::types::VolumeId;

    /// Helper to create a minimal VolumeServerState for testing.
    fn make_test_state() -> Arc<VolumeServerState> {
        use crate::security::{Guard, SigningKey};
        use crate::server::volume_server::RuntimeMetricsConfig;
        use crate::storage::needle_map::NeedleMapKind;
        use crate::storage::store::Store;
        use std::sync::atomic::{AtomicBool, AtomicI64, AtomicU32};
        use std::sync::RwLock;

        let store = Store::new(NeedleMapKind::InMemory);
        let guard = Guard::new(&[], SigningKey(vec![]), 0, SigningKey(vec![]), 0);

        Arc::new(VolumeServerState {
            store: RwLock::new(store),
            guard: RwLock::new(guard),
            is_stopping: RwLock::new(false),
            maintenance: AtomicBool::new(false),
            state_version: AtomicU32::new(0),
            concurrent_upload_limit: 0,
            concurrent_download_limit: 0,
            inflight_upload_data_timeout: std::time::Duration::ZERO,
            inflight_download_data_timeout: std::time::Duration::ZERO,
            inflight_upload_bytes: AtomicI64::new(0),
            inflight_download_bytes: AtomicI64::new(0),
            upload_notify: tokio::sync::Notify::new(),
            download_notify: tokio::sync::Notify::new(),
            data_center: String::new(),
            rack: String::new(),
            file_size_limit_bytes: 0,
            maintenance_byte_per_second: 0,
            is_heartbeating: AtomicBool::new(false),
            has_master: false,
            pre_stop_seconds: 0,
            volume_state_notify: tokio::sync::Notify::new(),
            write_queue: std::sync::OnceLock::new(),
            s3_tier_registry: std::sync::RwLock::new(
                crate::remote_storage::s3_tier::S3TierRegistry::new(),
            ),
            read_mode: crate::config::ReadMode::Local,
            master_url: String::new(),
            master_urls: Vec::new(),
            self_url: String::new(),
            http_client: reqwest::Client::new(),
            outgoing_http_scheme: "http".to_string(),
            outgoing_grpc_tls: None,
            metrics_runtime: std::sync::RwLock::new(RuntimeMetricsConfig::default()),
            metrics_notify: tokio::sync::Notify::new(),
            fix_jpg_orientation: false,
            has_slow_read: true,
            read_buffer_size_bytes: 4 * 1024 * 1024,
            security_file: String::new(),
            cli_white_list: vec![],
            state_file_path: String::new(),
        })
    }

    #[tokio::test]
    async fn test_write_queue_submit_no_volume() {
        // Submit a write to a non-existent volume -- should return VolumeError::NotFound.
        let state = make_test_state();
        let queue = WriteQueue::new(state, MAX_BATCH_SIZE);

        let needle = Needle {
            id: 1.into(),
            cookie: 0x12345678.into(),
            data: vec![1, 2, 3],
            data_size: 3,
            ..Needle::default()
        };

        let result = queue.submit(VolumeId(999), needle).await;
        assert!(result.is_err());
        match result {
            Err(VolumeError::NotFound) => {} // expected
            other => panic!("expected NotFound, got {:?}", other),
        }
    }

    #[tokio::test]
    async fn test_write_queue_concurrent_submissions() {
        // Submit multiple concurrent writes -- all should complete (with errors since no volume).
        let state = make_test_state();
        let queue = WriteQueue::new(state, MAX_BATCH_SIZE);

        let mut handles = Vec::new();
        for i in 0..10u64 {
            let q = queue.clone();
            handles.push(tokio::spawn(async move {
                let needle = Needle {
                    id: i.into(),
                    cookie: 0xABCD.into(),
                    data: vec![i as u8; 10],
                    data_size: 10,
                    ..Needle::default()
                };
                q.submit(VolumeId(1), needle).await
            }));
        }

        for handle in handles {
            let result = handle.await.unwrap();
            // All should fail with NotFound since there's no volume 1.
            assert!(matches!(result, Err(VolumeError::NotFound)));
        }
    }

    #[tokio::test]
    async fn test_write_queue_batching() {
        // Verify that many concurrent writes get processed (testing the batching path).
        let state = make_test_state();
        let queue = WriteQueue::new(state, MAX_BATCH_SIZE);

        // Submit MAX_BATCH_SIZE requests concurrently.
        let mut handles = Vec::new();
        for i in 0..MAX_BATCH_SIZE as u64 {
            let q = queue.clone();
            handles.push(tokio::spawn(async move {
                let needle = Needle {
                    id: i.into(),
                    cookie: 0x1111.into(),
                    data: vec![0u8; 4],
                    data_size: 4,
                    ..Needle::default()
                };
                q.submit(VolumeId(42), needle).await
            }));
        }

        let mut results = Vec::new();
        for handle in handles {
            results.push(handle.await.unwrap());
        }

        // All should complete (with NotFound errors since no volume exists).
        assert_eq!(results.len(), MAX_BATCH_SIZE);
        for r in results {
            assert!(matches!(r, Err(VolumeError::NotFound)));
        }
    }

    #[tokio::test]
    async fn test_write_queue_dropped_sender() {
        // Dropping one queue handle must not stop the worker: clones keep working.
        let state = make_test_state();
        let queue = WriteQueue::new(state, 1);

        // Clone then drop the original -- the worker keeps running via its rx handle.
        let queue2 = queue.clone();
        drop(queue);

        // This should still work since the worker is alive.
        let needle = Needle {
            id: 1.into(),
            cookie: 0.into(),
            data: vec![],
            data_size: 0,
            ..Needle::default()
        };
        let result = queue2.submit(VolumeId(1), needle).await;
        assert!(result.is_err()); // NotFound is fine -- the point is it doesn't panic
    }
}
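The grouping step in `process_batch` can be isolated for illustration. This sketch simplifies the payload to a `&str` (an assumption for brevity; the real code carries `(Needle, oneshot::Sender<WriteResult>)` pairs and a `VolumeId` newtype), but keeps the same linear-scan grouping:

```rust
// Sketch of process_batch's group-by-volume step. The linear scan preserves
// first-seen volume order and avoids requiring a Hash impl on the key type.
fn group_by_volume(batch: Vec<(u32, &'static str)>) -> Vec<(u32, Vec<&'static str>)> {
    let mut groups: Vec<(u32, Vec<&'static str>)> = Vec::new();
    for (vid, payload) in batch {
        if let Some(group) = groups.iter_mut().find(|(v, _)| *v == vid) {
            group.1.push(payload);
        } else {
            groups.push((vid, vec![payload]));
        }
    }
    groups
}
```

The O(groups) scan per request is fine here because a batch holds at most 128 entries and typically touches only a handful of volumes.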

seaweed-volume/src/storage/disk_location.rs (new file, 951 lines)
@@ -0,0 +1,951 @@
//! DiskLocation: manages volumes on a single disk/directory.
//!
//! Each DiskLocation represents one storage directory containing .dat + .idx files.
//! A Store contains one or more DiskLocations (one per configured directory).
//! Matches Go's storage/disk_location.go.

use std::collections::{HashMap, HashSet};
use std::fs;
use std::io;
use std::sync::atomic::{AtomicBool, AtomicI32, AtomicU64, Ordering};
use std::sync::Arc;

use tracing::{info, warn};

use crate::config::MinFreeSpace;
use crate::storage::erasure_coding::ec_shard::{
    EcVolumeShard, DATA_SHARDS_COUNT, ERASURE_CODING_LARGE_BLOCK_SIZE,
    ERASURE_CODING_SMALL_BLOCK_SIZE,
};
use crate::storage::erasure_coding::ec_volume::EcVolume;
use crate::storage::needle_map::NeedleMapKind;
use crate::storage::super_block::ReplicaPlacement;
use crate::storage::types::*;
use crate::storage::volume::{remove_volume_files, volume_file_name, Volume, VolumeError};

/// A single disk location managing volumes in one directory.
pub struct DiskLocation {
    pub directory: String,
    pub idx_directory: String,
    pub directory_uuid: String,
    pub disk_type: DiskType,
    pub tags: Vec<String>,
    pub max_volume_count: AtomicI32,
    pub original_max_volume_count: i32,
    volumes: HashMap<VolumeId, Volume>,
    ec_volumes: HashMap<VolumeId, EcVolume>,
    pub is_disk_space_low: Arc<AtomicBool>,
    pub available_space: AtomicU64,
    pub min_free_space: MinFreeSpace,
}

impl DiskLocation {
    const UUID_FILE_NAME: &'static str = "vol_dir.uuid";

    pub fn new(
        directory: &str,
        idx_directory: &str,
        max_volume_count: i32,
        disk_type: DiskType,
        min_free_space: MinFreeSpace,
        tags: Vec<String>,
    ) -> io::Result<Self> {
        fs::create_dir_all(directory)?;

        let idx_dir = if idx_directory.is_empty() {
            directory.to_string()
        } else {
            fs::create_dir_all(idx_directory)?;
            idx_directory.to_string()
        };
        let directory_uuid = Self::generate_directory_uuid(directory)?;

        Ok(DiskLocation {
            directory: directory.to_string(),
            idx_directory: idx_dir,
            directory_uuid,
            disk_type,
            tags,
            max_volume_count: AtomicI32::new(max_volume_count),
            original_max_volume_count: max_volume_count,
            volumes: HashMap::new(),
            ec_volumes: HashMap::new(),
            is_disk_space_low: Arc::new(AtomicBool::new(false)),
            available_space: AtomicU64::new(0),
            min_free_space,
        })
    }

    fn generate_directory_uuid(directory: &str) -> io::Result<String> {
        let path = std::path::Path::new(directory).join(Self::UUID_FILE_NAME);
        if path.exists() {
            let existing = fs::read_to_string(&path)?;
            if !existing.trim().is_empty() {
                return Ok(existing);
            }
        }

        let dir_uuid = uuid::Uuid::new_v4().to_string();
        fs::write(path, &dir_uuid)?;
        Ok(dir_uuid)
    }

    // ---- Volume management ----

    /// Load existing volumes from the directory.
    ///
    /// Matches Go's `loadExistingVolume`: checks for incomplete volumes (.note file),
    /// validates EC shards before skipping .dat loading, and cleans up stale
    /// compaction temp files (.cpd/.cpx).
    pub fn load_existing_volumes(&mut self, needle_map_kind: NeedleMapKind) -> io::Result<()> {
        // Ensure the directories exist.
        fs::create_dir_all(&self.directory)?;
        if self.directory != self.idx_directory {
            fs::create_dir_all(&self.idx_directory)?;
        }

        // Scan for .dat files.
        let entries = fs::read_dir(&self.directory)?;
        let mut dat_files: Vec<(String, VolumeId)> = Vec::new();
        let mut seen = HashSet::new();

        for entry in entries {
            let entry = entry?;
            let name = entry.file_name().into_string().unwrap_or_default();
            if let Some((collection, vid)) = parse_volume_filename(&name) {
                if seen.insert((collection.clone(), vid)) {
                    dat_files.push((collection, vid));
                }
            }
        }

        for (collection, vid) in dat_files {
            let volume_name = volume_file_name(&self.directory, &collection, vid);
            let idx_name = volume_file_name(&self.idx_directory, &collection, vid);

            // Check for an incomplete volume (a .note file means a VolumeCopy was interrupted).
            let note_path = format!("{}.note", volume_name);
            if std::path::Path::new(&note_path).exists() {
                let note = fs::read_to_string(&note_path).unwrap_or_default();
                warn!(
                    volume_id = vid.0,
                    "volume was not completed: {}, removing files", note
                );
                remove_volume_files(&volume_name);
                remove_volume_files(&idx_name);
                continue;
            }

            // If valid EC shards exist (.ecx file present), skip loading .dat.
            let ecx_path = format!("{}.ecx", idx_name);
            let ecx_exists = if std::path::Path::new(&ecx_path).exists() {
                true
            } else if self.idx_directory != self.directory {
                // .ecx may have been created before -dir.idx was configured.
                let fallback = format!("{}.ecx", volume_name);
                std::path::Path::new(&fallback).exists()
            } else {
                false
            };
            if ecx_exists {
                if self.validate_ec_volume(&collection, vid) {
                    // Valid EC volume — don't load .dat.
                    continue;
                } else {
                    warn!(
                        volume_id = vid.0,
                        "EC volume validation failed, removing incomplete EC files"
                    );
                    self.remove_ec_volume_files(&collection, vid);
                    // Fall through to load the .dat file.
                }
            }

            // Clean up stale compaction temp files.
            let cpd_path = format!("{}.cpd", volume_name);
            let cpx_path = format!("{}.cpx", idx_name);
            if std::path::Path::new(&cpd_path).exists() {
                info!(volume_id = vid.0, "removing stale compaction file .cpd");
                let _ = fs::remove_file(&cpd_path);
            }
            if std::path::Path::new(&cpx_path).exists() {
                info!(volume_id = vid.0, "removing stale compaction file .cpx");
                let _ = fs::remove_file(&cpx_path);
            }

            // Skip if already loaded (e.g., from a previous call).
            if self.volumes.contains_key(&vid) {
                continue;
            }

            match Volume::new(
                &self.directory,
                &self.idx_directory,
                &collection,
                vid,
                needle_map_kind,
                None, // replica placement read from superblock
                None, // TTL read from superblock
                0,    // no preallocate on load
                Version::current(),
            ) {
                Ok(mut v) => {
                    v.location_disk_space_low = self.is_disk_space_low.clone();
                    crate::metrics::VOLUME_GAUGE
                        .with_label_values(&[&collection, "volume"])
                        .inc();
                    self.volumes.insert(vid, v);
                }
                Err(e) => {
                    warn!(volume_id = vid.0, error = %e, "failed to load volume");
                }
            }
        }

        Ok(())
    }
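The scan above relies on `parse_volume_filename`, which is defined elsewhere in this module. As a hypothetical sketch of what it is expected to accept, a `.dat` name is either `<collection>_<vid>.dat` or plain `<vid>.dat` for the default (empty) collection; the names and behavior here are assumptions for illustration only:

```rust
// Hypothetical sketch of .dat filename parsing into (collection, volume id).
// An underscore separates collection from volume id; no underscore means the
// default collection. Non-.dat names and unparsable ids are rejected.
fn parse_dat_name(name: &str) -> Option<(String, u32)> {
    let stem = name.strip_suffix(".dat")?;
    match stem.rsplit_once('_') {
        Some((collection, vid)) => Some((collection.to_string(), vid.parse().ok()?)),
        None => Some((String::new(), stem.parse().ok()?)),
    }
}
```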

    /// Validate EC volume shards: all shards must be the same size, and if .dat
    /// exists, there must be at least DATA_SHARDS_COUNT shards matching the expected size.
    fn validate_ec_volume(&self, collection: &str, vid: VolumeId) -> bool {
        let base = volume_file_name(&self.directory, collection, vid);
        let dat_path = format!("{}.dat", base);

        let mut expected_shard_size: Option<i64> = None;
        let dat_exists = std::path::Path::new(&dat_path).exists();

        if dat_exists {
            if let Ok(meta) = fs::metadata(&dat_path) {
                expected_shard_size = Some(calculate_expected_shard_size(meta.len() as i64));
            } else {
                return false;
            }
        }

        let mut shard_count = 0usize;
        let mut actual_shard_size: Option<i64> = None;
        const MAX_SHARD_COUNT: usize = 32;

        for i in 0..MAX_SHARD_COUNT {
            let shard_path = format!("{}.ec{:02}", base, i);
            match fs::metadata(&shard_path) {
                Ok(meta) if meta.len() > 0 => {
                    let size = meta.len() as i64;
                    if let Some(prev) = actual_shard_size {
                        if size != prev {
                            warn!(
                                volume_id = vid.0,
                                shard = i,
                                size,
                                expected = prev,
                                "EC shard size mismatch"
                            );
                            return false;
                        }
                    } else {
                        actual_shard_size = Some(size);
                    }
                    shard_count += 1;
                }
                Err(e) if e.kind() != io::ErrorKind::NotFound => {
                    warn!(
                        volume_id = vid.0,
                        shard = i,
                        error = %e,
                        "failed to stat EC shard"
                    );
                    return false;
                }
                _ => {} // not found or zero size — skip
            }
        }

        // If .dat exists, validate that the shard size matches the expected size.
        if dat_exists {
            if let (Some(actual), Some(expected)) = (actual_shard_size, expected_shard_size) {
                if actual != expected {
                    warn!(
                        volume_id = vid.0,
                        actual_shard_size = actual,
                        expected_shard_size = expected,
                        "EC shard size doesn't match .dat file"
                    );
                    return false;
                }
            }
        }

        // Distributed EC (no .dat): any shard count is valid.
        if !dat_exists {
            return true;
        }

        // With .dat: need at least DATA_SHARDS_COUNT shards.
        if shard_count < DATA_SHARDS_COUNT {
            warn!(
                volume_id = vid.0,
                shard_count,
                required = DATA_SHARDS_COUNT,
                "EC volume has .dat but too few shards"
            );
            return false;
        }

        true
    }

    /// Remove all EC-related files for a volume.
    fn remove_ec_volume_files(&self, collection: &str, vid: VolumeId) {
        let base = volume_file_name(&self.directory, collection, vid);
        let idx_base = volume_file_name(&self.idx_directory, collection, vid);
        const MAX_SHARD_COUNT: usize = 32;

        // Remove index files from the idx directory (.ecx, .ecj).
        let _ = fs::remove_file(format!("{}.ecx", idx_base));
        let _ = fs::remove_file(format!("{}.ecj", idx_base));
        // Also try the data directory in case .ecx/.ecj were created before -dir.idx was configured.
        if self.idx_directory != self.directory {
            let _ = fs::remove_file(format!("{}.ecx", base));
            let _ = fs::remove_file(format!("{}.ecj", base));
        }

        // Remove all EC shard files (.ec00 ~ .ec31).
        for i in 0..MAX_SHARD_COUNT {
            let _ = fs::remove_file(format!("{}.ec{:02}", base, i));
        }
    }
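Both EC helpers above derive shard paths with the same `format!` pattern: the volume base name plus a zero-padded two-digit shard index, `.ec00` through `.ec31`. Extracted as a standalone sketch (a hypothetical free function; the real code inlines the `format!` call):

```rust
// Sketch of the EC shard filename convention: "<base>.ec<NN>" with the shard
// index zero-padded to two digits, covering shards 0 through 31.
fn ec_shard_file_name(base: &str, shard_id: usize) -> String {
    format!("{}.ec{:02}", base, shard_id)
}
```

The fixed-width suffix keeps shard files lexicographically ordered in directory listings.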
|
||||
|
||||
/// Find a volume by ID.
|
||||
pub fn find_volume(&self, vid: VolumeId) -> Option<&Volume> {
|
||||
self.volumes.get(&vid)
|
||||
}
|
||||
|
||||
/// Find a volume by ID (mutable).
|
||||
pub fn find_volume_mut(&mut self, vid: VolumeId) -> Option<&mut Volume> {
|
||||
self.volumes.get_mut(&vid)
|
||||
}
|
||||
|
||||
/// Add a volume to this location.
|
||||
pub fn set_volume(&mut self, vid: VolumeId, volume: Volume) {
|
||||
let collection = volume.collection.clone();
|
||||
self.volumes.insert(vid, volume);
|
||||
crate::metrics::VOLUME_GAUGE
|
||||
.with_label_values(&[&collection, "volume"])
|
||||
.inc();
|
||||
}
|
||||
|
||||
/// Create a new volume in this location.
|
||||
pub fn create_volume(
|
||||
&mut self,
|
||||
vid: VolumeId,
|
||||
collection: &str,
|
||||
needle_map_kind: NeedleMapKind,
|
||||
replica_placement: Option<ReplicaPlacement>,
|
||||
ttl: Option<crate::storage::needle::ttl::TTL>,
|
||||
preallocate: u64,
|
||||
version: Version,
|
||||
) -> Result<(), VolumeError> {
|
||||
let mut v = Volume::new(
|
||||
&self.directory,
|
||||
&self.idx_directory,
|
||||
collection,
|
||||
vid,
|
||||
needle_map_kind,
|
||||
replica_placement,
|
||||
ttl,
|
||||
preallocate,
|
||||
version,
|
||||
)?;
|
||||
v.location_disk_space_low = self.is_disk_space_low.clone();
|
||||
crate::metrics::VOLUME_GAUGE
|
||||
.with_label_values(&[collection, "volume"])
|
||||
.inc();
|
||||
self.volumes.insert(vid, v);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Remove and close a volume.
|
||||
pub fn unload_volume(&mut self, vid: VolumeId) -> Option<Volume> {
|
||||
if let Some(mut v) = self.volumes.remove(&vid) {
|
||||
crate::metrics::VOLUME_GAUGE
|
||||
.with_label_values(&[&v.collection, "volume"])
|
||||
.dec();
|
||||
v.close();
|
||||
Some(v)
|
||||
} else {
|
||||
None
|
||||
}
|
||||
}
|
||||
|
||||
/// Remove, close, and delete all files for a volume.
|
||||
pub fn delete_volume(&mut self, vid: VolumeId, only_empty: bool) -> Result<(), VolumeError> {
|
||||
if let Some(mut v) = self.volumes.remove(&vid) {
|
||||
crate::metrics::VOLUME_GAUGE
|
||||
.with_label_values(&[&v.collection, "volume"])
|
||||
.dec();
|
||||
v.destroy(only_empty)?;
|
||||
Ok(())
|
||||
} else {
|
||||
Err(VolumeError::NotFound)
|
||||
}
|
||||
}
|
||||
|
||||
/// Delete all volumes in a collection.
|
||||
pub fn delete_collection(&mut self, collection: &str) -> Result<(), VolumeError> {
|
||||
let vids: Vec<VolumeId> = self
|
||||
.volumes
|
||||
.iter()
|
||||
.filter(|(_, v)| v.collection == collection && !v.is_compacting())
|
||||
.map(|(vid, _)| *vid)
|
||||
.collect();
|
||||
|
||||
for vid in vids {
|
||||
if let Some(mut v) = self.volumes.remove(&vid) {
|
||||
crate::metrics::VOLUME_GAUGE
|
||||
.with_label_values(&[&v.collection, "volume"])
|
||||
.dec();
|
||||
if let Err(e) = v.destroy(false) {
|
||||
warn!(volume_id = vid.0, error = %e, "delete collection: failed to destroy volume");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let ec_vids: Vec<VolumeId> = self
|
||||
.ec_volumes
|
||||
.iter()
|
||||
.filter(|(_, v)| v.collection == collection)
|
||||
.map(|(vid, _)| *vid)
|
||||
.collect();
|
||||
|
||||
for vid in ec_vids {
|
||||
if let Some(mut ec_vol) = self.ec_volumes.remove(&vid) {
|
||||
for _ in 0..ec_vol.shard_count() {
|
||||
crate::metrics::VOLUME_GAUGE
|
||||
.with_label_values(&[collection, "ec_shards"])
|
||||
.dec();
|
||||
}
|
||||
ec_vol.destroy();
|
||||
}
|
||||
}
|
||||
Ok(())
|
||||
}

    // ---- Metrics ----

    /// Number of volumes on this disk.
    pub fn volumes_len(&self) -> usize {
        self.volumes.len()
    }

    /// Get all volume IDs, sorted.
    pub fn volume_ids(&self) -> Vec<VolumeId> {
        let mut ids: Vec<VolumeId> = self.volumes.keys().copied().collect();
        ids.sort();
        ids
    }

    /// Iterate over all volumes.
    pub fn iter_volumes(&self) -> impl Iterator<Item = (&VolumeId, &Volume)> {
        self.volumes.iter()
    }

    /// Number of free volume slots.
    /// Matches Go's FindFreeLocation formula:
    /// free = ((MaxVolumeCount - VolumesLen()) * DataShardsCount - EcShardCount()) / DataShardsCount
    pub fn free_volume_count(&self) -> i32 {
        use crate::storage::erasure_coding::ec_shard::DATA_SHARDS_COUNT;
        let max = self.max_volume_count.load(Ordering::Relaxed);
        let free_count = (max as i64 - self.volumes.len() as i64)
            * DATA_SHARDS_COUNT as i64
            - self.ec_shard_count() as i64;
        let effective_free = free_count / DATA_SHARDS_COUNT as i64;
        if effective_free > 0 {
            effective_free as i32
        } else {
            0
        }
    }
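A quick worked example of the free-slot formula above, as a standalone sketch. The value 10 for the data shard count matches Go's erasure-coding default, which is an assumption here; the point is that EC shards consume fractional volume slots and the result never goes negative.

```rust
// Standalone sketch of the FindFreeLocation formula, assuming 10 data shards
// (Go's erasure-coding default). Every DATA_SHARDS_COUNT EC shards consume
// one volume slot; integer division rounds partial consumption down.
const DATA_SHARDS_COUNT: i64 = 10;

fn free_volume_count(max_volume_count: i64, volumes_len: i64, ec_shard_count: i64) -> i64 {
    let free = (max_volume_count - volumes_len) * DATA_SHARDS_COUNT - ec_shard_count;
    (free / DATA_SHARDS_COUNT).max(0)
}

fn main() {
    // 10 slots, 4 full volumes, no EC shards: 6 free slots.
    assert_eq!(free_volume_count(10, 4, 0), 6);
    // 15 EC shards consume 1.5 slots; the division rounds down to 4 free.
    assert_eq!(free_volume_count(10, 4, 15), 4);
    // An overcommitted disk reports zero, never a negative count.
    assert_eq!(free_volume_count(10, 12, 0), 0);
    println!("ok");
}
```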

    /// Iterate over all volumes.
    pub fn volumes(&self) -> impl Iterator<Item = (&VolumeId, &Volume)> {
        self.volumes.iter()
    }

    /// Iterate over all volumes (mutable).
    pub fn volumes_mut(&mut self) -> impl Iterator<Item = (&VolumeId, &mut Volume)> {
        self.volumes.iter_mut()
    }

    /// Sum of unused space in writable volumes (volumeSizeLimit - actual size per volume).
    /// Used by auto-max-volume-count to estimate how many more volumes can fit.
    pub fn unused_space(&self, volume_size_limit: u64) -> u64 {
        let mut unused: u64 = 0;
        for vol in self.volumes.values() {
            if vol.is_read_only() {
                continue;
            }
            let dat_size = vol.dat_file_size().unwrap_or(0);
            let idx_size = vol.idx_file_size();
            let used = dat_size + idx_size;
            if volume_size_limit > used {
                unused += volume_size_limit - used;
            }
        }
        unused
    }

    /// Check disk space against min_free_space and update is_disk_space_low.
    pub fn check_disk_space(&self) {
        let (total, free) = get_disk_stats(&self.directory);
        if total == 0 {
            return;
        }
        let used = total.saturating_sub(free);
        let is_low = match &self.min_free_space {
            MinFreeSpace::Percent(pct) => {
                let free_pct = (free as f64 / total as f64) * 100.0;
                free_pct < *pct
            }
            MinFreeSpace::Bytes(min_bytes) => free < *min_bytes,
        };
        self.is_disk_space_low.store(is_low, Ordering::Relaxed);
        self.available_space.store(free, Ordering::Relaxed);

        // Update resource gauges
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "all"])
            .set(total as f64);
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "used"])
            .set(used as f64);
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "free"])
            .set(free as f64);
        // "avail" is the same as "free" for us (Go subtracts reserved blocks, but we use statvfs f_bavail)
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "avail"])
            .set(free as f64);
    }

    // ---- EC volume operations ----

    /// Find an EC volume by ID.
    pub fn find_ec_volume(&self, vid: VolumeId) -> Option<&EcVolume> {
        self.ec_volumes.get(&vid)
    }

    /// Find an EC volume by ID (mutable).
    pub fn find_ec_volume_mut(&mut self, vid: VolumeId) -> Option<&mut EcVolume> {
        self.ec_volumes.get_mut(&vid)
    }

    /// Check if this location has an EC volume.
    pub fn has_ec_volume(&self, vid: VolumeId) -> bool {
        self.ec_volumes.contains_key(&vid)
    }

    /// Remove an EC volume, returning it.
    pub fn remove_ec_volume(&mut self, vid: VolumeId) -> Option<EcVolume> {
        self.ec_volumes.remove(&vid)
    }

    /// Mount EC shards for a volume on this location.
    pub fn mount_ec_shards(
        &mut self,
        vid: VolumeId,
        collection: &str,
        shard_ids: &[u32],
    ) -> Result<(), VolumeError> {
        let dir = self.directory.clone();
        let idx_dir = self.idx_directory.clone();
        let ec_vol = self
            .ec_volumes
            .entry(vid)
            .or_insert_with(|| EcVolume::new(&dir, &idx_dir, collection, vid).unwrap());
        ec_vol.disk_type = self.disk_type.clone();

        for &shard_id in shard_ids {
            let shard = EcVolumeShard::new(&dir, collection, vid, shard_id as u8);
            ec_vol.add_shard(shard).map_err(VolumeError::Io)?;
            crate::metrics::VOLUME_GAUGE
                .with_label_values(&[collection, "ec_shards"])
                .inc();
        }
        Ok(())
    }

    /// Unmount EC shards for a volume on this location.
    pub fn unmount_ec_shards(&mut self, vid: VolumeId, shard_ids: &[u32]) {
        if let Some(ec_vol) = self.ec_volumes.get_mut(&vid) {
            let collection = ec_vol.collection.clone();
            for &shard_id in shard_ids {
                ec_vol.remove_shard(shard_id as u8);
                crate::metrics::VOLUME_GAUGE
                    .with_label_values(&[&collection, "ec_shards"])
                    .dec();
            }
            if ec_vol.shard_count() == 0 {
                let mut vol = self.ec_volumes.remove(&vid).unwrap();
                vol.close();
            }
        }
    }

    /// Total number of EC shards on this location.
    pub fn ec_shard_count(&self) -> usize {
        self.ec_volumes
            .values()
            .map(|ecv| ecv.shards.iter().filter(|s| s.is_some()).count())
            .sum()
    }

    /// Iterate over all EC volumes.
    pub fn ec_volumes(&self) -> impl Iterator<Item = (&VolumeId, &EcVolume)> {
        self.ec_volumes.iter()
    }

    /// Close all volumes.
    pub fn close(&mut self) {
        for (_, v) in self.volumes.iter_mut() {
            v.close();
        }
        self.volumes.clear();
        for (_, mut ec_vol) in self.ec_volumes.drain() {
            ec_vol.close();
        }
    }
}

/// Get total and free disk space for a given path.
/// Returns (total_bytes, free_bytes).
pub fn get_disk_stats(path: &str) -> (u64, u64) {
    #[cfg(unix)]
    {
        use std::ffi::CString;
        let c_path = match CString::new(path) {
            Ok(p) => p,
            Err(_) => return (0, 0),
        };
        unsafe {
            let mut stat: libc::statvfs = std::mem::zeroed();
            if libc::statvfs(c_path.as_ptr(), &mut stat) == 0 {
                let all = stat.f_blocks as u64 * stat.f_frsize as u64;
                let free = stat.f_bavail as u64 * stat.f_frsize as u64;
                return (all, free);
            }
        }
        (0, 0)
    }
    #[cfg(not(unix))]
    {
        let _ = path;
        (0, 0)
    }
}

/// Calculate expected EC shard size from .dat file size.
/// Matches Go's `calculateExpectedShardSize`: large blocks (1GB * data_shards) first,
/// then small blocks (1MB * data_shards) for the remainder.
fn calculate_expected_shard_size(dat_file_size: i64) -> i64 {
    let large_batch_size = ERASURE_CODING_LARGE_BLOCK_SIZE as i64 * DATA_SHARDS_COUNT as i64;
    let num_large_batches = dat_file_size / large_batch_size;
    let mut shard_size = num_large_batches * ERASURE_CODING_LARGE_BLOCK_SIZE as i64;
    let remaining = dat_file_size - (num_large_batches * large_batch_size);

    if remaining > 0 {
        let small_batch_size = ERASURE_CODING_SMALL_BLOCK_SIZE as i64 * DATA_SHARDS_COUNT as i64;
        // Ceiling division
        let num_small_batches = (remaining + small_batch_size - 1) / small_batch_size;
        shard_size += num_small_batches * ERASURE_CODING_SMALL_BLOCK_SIZE as i64;
    }

    shard_size
}
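To make the two-tier rounding concrete, here is a standalone sketch of the same formula with the block sizes inlined. The 1 GiB / 1 MiB sizes and the 10 data shards are assumptions taken from the doc comment above, not imports from the crate.

```rust
// Standalone sketch of the expected-shard-size formula: whole large batches
// first, then the remainder rounded up to whole small batches.
const LARGE_BLOCK: i64 = 1 << 30; // 1 GiB (assumed, per the doc comment)
const SMALL_BLOCK: i64 = 1 << 20; // 1 MiB (assumed, per the doc comment)
const DATA_SHARDS: i64 = 10;

fn expected_shard_size(dat_file_size: i64) -> i64 {
    let large_batch = LARGE_BLOCK * DATA_SHARDS;
    let num_large = dat_file_size / large_batch;
    let mut shard = num_large * LARGE_BLOCK;
    let remaining = dat_file_size - num_large * large_batch;
    if remaining > 0 {
        // Ceiling division: round the tail up to whole small batches.
        let small_batch = SMALL_BLOCK * DATA_SHARDS;
        shard += (remaining + small_batch - 1) / small_batch * SMALL_BLOCK;
    }
    shard
}

fn main() {
    // An empty .dat file needs no shard bytes.
    assert_eq!(expected_shard_size(0), 0);
    // A tiny file still rounds up to one small block per shard.
    assert_eq!(expected_shard_size(1), SMALL_BLOCK);
    // Exactly one large batch: each shard holds one large block.
    assert_eq!(expected_shard_size(LARGE_BLOCK * DATA_SHARDS), LARGE_BLOCK);
    // One byte past the batch adds one small block per shard.
    assert_eq!(expected_shard_size(LARGE_BLOCK * DATA_SHARDS + 1), LARGE_BLOCK + SMALL_BLOCK);
    println!("ok");
}
```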

/// Parse a volume filename like "collection_42.dat" or "42.dat" into (collection, VolumeId).
fn parse_volume_filename(filename: &str) -> Option<(String, VolumeId)> {
    let stem = filename
        .strip_suffix(".dat")
        .or_else(|| filename.strip_suffix(".vif"))
        .or_else(|| filename.strip_suffix(".idx"))?;
    if let Some(pos) = stem.rfind('_') {
        let collection = &stem[..pos];
        let id_str = &stem[pos + 1..];
        let id: u32 = id_str.parse().ok()?;
        Some((collection.to_string(), VolumeId(id)))
    } else {
        let id: u32 = stem.parse().ok()?;
        Some((String::new(), VolumeId(id)))
    }
}

// ============================================================================
// Tests
// ============================================================================

#[cfg(test)]
mod tests {
    use super::*;
    use tempfile::TempDir;

    #[test]
    fn test_parse_volume_filename() {
        assert_eq!(
            parse_volume_filename("42.dat"),
            Some(("".to_string(), VolumeId(42)))
        );
        assert_eq!(
            parse_volume_filename("pics_7.dat"),
            Some(("pics".to_string(), VolumeId(7)))
        );
        assert_eq!(
            parse_volume_filename("42.vif"),
            Some(("".to_string(), VolumeId(42)))
        );
        assert_eq!(
            parse_volume_filename("pics_7.idx"),
            Some(("pics".to_string(), VolumeId(7)))
        );
        assert_eq!(parse_volume_filename("notadat.idx"), None);
        assert_eq!(parse_volume_filename("bad.dat"), None);
    }

    #[test]
    fn test_disk_location_create_volume() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        loc.create_volume(
            VolumeId(1),
            "",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        assert_eq!(loc.volumes_len(), 1);
        assert!(loc.find_volume(VolumeId(1)).is_some());
        assert!(loc.find_volume(VolumeId(99)).is_none());
        assert_eq!(loc.free_volume_count(), 9);
    }

    #[test]
    fn test_disk_location_load_existing() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Create volumes
        {
            let mut loc = DiskLocation::new(
                dir,
                dir,
                10,
                DiskType::HardDrive,
                MinFreeSpace::Percent(1.0),
                Vec::new(),
            )
            .unwrap();
            loc.create_volume(
                VolumeId(1),
                "",
                NeedleMapKind::InMemory,
                None,
                None,
                0,
                Version::current(),
            )
            .unwrap();
            loc.create_volume(
                VolumeId(2),
                "test",
                NeedleMapKind::InMemory,
                None,
                None,
                0,
                Version::current(),
            )
            .unwrap();
            loc.close();
        }

        // Reload
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();
        loc.load_existing_volumes(NeedleMapKind::InMemory).unwrap();
        assert_eq!(loc.volumes_len(), 2);

        let ids = loc.volume_ids();
        assert!(ids.contains(&VolumeId(1)));
        assert!(ids.contains(&VolumeId(2)));
    }

    #[test]
    fn test_disk_location_delete_volume() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        loc.create_volume(
            VolumeId(1),
            "",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        loc.create_volume(
            VolumeId(2),
            "",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        assert_eq!(loc.volumes_len(), 2);

        loc.delete_volume(VolumeId(1), false).unwrap();
        assert_eq!(loc.volumes_len(), 1);
        assert!(loc.find_volume(VolumeId(1)).is_none());
    }

    #[test]
    fn test_disk_location_delete_collection() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        loc.create_volume(
            VolumeId(1),
            "pics",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        loc.create_volume(
            VolumeId(2),
            "pics",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        loc.create_volume(
            VolumeId(3),
            "docs",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        assert_eq!(loc.volumes_len(), 3);

        loc.delete_collection("pics").unwrap();
        assert_eq!(loc.volumes_len(), 1);
        assert!(loc.find_volume(VolumeId(3)).is_some());
    }

    #[test]
    fn test_disk_location_delete_collection_removes_ec_volumes() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        let shard_path = format!("{}/pics_7.ec00", dir);
        std::fs::write(&shard_path, b"ec-shard").unwrap();

        loc.mount_ec_shards(VolumeId(7), "pics", &[0]).unwrap();
        assert!(loc.has_ec_volume(VolumeId(7)));
        assert!(std::path::Path::new(&shard_path).exists());
        assert!(std::path::Path::new(&format!("{}/pics_7.ecj", dir)).exists());

        loc.delete_collection("pics").unwrap();

        assert!(!loc.has_ec_volume(VolumeId(7)));
        assert!(!std::path::Path::new(&shard_path).exists());
        assert!(!std::path::Path::new(&format!("{}/pics_7.ecj", dir)).exists());
    }

    #[test]
    fn test_disk_location_persists_directory_uuid_and_tags() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        let loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            vec!["fast".to_string(), "ssd".to_string()],
        )
        .unwrap();
        let directory_uuid = loc.directory_uuid.clone();
        assert_eq!(loc.tags, vec!["fast".to_string(), "ssd".to_string()]);
        drop(loc);

        let reloaded = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();
        assert_eq!(reloaded.directory_uuid, directory_uuid);
    }
}

seaweed-volume/src/storage/erasure_coding/ec_decoder.rs (new file, 261 lines)

//! EC decoding: reconstruct a .dat file from EC shards.
//!
//! Rebuilds the original .dat + .idx files from data shards (.ec00-.ec09)
//! and the sorted index (.ecx) + deletion journal (.ecj).

use std::fs::File;
use std::io::{self, Read, Write};

use crate::storage::erasure_coding::ec_shard::*;
use crate::storage::idx;
use crate::storage::needle::needle::get_actual_size;
use crate::storage::super_block::SUPER_BLOCK_SIZE;
use crate::storage::types::*;
use crate::storage::volume::volume_file_name;

/// Calculate .dat file size from the max offset entry in .ecx.
/// Reads the volume version from the first EC shard (.ec00) superblock,
/// then scans .ecx entries to find the largest (offset + needle_actual_size).
pub fn find_dat_file_size(dir: &str, collection: &str, volume_id: VolumeId) -> io::Result<i64> {
    let base = volume_file_name(dir, collection, volume_id);

    // Read volume version from .ec00 superblock
    let ec00_path = format!("{}.ec00", base);
    let mut ec00 = File::open(&ec00_path)?;
    let mut sb_buf = [0u8; SUPER_BLOCK_SIZE];
    ec00.read_exact(&mut sb_buf)?;
    let version = Version(sb_buf[0]);

    // Start with at least the superblock size
    let mut dat_size: i64 = SUPER_BLOCK_SIZE as i64;

    // Scan .ecx entries
    let ecx_path = format!("{}.ecx", base);
    let ecx_data = std::fs::read(&ecx_path)?;
    let entry_count = ecx_data.len() / NEEDLE_MAP_ENTRY_SIZE;

    for i in 0..entry_count {
        let start = i * NEEDLE_MAP_ENTRY_SIZE;
        let (_, offset, size) =
            idx_entry_from_bytes(&ecx_data[start..start + NEEDLE_MAP_ENTRY_SIZE]);
        if size.is_deleted() {
            continue;
        }
        let entry_stop = offset.to_actual_offset() + get_actual_size(size, version);
        if entry_stop > dat_size {
            dat_size = entry_stop;
        }
    }

    Ok(dat_size)
}
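The size computation above is a max-fold over index entries: the file must extend to the end of whichever live needle stops furthest into it, and never below the superblock. A standalone sketch (the 8-byte superblock and the plain `(offset, size, deleted)` tuples are simplifying assumptions for illustration, not the crate's real types):

```rust
// Standalone sketch of the .dat-size scan over (offset, size, deleted) entries.
const SUPER_BLOCK_SIZE: i64 = 8; // assumed size, for illustration only

fn dat_size_from_entries(entries: &[(i64, i64, bool)]) -> i64 {
    let mut dat_size = SUPER_BLOCK_SIZE;
    for &(offset, size, deleted) in entries {
        if deleted {
            continue; // tombstones do not extend the file
        }
        dat_size = dat_size.max(offset + size);
    }
    dat_size
}

fn main() {
    // The needle ending at 300 wins over the earlier one ending at 150.
    assert_eq!(dat_size_from_entries(&[(100, 50, false), (200, 100, false)]), 300);
    // A deleted needle past the end is ignored.
    assert_eq!(dat_size_from_entries(&[(100, 50, false), (400, 100, true)]), 150);
    // No live entries: at least the superblock.
    assert_eq!(dat_size_from_entries(&[]), SUPER_BLOCK_SIZE);
    println!("ok");
}
```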

/// Reconstruct a .dat file from EC data shards.
///
/// Reads from .ec00-.ec09 and writes a new .dat file.
pub fn write_dat_file_from_shards(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    dat_file_size: i64,
    data_shards: usize,
) -> io::Result<()> {
    let base = volume_file_name(dir, collection, volume_id);
    let dat_path = format!("{}.dat", base);

    // Open data shards
    let mut shards: Vec<EcVolumeShard> = (0..data_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    for shard in &mut shards {
        shard.open()?;
    }

    let mut dat_file = File::create(&dat_path)?;
    let mut remaining = dat_file_size;
    let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE;
    let small_block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let large_row_size = (large_block_size * data_shards) as i64;

    let mut shard_offset: u64 = 0;

    // Read large blocks
    while remaining >= large_row_size {
        for i in 0..data_shards {
            let mut buf = vec![0u8; large_block_size];
            shards[i].read_at(&mut buf, shard_offset)?;
            let to_write = large_block_size.min(remaining as usize);
            dat_file.write_all(&buf[..to_write])?;
            remaining -= to_write as i64;
            if remaining <= 0 {
                break;
            }
        }
        shard_offset += large_block_size as u64;
    }

    // Read small blocks
    while remaining > 0 {
        for i in 0..data_shards {
            let mut buf = vec![0u8; small_block_size];
            shards[i].read_at(&mut buf, shard_offset)?;
            let to_write = small_block_size.min(remaining as usize);
            dat_file.write_all(&buf[..to_write])?;
            remaining -= to_write as i64;
            if remaining <= 0 {
                break;
            }
        }
        shard_offset += small_block_size as u64;
    }

    for shard in &mut shards {
        shard.close();
    }

    dat_file.sync_all()?;
    Ok(())
}
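The reassembly order above is row-major striping: block k of the original file lives in shard k % data_shards at row k / data_shards, so reading one block from each shard in turn, row by row, reproduces the file. A standalone sketch with toy sizes (4-byte blocks, 3 shards, one block tier instead of the large/small split — all illustrative assumptions):

```rust
// Standalone sketch of the striping layout: split a byte buffer into shards
// row-major, then stitch it back. Toy sizes for readability.
const BLOCK: usize = 4;
const SHARDS: usize = 3;

fn split(data: &[u8]) -> Vec<Vec<u8>> {
    let mut shards = vec![Vec::new(); SHARDS];
    for (k, chunk) in data.chunks(BLOCK).enumerate() {
        // Block k goes to shard k % SHARDS; a short final chunk is zero-padded.
        let mut block = chunk.to_vec();
        block.resize(BLOCK, 0);
        shards[k % SHARDS].extend_from_slice(&block);
    }
    // Pad every shard to the same row count so join() can stripe evenly.
    let rows = shards.iter().map(|s| s.len()).max().unwrap() / BLOCK;
    for s in &mut shards {
        s.resize(rows * BLOCK, 0);
    }
    shards
}

fn join(shards: &[Vec<u8>], original_len: usize) -> Vec<u8> {
    let mut out = Vec::new();
    let rows = shards[0].len() / BLOCK;
    for row in 0..rows {
        // One block from each shard per row, in shard order.
        for shard in shards {
            out.extend_from_slice(&shard[row * BLOCK..(row + 1) * BLOCK]);
        }
    }
    out.truncate(original_len); // drop the zero padding
    out
}

fn main() {
    let data: Vec<u8> = (0u8..26).collect(); // 26 bytes -> 7 blocks across 3 shards
    let shards = split(&data);
    assert_eq!(join(&shards, data.len()), data);
    println!("ok");
}
```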

/// Write .idx file from .ecx index + .ecj deletion journal.
///
/// Copies sorted .ecx entries to .idx, then appends tombstones for
/// deleted needles from .ecj.
pub fn write_idx_file_from_ec_index(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
) -> io::Result<()> {
    let base = volume_file_name(dir, collection, volume_id);
    let ecx_path = format!("{}.ecx", base);
    let ecj_path = format!("{}.ecj", base);
    let idx_path = format!("{}.idx", base);

    // Copy .ecx to .idx
    std::fs::copy(&ecx_path, &idx_path)?;

    // Append deletions from .ecj as tombstones
    if std::path::Path::new(&ecj_path).exists() {
        let ecj_data = std::fs::read(&ecj_path)?;
        if !ecj_data.is_empty() {
            // append(true) implies write access
            let mut idx_file = std::fs::OpenOptions::new().append(true).open(&idx_path)?;

            let count = ecj_data.len() / NEEDLE_ID_SIZE;
            for i in 0..count {
                let start = i * NEEDLE_ID_SIZE;
                let needle_id = NeedleId::from_bytes(&ecj_data[start..start + NEEDLE_ID_SIZE]);
                idx::write_index_entry(
                    &mut idx_file,
                    needle_id,
                    Offset::default(),
                    TOMBSTONE_FILE_SIZE,
                )?;
            }
        }
    }

    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::storage::erasure_coding::ec_encoder;
    use crate::storage::needle::needle::Needle;
    use crate::storage::needle_map::NeedleMapKind;
    use crate::storage::volume::Volume;
    use tempfile::TempDir;

    #[test]
    fn test_ec_full_round_trip() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Create volume with data
        let mut v = Volume::new(
            dir,
            dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        let test_data: Vec<(NeedleId, Vec<u8>)> = (1..=3)
            .map(|i| {
                let data = format!("EC round trip data for needle {}", i);
                (NeedleId(i), data.into_bytes())
            })
            .collect();

        for (id, data) in &test_data {
            let mut n = Needle {
                id: *id,
                cookie: Cookie(id.0 as u32),
                data: data.clone(),
                data_size: data.len() as u32,
                ..Needle::default()
            };
            v.write_needle(&mut n, true).unwrap();
        }
        v.sync_to_disk().unwrap();
        let original_dat_size = v.dat_file_size().unwrap();
        v.close();

        // Read original .dat for comparison
        let original_dat = std::fs::read(format!("{}/1.dat", dir)).unwrap();

        // Encode to EC
        let data_shards = 10;
        let parity_shards = 4;
        ec_encoder::write_ec_files(dir, dir, "", VolumeId(1), data_shards, parity_shards).unwrap();

        // Delete original .dat and .idx
        std::fs::remove_file(format!("{}/1.dat", dir)).unwrap();
        std::fs::remove_file(format!("{}/1.idx", dir)).unwrap();

        // Reconstruct from EC shards
        write_dat_file_from_shards(dir, "", VolumeId(1), original_dat_size as i64, data_shards)
            .unwrap();
        write_idx_file_from_ec_index(dir, "", VolumeId(1)).unwrap();

        // Verify reconstructed .dat matches original
        let reconstructed_dat = std::fs::read(format!("{}/1.dat", dir)).unwrap();
        assert_eq!(
            original_dat[..original_dat_size as usize],
            reconstructed_dat[..original_dat_size as usize],
            "reconstructed .dat should match original"
        );

        // Verify we can load and read from the reconstructed volume
        let v2 = Volume::new(
            dir,
            dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        for (id, expected_data) in &test_data {
            let mut n = Needle {
                id: *id,
                ..Needle::default()
            };
            v2.read_needle(&mut n).unwrap();
            assert_eq!(&n.data, expected_data, "needle {} data should match", id);
        }
    }
}

seaweed-volume/src/storage/erasure_coding/ec_encoder.rs (new file, 824 lines)

//! EC encoding: convert a .dat file into 10 data + 4 parity shards.
//!
//! Uses Reed-Solomon erasure coding. The .dat file is split into blocks
//! (1GB large, 1MB small) and encoded across 14 shard files.

use std::fs::File;
use std::io;
#[cfg(not(unix))]
use std::io::{Seek, SeekFrom};

use reed_solomon_erasure::galois_8::ReedSolomon;

use crate::storage::erasure_coding::ec_shard::*;
use crate::storage::idx;
use crate::storage::types::*;
use crate::storage::volume::volume_file_name;

/// Encode a .dat file into EC shard files.
///
/// Creates .ec00-.ec13 files in the same directory.
/// Also creates a sorted .ecx index from the .idx file.
pub fn write_ec_files(
    dir: &str,
    idx_dir: &str,
    collection: &str,
    volume_id: VolumeId,
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    let base = volume_file_name(dir, collection, volume_id);
    let dat_path = format!("{}.dat", base);
    let idx_base = volume_file_name(idx_dir, collection, volume_id);
    let idx_path = format!("{}.idx", idx_base);

    // Create sorted .ecx from .idx
    write_sorted_ecx_from_idx(&idx_path, &format!("{}.ecx", base))?;

    // Encode .dat into shards
    let dat_file = File::open(&dat_path)?;
    let dat_size = dat_file.metadata()?.len() as i64;

    let rs = ReedSolomon::new(data_shards, parity_shards)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, format!("reed-solomon init: {:?}", e)))?;

    // Create shard files
    let total_shards = data_shards + parity_shards;
    let mut shards: Vec<EcVolumeShard> = (0..total_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    for shard in &mut shards {
        shard.create()?;
    }

    // Encode in large blocks, then small blocks
    encode_dat_file(
        &dat_file,
        dat_size,
        &rs,
        &mut shards,
        data_shards,
        parity_shards,
    )?;

    // Close all shards
    for shard in &mut shards {
        shard.close();
    }

    Ok(())
}

/// Rebuild missing EC shard files from existing shards using Reed-Solomon reconstruct.
///
/// This does not require the `.dat` file, only the existing `.ecXX` shard files.
pub fn rebuild_ec_files(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    missing_shard_ids: &[u32],
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    if missing_shard_ids.is_empty() {
        return Ok(());
    }

    let rs = ReedSolomon::new(data_shards, parity_shards)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, format!("reed-solomon init: {:?}", e)))?;

    let total_shards = data_shards + parity_shards;
    let mut shards: Vec<EcVolumeShard> = (0..total_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    // Determine the exact shard size from the first available existing shard
    let mut shard_size = 0;
    for (i, shard) in shards.iter_mut().enumerate() {
        if !missing_shard_ids.contains(&(i as u32)) {
            if shard.open().is_ok() {
                let size = shard.file_size();
                if size > shard_size {
                    shard_size = size;
                }
            } else {
                return Err(io::Error::new(
                    io::ErrorKind::NotFound,
                    format!("missing non-rebuild shard {}", i),
                ));
            }
        }
    }

    if shard_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "all existing shards are empty or cannot find an existing shard to determine size",
        ));
    }

    // Create the missing shards for writing
    for i in missing_shard_ids {
        if let Some(shard) = shards.get_mut(*i as usize) {
            shard.create()?;
        }
    }

    let block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let mut remaining = shard_size;
    let mut offset: u64 = 0;

    // Process all data in blocks
    while remaining > 0 {
        let to_process = remaining.min(block_size as i64) as usize;

        // Allocate buffers for all shards. Option<Vec<u8>> is required by rs.reconstruct()
        let mut buffers: Vec<Option<Vec<u8>>> = vec![None; total_shards];

        // Read available shards
        for (i, shard) in shards.iter().enumerate() {
            if !missing_shard_ids.contains(&(i as u32)) {
                let mut buf = vec![0u8; to_process];
                shard.read_at(&mut buf, offset)?;
                buffers[i] = Some(buf);
            }
        }

        // Reconstruct missing shards
        rs.reconstruct(&mut buffers).map_err(|e| {
            io::Error::new(
                io::ErrorKind::Other,
                format!("reed-solomon reconstruct: {:?}", e),
            )
        })?;

        // Write recovered data into the missing shards
        for i in missing_shard_ids {
            let idx = *i as usize;
            if let Some(buf) = buffers[idx].take() {
                shards[idx].write_all(&buf)?;
            }
        }

        offset += to_process as u64;
        remaining -= to_process as i64;
    }

    // Close all shards
    for shard in &mut shards {
        shard.close();
    }

    Ok(())
}

/// Verify EC shards by computing parity against the existing data and identifying corrupted shards.
pub fn verify_ec_shards(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<(Vec<u32>, Vec<String>)> {
    let rs = ReedSolomon::new(data_shards, parity_shards)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, format!("reed-solomon init: {:?}", e)))?;

    let total_shards = data_shards + parity_shards;
    let mut shards: Vec<EcVolumeShard> = (0..total_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    let mut shard_size = 0;
    let mut broken_shards = std::collections::HashSet::new();
    let mut details = Vec::new();

    for (i, shard) in shards.iter_mut().enumerate() {
        if shard.open().is_ok() {
            let size = shard.file_size();
            if size > shard_size {
                shard_size = size;
            }
        } else {
            broken_shards.insert(i as u32);
            details.push(format!("failed to open or missing shard {}", i));
        }
    }

    if shard_size == 0 || broken_shards.len() >= parity_shards {
        // Can't do much if we don't know the size or have too many missing
        return Ok((broken_shards.into_iter().collect(), details));
    }

    let block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let mut remaining = shard_size;
    let mut offset: u64 = 0;

    while remaining > 0 {
        let to_process = remaining.min(block_size as i64) as usize;
        let mut buffers = vec![vec![0u8; to_process]; total_shards];

        let mut read_failed = false;
        for i in 0..total_shards {
            if !broken_shards.contains(&(i as u32)) {
                if let Err(e) = shards[i].read_at(&mut buffers[i], offset) {
                    broken_shards.insert(i as u32);
                    details.push(format!("read error shard {}: {}", i, e));
                    read_failed = true;
                }
            } else {
                read_failed = true;
            }
        }

        // Only do verification if all shards were readable
        if !read_failed {
            // rs.verify takes a slice of shard slices
            let slice_ptrs: Vec<&[u8]> = buffers.iter().map(|v| v.as_slice()).collect();
            if let Ok(is_valid) = rs.verify(&slice_ptrs) {
                if !is_valid {
                    // Reed-Solomon verification failed somewhere in this batch. We cannot
                    // directly pinpoint a corrupted data shard, but we can re-encode
                    // parity from the data shards and compare it with the parity on
                    // disk: parity shards whose bytes differ from the re-encoded set
                    // are flagged.
                    let mut verify_buffers = buffers.clone();
                    // Clear the parity parts so encode() recomputes them
                    for i in data_shards..total_shards {
                        verify_buffers[i].fill(0);
                    }
                    if rs.encode(&mut verify_buffers).is_ok() {
                        for i in 0..total_shards {
                            if buffers[i] != verify_buffers[i] {
                                broken_shards.insert(i as u32);
                                details.push(format!(
                                    "parity mismatch on shard {} at offset {}",
                                    i, offset
                                ));
                            }
                        }
                    }
                }
            }
        }

        offset += to_process as u64;
        remaining -= to_process as i64;
    }

    // Close all shards
    for shard in &mut shards {
        shard.close();
    }

    let mut broken_vec: Vec<u32> = broken_shards.into_iter().collect();
    broken_vec.sort_unstable();

    Ok((broken_vec, details))
}
|
||||

/// Write sorted .ecx index from .idx file.
fn write_sorted_ecx_from_idx(idx_path: &str, ecx_path: &str) -> io::Result<()> {
    if !std::path::Path::new(idx_path).exists() {
        return Err(io::Error::new(io::ErrorKind::NotFound, "idx file not found"));
    }

    // Read all idx entries
    let mut idx_file = File::open(idx_path)?;
    let mut entries: Vec<(NeedleId, Offset, Size)> = Vec::new();

    idx::walk_index_file(&mut idx_file, 0, |key, offset, size| {
        entries.push((key, offset, size));
        Ok(())
    })?;

    // Sort by NeedleId, then by actual offset so later entries come last
    entries.sort_by_key(|&(key, offset, _)| (key, offset.to_actual_offset()));

    // Remove duplicates (keep the last/latest entry for each key).
    // dedup_by_key keeps the first in each run, so reverse, dedup, reverse back.
    entries.reverse();
    entries.dedup_by_key(|entry| entry.0);
    entries.reverse();

    // Write sorted entries to .ecx
    let mut ecx_file = File::create(ecx_path)?;
    for &(key, offset, size) in &entries {
        idx::write_index_entry(&mut ecx_file, key, offset, size)?;
    }

    Ok(())
}

/// Rebuild the .ecx index file by walking needles in the EC data shards.
///
/// This is the equivalent of Go's `RebuildEcxFile`. It reads the logical .dat
/// content from the EC data shards, walks through needle headers to extract
/// (needle_id, offset, size) entries, deduplicates them, and writes a sorted
/// .ecx index file.
pub fn rebuild_ecx_file(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    data_shards: usize,
) -> io::Result<()> {
    use crate::storage::needle::needle::get_actual_size;
    use crate::storage::super_block::SUPER_BLOCK_SIZE;

    let base = volume_file_name(dir, collection, volume_id);
    let ecx_path = format!("{}.ecx", base);

    // Open data shards to read logical .dat content
    let mut shards: Vec<EcVolumeShard> = (0..data_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    // If any data shard is missing, we can't rebuild the ecx.
    // (Check first, then close in a separate pass, to avoid holding two
    // mutable borrows of `shards` at once.)
    let open_failed = shards.iter_mut().any(|shard| shard.open().is_err());
    if open_failed {
        for s in &mut shards {
            s.close();
        }
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            "cannot open data shard for ecx rebuild",
        ));
    }

    // Determine total logical data size from shard sizes
    let shard_size = shards.iter().map(|s| s.file_size()).max().unwrap_or(0);
    let total_data_size = shard_size * data_shards as i64;

    // Read version from superblock (first byte of logical data)
    let mut sb_buf = [0u8; SUPER_BLOCK_SIZE];
    read_from_data_shards(&shards, &mut sb_buf, 0, data_shards)?;
    let version = Version(sb_buf[0]);

    // Walk needles starting after the superblock
    let mut offset = SUPER_BLOCK_SIZE as i64;
    let header_size = NEEDLE_HEADER_SIZE;
    let mut entries: Vec<(NeedleId, Offset, Size)> = Vec::new();

    while offset + header_size as i64 <= total_data_size {
        // Read needle header (cookie + needle_id + size = 16 bytes)
        let mut header_buf = [0u8; NEEDLE_HEADER_SIZE];
        if read_from_data_shards(&shards, &mut header_buf, offset as u64, data_shards).is_err() {
            break;
        }

        let cookie = Cookie::from_bytes(&header_buf[..COOKIE_SIZE]);
        let needle_id = NeedleId::from_bytes(&header_buf[COOKIE_SIZE..COOKIE_SIZE + NEEDLE_ID_SIZE]);
        let size = Size::from_bytes(&header_buf[COOKIE_SIZE + NEEDLE_ID_SIZE..header_size]);

        // Stop at a zero cookie + zero id: end of data
        if cookie.0 == 0 && needle_id.0 == 0 {
            break;
        }

        // A negative size that is not a deletion tombstone is corrupt
        if size.0 < 0 && !size.is_deleted() {
            break;
        }

        let actual_size = get_actual_size(size, version);
        if actual_size <= 0 || offset + actual_size > total_data_size {
            break;
        }

        entries.push((needle_id, Offset::from_actual_offset(offset), size));

        // Advance to the next needle, aligned to NEEDLE_PADDING_SIZE
        offset += actual_size;
        let padding_rem = offset % NEEDLE_PADDING_SIZE as i64;
        if padding_rem != 0 {
            offset += NEEDLE_PADDING_SIZE as i64 - padding_rem;
        }
    }

    for shard in &mut shards {
        shard.close();
    }

    // Sort by NeedleId, then by offset (later entries override earlier)
    entries.sort_by_key(|&(key, offset, _)| (key, offset.to_actual_offset()));

    // Deduplicate: keep the latest entry per needle_id
    entries.reverse();
    entries.dedup_by_key(|entry| entry.0);
    entries.reverse();

    // Write sorted .ecx
    let mut ecx_file = File::create(&ecx_path)?;
    for &(key, offset, size) in &entries {
        idx::write_index_entry(&mut ecx_file, key, offset, size)?;
    }
    ecx_file.sync_all()?;

    Ok(())
}

/// Read bytes from EC data shards at a logical offset in the .dat file.
fn read_from_data_shards(
    shards: &[EcVolumeShard],
    buf: &mut [u8],
    logical_offset: u64,
    data_shards: usize,
) -> io::Result<()> {
    let small_block = ERASURE_CODING_SMALL_BLOCK_SIZE as u64;
    let data_shards_u64 = data_shards as u64;

    let mut bytes_read = 0u64;
    let mut remaining = buf.len() as u64;
    let mut current_offset = logical_offset;

    while remaining > 0 {
        // Determine which shard, and at what shard-offset, this logical offset
        // maps to. The data is interleaved: large blocks first, then small
        // blocks. For simplicity, use the small block size for all
        // calculations, since large blocks are multiples of small blocks.
        let row_size = small_block * data_shards_u64;
        let row_index = current_offset / row_size;
        let row_offset = current_offset % row_size;
        let shard_index = (row_offset / small_block) as usize;
        let shard_offset = row_index * small_block + (row_offset % small_block);

        if shard_index >= data_shards {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "shard index out of range",
            ));
        }

        // How many bytes can be read from this position before the block ends
        let bytes_left_in_block = small_block - (row_offset % small_block);
        let to_read = remaining.min(bytes_left_in_block) as usize;

        let dest = &mut buf[bytes_read as usize..bytes_read as usize + to_read];
        shards[shard_index].read_at(dest, shard_offset)?;

        bytes_read += to_read as u64;
        remaining -= to_read as u64;
        current_offset += to_read as u64;
    }

    Ok(())
}

/// Encode the .dat file data into shard files.
///
/// Uses a two-phase approach matching Go's ec_encoder.go:
/// 1. Process as many large blocks (1GB) as possible
/// 2. Process remaining data with small blocks (1MB)
fn encode_dat_file(
    dat_file: &File,
    dat_size: i64,
    rs: &ReedSolomon,
    shards: &mut [EcVolumeShard],
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    let mut remaining = dat_size;
    let mut offset: u64 = 0;

    // Phase 1: process large blocks (1GB each) while enough data remains
    let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE;
    let large_row_size = large_block_size * data_shards;

    while remaining >= large_row_size as i64 {
        encode_one_batch(
            dat_file,
            offset,
            large_block_size,
            rs,
            shards,
            data_shards,
            parity_shards,
        )?;
        offset += large_row_size as u64;
        remaining -= large_row_size as i64;
    }

    // Phase 2: process remaining data with small blocks (1MB each)
    let small_block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let small_row_size = small_block_size * data_shards;

    while remaining > 0 {
        let to_process = remaining.min(small_row_size as i64);
        encode_one_batch(
            dat_file,
            offset,
            small_block_size,
            rs,
            shards,
            data_shards,
            parity_shards,
        )?;
        offset += to_process as u64;
        remaining -= to_process;
    }

    Ok(())
}

/// Encode one batch (row) of data.
fn encode_one_batch(
    dat_file: &File,
    offset: u64,
    block_size: usize,
    rs: &ReedSolomon,
    shards: &mut [EcVolumeShard],
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    let total_shards = data_shards + parity_shards;
    // Each batch allocates block_size * total_shards bytes.
    // With large blocks (1 GiB) this is 14 GiB -- guard against OOM.
    let total_alloc = block_size.checked_mul(total_shards).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::InvalidInput,
            "block_size * shard count overflows usize",
        )
    })?;
    // Large-block encoding uses 1 GiB * 14 shards = 14 GiB; allow up to 16 GiB.
    const MAX_BATCH_ALLOC: usize = 16 * 1024 * 1024 * 1024; // 16 GiB safety limit
    if total_alloc > MAX_BATCH_ALLOC {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!(
                "batch allocation too large ({} bytes, limit {} bytes); block_size={} shards={}",
                total_alloc, MAX_BATCH_ALLOC, block_size, total_shards,
            ),
        ));
    }

    // Allocate zero-filled buffers for all shards
    let mut buffers: Vec<Vec<u8>> = (0..total_shards).map(|_| vec![0u8; block_size]).collect();

    // Read data shards from the .dat file
    for i in 0..data_shards {
        let read_offset = offset + (i * block_size) as u64;

        #[cfg(unix)]
        {
            use std::os::unix::fs::FileExt;
            dat_file.read_at(&mut buffers[i], read_offset)?;
        }

        #[cfg(not(unix))]
        {
            use std::io::{Read, Seek, SeekFrom};
            // File::read_at is unix-only; fall back to seek + read on a cloned handle.
            let mut f = dat_file.try_clone()?;
            f.seek(SeekFrom::Start(read_offset))?;
            // A short read near EOF leaves the zero-initialized tail as padding.
            f.read(&mut buffers[i])?;
        }
    }

    // Encode parity shards
    rs.encode(&mut buffers).map_err(|e| {
        io::Error::new(
            io::ErrorKind::Other,
            format!("reed-solomon encode: {:?}", e),
        )
    })?;

    // Write all shard buffers to their files
    for (i, buf) in buffers.iter().enumerate() {
        shards[i].write_all(buf)?;
    }

    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::storage::needle::needle::Needle;
    use crate::storage::needle_map::NeedleMapKind;
    use crate::storage::volume::Volume;
    use tempfile::TempDir;

    #[test]
    fn test_ec_encode_decode_round_trip() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Create a volume with some data
        let mut v = Volume::new(
            dir,
            dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        for i in 1..=5 {
            let data = format!("test data for needle {}", i);
            let mut n = Needle {
                id: NeedleId(i),
                cookie: Cookie(i as u32),
                data: data.as_bytes().to_vec(),
                data_size: data.len() as u32,
                ..Needle::default()
            };
            v.write_needle(&mut n, true).unwrap();
        }
        v.sync_to_disk().unwrap();
        v.close();

        // Encode to EC shards
        let data_shards = 10;
        let parity_shards = 4;
        let total_shards = data_shards + parity_shards;
        write_ec_files(dir, dir, "", VolumeId(1), data_shards, parity_shards).unwrap();

        // Verify shard files exist
        for i in 0..total_shards {
            let path = format!("{}/{}.ec{:02}", dir, 1, i);
            assert!(
                std::path::Path::new(&path).exists(),
                "shard file {} should exist",
                path
            );
        }

        // Verify .ecx exists
        let ecx_path = format!("{}/1.ecx", dir);
        assert!(std::path::Path::new(&ecx_path).exists());
    }

    #[test]
    fn test_reed_solomon_basic() {
        let data_shards = 10;
        let parity_shards = 4;
        let total_shards = data_shards + parity_shards;
        let rs = ReedSolomon::new(data_shards, parity_shards).unwrap();
        let block_size = 1024;
        let mut shards: Vec<Vec<u8>> = (0..total_shards)
            .map(|i| {
                if i < data_shards {
                    vec![(i as u8).wrapping_mul(7); block_size]
                } else {
                    vec![0u8; block_size]
                }
            })
            .collect();

        // Encode
        rs.encode(&mut shards).unwrap();

        // Verify that at least some parity bytes are non-zero
        let parity_nonzero: bool = shards[data_shards..]
            .iter()
            .any(|s| s.iter().any(|&b| b != 0));
        assert!(parity_nonzero);

        // Simulate losing 4 shards and reconstructing
        let original_0 = shards[0].clone();
        let original_1 = shards[1].clone();

        let mut shard_opts: Vec<Option<Vec<u8>>> = shards.into_iter().map(Some).collect();
        shard_opts[0] = None;
        shard_opts[1] = None;
        shard_opts[2] = None;
        shard_opts[3] = None;

        rs.reconstruct(&mut shard_opts).unwrap();

        assert_eq!(shard_opts[0].as_ref().unwrap(), &original_0);
        assert_eq!(shard_opts[1].as_ref().unwrap(), &original_1);
    }

    /// EC encode must read .idx from a separate index directory when configured.
    #[test]
    fn test_ec_encode_with_separate_idx_dir() {
        let dat_tmp = TempDir::new().unwrap();
        let idx_tmp = TempDir::new().unwrap();
        let dat_dir = dat_tmp.path().to_str().unwrap();
        let idx_dir = idx_tmp.path().to_str().unwrap();

        // Create a volume with separate data and index directories
        let mut v = Volume::new(
            dat_dir,
            idx_dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        for i in 1..=5 {
            let data = format!("needle {} payload", i);
            let mut n = Needle {
                id: NeedleId(i),
                cookie: Cookie(i as u32),
                data: data.as_bytes().to_vec(),
                data_size: data.len() as u32,
                ..Needle::default()
            };
            v.write_needle(&mut n, true).unwrap();
        }
        v.sync_to_disk().unwrap();
        v.close();

        // Verify .dat is in the data dir and .idx is in the idx dir
        assert!(std::path::Path::new(&format!("{}/1.dat", dat_dir)).exists());
        assert!(!std::path::Path::new(&format!("{}/1.idx", dat_dir)).exists());
        assert!(std::path::Path::new(&format!("{}/1.idx", idx_dir)).exists());
        assert!(!std::path::Path::new(&format!("{}/1.dat", idx_dir)).exists());

        // EC encode with the separate idx dir
        let data_shards = 10;
        let parity_shards = 4;
        let total_shards = data_shards + parity_shards;
        write_ec_files(
            dat_dir,
            idx_dir,
            "",
            VolumeId(1),
            data_shards,
            parity_shards,
        )
        .unwrap();

        // Verify all 14 shard files in the data dir
        for i in 0..total_shards {
            let path = format!("{}/1.ec{:02}", dat_dir, i);
            assert!(
                std::path::Path::new(&path).exists(),
                "shard {} should exist in data dir",
                path
            );
        }

        // Verify .ecx in the data dir (not the idx dir)
        assert!(std::path::Path::new(&format!("{}/1.ecx", dat_dir)).exists());
        assert!(!std::path::Path::new(&format!("{}/1.ecx", idx_dir)).exists());

        // Verify no shard files leaked into the idx dir
        for i in 0..total_shards {
            let path = format!("{}/1.ec{:02}", idx_dir, i);
            assert!(
                !std::path::Path::new(&path).exists(),
                "shard {} should NOT exist in idx dir",
                path
            );
        }
    }

    /// EC encode should fail gracefully when the .idx is only in the data dir
    /// but we pass a wrong idx_dir. This guards against regressions where
    /// write_ec_files ignores the idx_dir parameter.
    #[test]
    fn test_ec_encode_fails_with_wrong_idx_dir() {
        let dat_tmp = TempDir::new().unwrap();
        let idx_tmp = TempDir::new().unwrap();
        let wrong_tmp = TempDir::new().unwrap();
        let dat_dir = dat_tmp.path().to_str().unwrap();
        let idx_dir = idx_tmp.path().to_str().unwrap();
        let wrong_dir = wrong_tmp.path().to_str().unwrap();

        let mut v = Volume::new(
            dat_dir,
            idx_dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        let mut n = Needle {
            id: NeedleId(1),
            cookie: Cookie(1),
            data: b"hello".to_vec(),
            data_size: 5,
            ..Needle::default()
        };
        v.write_needle(&mut n, true).unwrap();
        v.sync_to_disk().unwrap();
        v.close();

        // Should fail: the .idx is in idx_dir, not wrong_dir
        let result = write_ec_files(dat_dir, wrong_dir, "", VolumeId(1), 10, 4);
        assert!(
            result.is_err(),
            "should fail when idx_dir doesn't contain the .idx"
        );
    }
}
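The encoder above stripes logical .dat bytes across the data shards in fixed-size block rows, and `read_from_data_shards` inverts that mapping. The arithmetic can be illustrated with a standalone sketch (not part of the crate; the constants are shrunk for readability, and `locate` is a hypothetical helper mirroring the small-block math in `read_from_data_shards`):

```rust
// Maps a logical .dat offset to (shard_index, offset_within_shard) under
// small-block interleaving. Block size shrunk to 4 bytes for readability.
const BLOCK: u64 = 4; // stands in for ERASURE_CODING_SMALL_BLOCK_SIZE
const DATA_SHARDS: u64 = 3; // stands in for 10

fn locate(logical: u64) -> (usize, u64) {
    let row_size = BLOCK * DATA_SHARDS; // bytes striped per interleaved row
    let row_index = logical / row_size; // which row of blocks
    let row_offset = logical % row_size; // position within that row
    let shard_index = (row_offset / BLOCK) as usize; // which shard in the row
    let shard_offset = row_index * BLOCK + (row_offset % BLOCK);
    (shard_index, shard_offset)
}

fn main() {
    // Bytes 0..4 land in shard 0, 4..8 in shard 1, 8..12 in shard 2,
    // then byte 12 wraps back to shard 0 at shard offset 4.
    assert_eq!(locate(0), (0, 0));
    assert_eq!(locate(5), (1, 1));
    assert_eq!(locate(12), (0, 4));
    println!("ok");
}
```

Because a read never crosses a block boundary in one step, the real function caps each `read_at` at `bytes_left_in_block` and loops.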

seaweed-volume/src/storage/erasure_coding/ec_locate.rs (new file, 223 lines)
@@ -0,0 +1,223 @@
//! EC data location: maps needle offset/size to shard intervals.
//!
//! Determines which shard(s) contain data for a given needle and at what
//! offsets within those shards. Handles both large (1GB) and small (1MB)
//! block sections.

use crate::storage::erasure_coding::ec_shard::*;
use crate::storage::types::*;

/// An interval to read from EC shards.
#[derive(Debug, Clone)]
pub struct Interval {
    pub block_index: usize,
    pub inner_block_offset: i64,
    pub size: i64,
    pub is_large_block: bool,
    pub large_block_rows_count: usize,
}

impl Interval {
    pub fn to_shard_id_and_offset(&self, data_shards: u32) -> (ShardId, i64) {
        let data_shards_usize = data_shards as usize;
        let shard_id = (self.block_index % data_shards_usize) as ShardId;
        let row_index = self.block_index / data_shards_usize;

        let block_size = if self.is_large_block {
            ERASURE_CODING_LARGE_BLOCK_SIZE as i64
        } else {
            ERASURE_CODING_SMALL_BLOCK_SIZE as i64
        };

        let mut offset = row_index as i64 * block_size + self.inner_block_offset;
        if !self.is_large_block {
            // Small blocks come after large blocks in the shard file
            offset += self.large_block_rows_count as i64 * ERASURE_CODING_LARGE_BLOCK_SIZE as i64;
        }

        (shard_id, offset)
    }
}

/// Locate the EC shard intervals needed to read data at the given offset and size.
///
/// `shard_size` is the size of a single shard file.
pub fn locate_data(offset: i64, size: Size, shard_size: i64, data_shards: u32) -> Vec<Interval> {
    let mut intervals = Vec::new();
    let data_size = size.0 as i64;

    if data_size <= 0 || shard_size <= 0 {
        return intervals;
    }

    let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE as i64;
    let small_block_size = ERASURE_CODING_SMALL_BLOCK_SIZE as i64;
    let large_row_size = large_block_size * data_shards as i64;
    let small_row_size = small_block_size * data_shards as i64;

    // Number of large block rows (shard_size > 0 is guaranteed above)
    let n_large_block_rows = ((shard_size - 1) / large_block_size) as usize;
    let large_section_size = n_large_block_rows as i64 * large_row_size;

    let mut remaining_offset = offset;
    let mut remaining_size = data_size;

    // Does the read start in the large block section?
    if remaining_offset < large_section_size {
        let available_in_large = large_section_size - remaining_offset;
        let to_read = remaining_size.min(available_in_large);

        add_intervals(
            &mut intervals,
            remaining_offset,
            to_read,
            large_block_size,
            large_row_size,
            true,
            n_large_block_rows,
        );

        remaining_offset += to_read;
        remaining_size -= to_read;
    }

    // Anything left falls into the small block section
    if remaining_size > 0 {
        let small_offset = remaining_offset - large_section_size;
        add_intervals(
            &mut intervals,
            small_offset,
            remaining_size,
            small_block_size,
            small_row_size,
            false,
            n_large_block_rows,
        );
    }

    intervals
}

fn add_intervals(
    intervals: &mut Vec<Interval>,
    offset: i64,
    size: i64,
    block_size: i64,
    _row_size: i64,
    is_large_block: bool,
    large_block_rows_count: usize,
) {
    let mut pos = offset;
    let end = offset + size;

    while pos < end {
        let block_index = (pos / block_size) as usize;
        let inner_offset = pos % block_size;
        let remaining_in_block = block_size - inner_offset;
        let interval_size = remaining_in_block.min(end - pos);

        intervals.push(Interval {
            block_index,
            inner_block_offset: inner_offset,
            size: interval_size,
            is_large_block,
            large_block_rows_count,
        });

        pos += interval_size;
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_interval_to_shard_id() {
        let data_shards = 10;
        let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE as i64;

        // Block index 0 → shard 0
        let interval = Interval {
            block_index: 0,
            inner_block_offset: 100,
            size: 50,
            is_large_block: true,
            large_block_rows_count: 1,
        };
        let (shard_id, offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 0);
        assert_eq!(offset, 100);

        // Block index 5 → shard 5
        let interval = Interval {
            block_index: 5,
            inner_block_offset: 0,
            size: 1024,
            is_large_block: true,
            large_block_rows_count: 1,
        };
        let (shard_id, _offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 5);

        // Block index 12 (data_shards=10) → row_index 1, shard_id 2
        let interval = Interval {
            block_index: 12,
            inner_block_offset: 200,
            size: 50,
            is_large_block: true,
            large_block_rows_count: 5,
        };
        let (shard_id, offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 2); // 12 % 10 = 2
        assert_eq!(offset, large_block_size + 200); // row 1 offset + inner_block_offset

        // Block index 10 → shard 0 (second row)
        let interval = Interval {
            block_index: 10,
            inner_block_offset: 0,
            size: 100,
            is_large_block: true,
            large_block_rows_count: 2,
        };
        let (shard_id, offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 0);
        assert_eq!(offset, ERASURE_CODING_LARGE_BLOCK_SIZE as i64); // row 1 offset
    }

    #[test]
    fn test_locate_data_small_file() {
        // Small file: 100 bytes at offset 50, shard size = 1MB
        let intervals = locate_data(50, Size(100), 1024 * 1024, 10);
        assert!(!intervals.is_empty());

        // Should be a single small block interval (no large block rows for a 1MB shard)
        assert_eq!(intervals.len(), 1);
        assert!(!intervals[0].is_large_block);
    }

    #[test]
    fn test_locate_data_empty() {
        let intervals = locate_data(0, Size(0), 1024 * 1024, 10);
        assert!(intervals.is_empty());
    }

    #[test]
    fn test_small_block_after_large() {
        let interval = Interval {
            block_index: 0,
            inner_block_offset: 0,
            size: 100,
            is_large_block: false,
            large_block_rows_count: 2,
        };
        let (_shard_id, offset) = interval.to_shard_id_and_offset(10);
        // Should land after 2 large block rows
        assert_eq!(offset, 2 * ERASURE_CODING_LARGE_BLOCK_SIZE as i64);
    }
}
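The core of `add_intervals` is splitting one `(offset, size)` read at every block boundary. A standalone sketch of that splitting loop (not part of the crate; `split` is a hypothetical helper, block size shrunk to 10 bytes for readability):

```rust
// Split a (offset, size) read into per-block pieces at block boundaries,
// returning (block_index, inner_block_offset, interval_size) tuples.
fn split(offset: i64, size: i64, block_size: i64) -> Vec<(usize, i64, i64)> {
    let mut out = Vec::new();
    let mut pos = offset;
    let end = offset + size;
    while pos < end {
        let block_index = (pos / block_size) as usize;
        let inner = pos % block_size;
        // Take whatever is left in this block, capped by the read's end
        let len = (block_size - inner).min(end - pos);
        out.push((block_index, inner, len));
        pos += len;
    }
    out
}

fn main() {
    // A 25-byte read starting at offset 5 touches three 10-byte blocks:
    // 5 bytes from block 0, all of block 1, then all of block 2.
    assert_eq!(split(5, 25, 10), vec![(0, 5, 5), (1, 0, 10), (2, 0, 10)]);
    println!("ok");
}
```

Each resulting tuple then maps to a concrete shard and shard offset via `Interval::to_shard_id_and_offset`.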

seaweed-volume/src/storage/erasure_coding/ec_shard.rs (new file, 225 lines)
@@ -0,0 +1,225 @@
//! EcVolumeShard: a single shard file (.ec00-.ec13) of an erasure-coded volume.
|
||||
|
||||
use std::fs::{self, File, OpenOptions};
|
||||
use std::io::{self, Write};
|
||||
|
||||
use crate::storage::types::*;
|
||||
|
||||
pub const DATA_SHARDS_COUNT: usize = 10;
|
||||
pub const PARITY_SHARDS_COUNT: usize = 4;
|
||||
pub const TOTAL_SHARDS_COUNT: usize = DATA_SHARDS_COUNT + PARITY_SHARDS_COUNT;
|
||||
pub const MAX_SHARD_COUNT: usize = 32;
|
||||
pub const MIN_TOTAL_DISKS: usize = TOTAL_SHARDS_COUNT / PARITY_SHARDS_COUNT + 1;
|
||||
pub const ERASURE_CODING_LARGE_BLOCK_SIZE: usize = 1024 * 1024 * 1024; // 1GB
|
||||
pub const ERASURE_CODING_SMALL_BLOCK_SIZE: usize = 1024 * 1024; // 1MB
|
||||
|
||||
pub type ShardId = u8;
|
||||
|
||||
/// A single erasure-coded shard file.
|
||||
pub struct EcVolumeShard {
|
||||
pub volume_id: VolumeId,
|
||||
pub shard_id: ShardId,
|
||||
pub collection: String,
|
||||
pub dir: String,
|
||||
pub disk_type: DiskType,
|
||||
ecd_file: Option<File>,
|
||||
ecd_file_size: i64,
|
||||
}
|
||||
|
||||
impl EcVolumeShard {
|
||||
/// Create a new shard reference (does not open the file).
|
||||
pub fn new(dir: &str, collection: &str, volume_id: VolumeId, shard_id: ShardId) -> Self {
|
||||
EcVolumeShard {
|
||||
volume_id,
|
||||
shard_id,
|
||||
collection: collection.to_string(),
|
||||
dir: dir.to_string(),
|
||||
disk_type: DiskType::default(),
|
||||
ecd_file: None,
|
||||
ecd_file_size: 0,
|
||||
}
|
||||
}
|
||||
|
||||
/// Shard file name, e.g. "dir/collection_42.ec03"
|
||||
pub fn file_name(&self) -> String {
|
||||
let base =
|
||||
crate::storage::volume::volume_file_name(&self.dir, &self.collection, self.volume_id);
|
||||
format!("{}.ec{:02}", base, self.shard_id)
|
||||
}
|
||||
|
||||
/// Open the shard file for reading.
|
||||
pub fn open(&mut self) -> io::Result<()> {
|
||||
let path = self.file_name();
|
||||
let file = File::open(&path)?;
|
||||
self.ecd_file_size = file.metadata()?.len() as i64;
|
||||
self.ecd_file = Some(file);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Create the shard file for writing.
|
||||
pub fn create(&mut self) -> io::Result<()> {
|
||||
let path = self.file_name();
|
||||
let file = OpenOptions::new()
|
||||
.read(true)
|
||||
.write(true)
|
||||
.create(true)
|
||||
.truncate(true)
|
||||
.open(&path)?;
|
||||
self.ecd_file = Some(file);
|
||||
self.ecd_file_size = 0;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Read data at a specific offset.
|
||||
pub fn read_at(&self, buf: &mut [u8], offset: u64) -> io::Result<usize> {
|
||||
let file = self
|
||||
.ecd_file
|
||||
.as_ref()
|
||||
.ok_or_else(|| io::Error::new(io::ErrorKind::Other, "shard file not open"))?;
|
||||
|
||||
#[cfg(unix)]
|
||||
{
|
||||
use std::os::unix::fs::FileExt;
|
||||
file.read_at(buf, offset)
|
||||
}
|
||||
|
||||
#[cfg(not(unix))]
|
||||
{
|
||||
use std::io::{Read, Seek, SeekFrom};
|
||||
// File::read_at is unix-only; fall back to seek + read.
|
||||
// We need a mutable reference for seek/read, so clone the handle.
|
||||
let mut f = file.try_clone()?;
|
||||
f.seek(SeekFrom::Start(offset))?;
|
||||
f.read(buf)
|
||||
}
|
||||
}
|
||||
|
||||
/// Write data to the shard file (appends).
|
||||
pub fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
|
||||
let file = self
|
||||
.ecd_file
|
||||
.as_mut()
|
||||
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "shard file not open"))?;
        file.write_all(data)?;
        self.ecd_file_size += data.len() as i64;
        Ok(())
    }

    pub fn file_size(&self) -> i64 {
        self.ecd_file_size
    }

    /// Close the shard file.
    pub fn close(&mut self) {
        if let Some(ref file) = self.ecd_file {
            let _ = file.sync_all();
        }
        self.ecd_file = None;
    }

    /// Delete the shard file from disk.
    pub fn destroy(&mut self) {
        self.close();
        let _ = fs::remove_file(self.file_name());
    }
}

/// ShardBits: bitmap tracking which shards are present.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ShardBits(pub u32);

impl ShardBits {
    pub fn add_shard_id(&mut self, id: ShardId) {
        assert!((id as usize) < 32, "shard id {} out of bounds (max 31)", id);
        self.0 |= 1 << id;
    }

    pub fn remove_shard_id(&mut self, id: ShardId) {
        assert!((id as usize) < 32, "shard id {} out of bounds (max 31)", id);
        self.0 &= !(1 << id);
    }

    pub fn has_shard_id(&self, id: ShardId) -> bool {
        if (id as usize) >= 32 {
            return false;
        }
        self.0 & (1 << id) != 0
    }

    pub fn shard_id_count(&self) -> usize {
        self.0.count_ones() as usize
    }

    /// List of present shard IDs, in ascending order.
    pub fn shard_ids(&self) -> Vec<ShardId> {
        let mut ids = Vec::with_capacity(self.shard_id_count());
        for i in 0..32 {
            if self.has_shard_id(i) {
                ids.push(i);
            }
        }
        ids
    }

    pub fn minus(&self, other: ShardBits) -> ShardBits {
        ShardBits(self.0 & !other.0)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_shard_bits() {
        let mut bits = ShardBits::default();
        assert_eq!(bits.shard_id_count(), 0);

        bits.add_shard_id(0);
        bits.add_shard_id(3);
        bits.add_shard_id(13);
        assert_eq!(bits.shard_id_count(), 3);
        assert!(bits.has_shard_id(0));
        assert!(bits.has_shard_id(3));
        assert!(!bits.has_shard_id(1));

        bits.remove_shard_id(3);
        assert!(!bits.has_shard_id(3));
        assert_eq!(bits.shard_id_count(), 2);
    }

    #[test]
    fn test_shard_bits_ids() {
        let mut bits = ShardBits::default();
        bits.add_shard_id(1);
        bits.add_shard_id(5);
        bits.add_shard_id(9);
        assert_eq!(bits.shard_ids(), vec![1, 5, 9]);
    }

    #[test]
    fn test_shard_bits_minus() {
        let mut a = ShardBits::default();
        a.add_shard_id(0);
        a.add_shard_id(1);
        a.add_shard_id(2);

        let mut b = ShardBits::default();
        b.add_shard_id(1);

        let c = a.minus(b);
        assert_eq!(c.shard_ids(), vec![0, 2]);
    }
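
    // Illustrative extra test, not part of the original change-set: minus()
    // with a disjoint set should be a no-op, and out-of-range shard IDs are
    // simply reported as absent rather than panicking.
    #[test]
    fn test_shard_bits_minus_disjoint_and_bounds() {
        let mut a = ShardBits::default();
        a.add_shard_id(0);
        a.add_shard_id(2);

        let mut b = ShardBits::default();
        b.add_shard_id(1);

        // No overlap: subtracting b leaves a unchanged.
        assert_eq!(a.minus(b), a);
        // IDs >= 32 are never present (has_shard_id guards instead of panicking).
        assert!(!a.has_shard_id(32));
    }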

    #[test]
    fn test_shard_file_name() {
        let shard = EcVolumeShard::new("/data", "pics", VolumeId(42), 3);
        assert_eq!(shard.file_name(), "/data/pics_42.ec03");
    }

    #[test]
    fn test_shard_file_name_no_collection() {
        let shard = EcVolumeShard::new("/data", "", VolumeId(7), 13);
        assert_eq!(shard.file_name(), "/data/7.ec13");
    }
}
seaweed-volume/src/storage/erasure_coding/ec_volume.rs (new file, 944 lines)
@@ -0,0 +1,944 @@
//! EcVolume: an erasure-coded volume with up to 14 shards.
//!
//! Each EcVolume has a sorted index (.ecx) and a deletion journal (.ecj).
//! Shards (.ec00-.ec13) may be distributed across multiple servers.

use std::collections::HashMap;
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::time::{SystemTime, UNIX_EPOCH};

use crate::pb::master_pb;
use crate::storage::erasure_coding::ec_locate;
use crate::storage::erasure_coding::ec_shard::*;
use crate::storage::needle::needle::{get_actual_size, Needle};
use crate::storage::types::*;

/// An erasure-coded volume managing its local shards and index.
pub struct EcVolume {
    pub volume_id: VolumeId,
    pub collection: String,
    pub dir: String,
    pub dir_idx: String,
    pub version: Version,
    pub shards: Vec<Option<EcVolumeShard>>, // indexed by ShardId (0..14)
    pub dat_file_size: i64,
    pub data_shards: u32,
    pub parity_shards: u32,
    ecx_file: Option<File>,
    ecx_file_size: i64,
    ecj_file: Option<File>,
    pub disk_type: DiskType,
    /// Directory where .ecx/.ecj were actually found (may differ from dir_idx after fallback).
    ecx_actual_dir: String,
    /// Maps shard ID -> list of server addresses where that shard exists.
    /// Used for distributed EC reads across the cluster.
    pub shard_locations: HashMap<ShardId, Vec<String>>,
    /// EC volume expiration time (unix epoch seconds), set during EC encode from TTL.
    pub expire_at_sec: u64,
}

/// Read the data/parity shard counts from the volume's .vif file, falling
/// back to the compile-time defaults when the file is missing or invalid.
pub fn read_ec_shard_config(dir: &str, collection: &str, volume_id: VolumeId) -> (u32, u32) {
    let mut data_shards = crate::storage::erasure_coding::ec_shard::DATA_SHARDS_COUNT as u32;
    let mut parity_shards = crate::storage::erasure_coding::ec_shard::PARITY_SHARDS_COUNT as u32;
    let base = crate::storage::volume::volume_file_name(dir, collection, volume_id);
    let vif_path = format!("{}.vif", base);
    if let Ok(vif_content) = std::fs::read_to_string(&vif_path) {
        if let Ok(vif_info) =
            serde_json::from_str::<crate::storage::volume::VifVolumeInfo>(&vif_content)
        {
            if let Some(ec) = vif_info.ec_shard_config {
                if ec.data_shards > 0
                    && ec.parity_shards > 0
                    && (ec.data_shards + ec.parity_shards) <= TOTAL_SHARDS_COUNT as u32
                {
                    data_shards = ec.data_shards;
                    parity_shards = ec.parity_shards;
                }
            }
        }
    }
    (data_shards, parity_shards)
}

impl EcVolume {
    /// Create a new EcVolume. Loads the .ecx index and .ecj journal if present.
    pub fn new(
        dir: &str,
        dir_idx: &str,
        collection: &str,
        volume_id: VolumeId,
    ) -> io::Result<Self> {
        let (data_shards, parity_shards) = read_ec_shard_config(dir, collection, volume_id);

        let total_shards = (data_shards + parity_shards) as usize;
        let mut shards = Vec::with_capacity(total_shards);
        for _ in 0..total_shards {
            shards.push(None);
        }

        // Read expire_at_sec and version from .vif if present (matches Go's MaybeLoadVolumeInfo).
        let (expire_at_sec, vif_version) = {
            let base = crate::storage::volume::volume_file_name(dir, collection, volume_id);
            let vif_path = format!("{}.vif", base);
            if let Ok(vif_content) = std::fs::read_to_string(&vif_path) {
                if let Ok(vif_info) =
                    serde_json::from_str::<crate::storage::volume::VifVolumeInfo>(&vif_content)
                {
                    let ver = if vif_info.version > 0 {
                        Version(vif_info.version as u8)
                    } else {
                        Version::current()
                    };
                    (vif_info.expire_at_sec, ver)
                } else {
                    (0, Version::current())
                }
            } else {
                (0, Version::current())
            }
        };

        let mut vol = EcVolume {
            volume_id,
            collection: collection.to_string(),
            dir: dir.to_string(),
            dir_idx: dir_idx.to_string(),
            version: vif_version,
            shards,
            dat_file_size: 0,
            data_shards,
            parity_shards,
            ecx_file: None,
            ecx_file_size: 0,
            ecj_file: None,
            disk_type: DiskType::default(),
            ecx_actual_dir: dir_idx.to_string(),
            shard_locations: HashMap::new(),
            expire_at_sec,
        };

        // Open the .ecx file (sorted index) in read/write mode for in-place deletion marking.
        // Matches Go, which opens the ecx for writing via MarkNeedleDeleted.
        let ecx_path = vol.ecx_file_name();
        if std::path::Path::new(&ecx_path).exists() {
            let file = OpenOptions::new().read(true).write(true).open(&ecx_path)?;
            vol.ecx_file_size = file.metadata()?.len() as i64;
            vol.ecx_file = Some(file);
        } else if dir_idx != dir {
            // Fall back to the data directory if the .ecx was created before -dir.idx was configured.
            let data_base = crate::storage::volume::volume_file_name(dir, collection, volume_id);
            let fallback_ecx = format!("{}.ecx", data_base);
            if std::path::Path::new(&fallback_ecx).exists() {
                tracing::info!(
                    volume_id = volume_id.0,
                    "ecx file not found in idx dir, falling back to data dir"
                );
                let file = OpenOptions::new().read(true).write(true).open(&fallback_ecx)?;
                vol.ecx_file_size = file.metadata()?.len() as i64;
                vol.ecx_file = Some(file);
                vol.ecx_actual_dir = dir.to_string();
            }
        }

        // Replay the .ecj journal into .ecx on startup (matches Go's RebuildEcxFile).
        vol.rebuild_ecx_from_journal()?;

        // Open the .ecj file (deletion journal); use ecx_actual_dir for consistency.
        let ecj_base =
            crate::storage::volume::volume_file_name(&vol.ecx_actual_dir, collection, volume_id);
        let ecj_path = format!("{}.ecj", ecj_base);
        let ecj_file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .append(true)
            .open(&ecj_path)?;
        vol.ecj_file = Some(ecj_file);

        Ok(vol)
    }

    // ---- File names ----

    #[allow(dead_code)]
    fn base_name(&self) -> String {
        crate::storage::volume::volume_file_name(&self.dir, &self.collection, self.volume_id)
    }

    fn idx_base_name(&self) -> String {
        crate::storage::volume::volume_file_name(&self.dir_idx, &self.collection, self.volume_id)
    }

    pub fn ecx_file_name(&self) -> String {
        format!("{}.ecx", self.idx_base_name())
    }

    pub fn ecj_file_name(&self) -> String {
        format!("{}.ecj", self.idx_base_name())
    }

    /// Sync the EC volume's journal and index files to disk (matches Go's ecv.Sync()).
    /// Go flushes both .ecj and .ecx to ensure in-place deletion marks are persisted.
    pub fn sync_to_disk(&self) -> io::Result<()> {
        if let Some(ref ecj_file) = self.ecj_file {
            ecj_file.sync_all()?;
        }
        if let Some(ref ecx_file) = self.ecx_file {
            ecx_file.sync_all()?;
        }
        Ok(())
    }

    // ---- Shard management ----

    /// Add a shard to this volume.
    pub fn add_shard(&mut self, mut shard: EcVolumeShard) -> io::Result<()> {
        let id = shard.shard_id as usize;
        let total_shards = (self.data_shards + self.parity_shards) as usize;
        if id >= total_shards {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                format!("invalid shard id: {} (max {})", id, total_shards - 1),
            ));
        }
        shard.open()?;
        self.shards[id] = Some(shard);
        Ok(())
    }

    /// Remove and close a shard.
    pub fn remove_shard(&mut self, shard_id: ShardId) {
        if let Some(ref mut shard) = self.shards[shard_id as usize] {
            shard.close();
        }
        self.shards[shard_id as usize] = None;
    }

    /// Get a ShardBits bitmap of locally available shards.
    pub fn shard_bits(&self) -> ShardBits {
        let mut bits = ShardBits::default();
        for (i, shard) in self.shards.iter().enumerate() {
            if shard.is_some() {
                bits.add_shard_id(i as ShardId);
            }
        }
        bits
    }

    /// Count of locally available shards.
    pub fn shard_count(&self) -> usize {
        self.shards.iter().filter(|s| s.is_some()).count()
    }

    /// Whether the volume's TTL has expired (expire_at_sec of 0 means no expiry).
    pub fn is_time_to_destroy(&self) -> bool {
        self.expire_at_sec > 0
            && SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap_or_default()
                .as_secs()
                > self.expire_at_sec
    }

    pub fn to_volume_ec_shard_information_messages(
        &self,
        disk_id: u32,
    ) -> Vec<master_pb::VolumeEcShardInformationMessage> {
        let mut ec_index_bits: u32 = 0;
        let mut shard_sizes = Vec::new();
        for shard in self.shards.iter().flatten() {
            ec_index_bits |= 1u32 << shard.shard_id;
            shard_sizes.push(shard.file_size());
        }

        if ec_index_bits == 0 {
            return Vec::new();
        }

        vec![master_pb::VolumeEcShardInformationMessage {
            id: self.volume_id.0,
            collection: self.collection.clone(),
            ec_index_bits,
            shard_sizes,
            disk_type: self.disk_type.to_string(),
            expire_at_sec: self.expire_at_sec,
            disk_id,
            ..Default::default()
        }]
    }

    // ---- Shard locations (distributed tracking) ----

    /// Set the list of server addresses for a given shard ID.
    pub fn set_shard_locations(&mut self, shard_id: ShardId, locations: Vec<String>) {
        self.shard_locations.insert(shard_id, locations);
    }

    /// Get the list of server addresses for a given shard ID.
    pub fn get_shard_locations(&self, shard_id: ShardId) -> &[String] {
        self.shard_locations
            .get(&shard_id)
            .map(|v| v.as_slice())
            .unwrap_or(&[])
    }

    // ---- Index operations ----

    /// Find a needle's offset and size in the sorted .ecx index via binary search.
    pub fn find_needle_from_ecx(&self, needle_id: NeedleId) -> io::Result<Option<(Offset, Size)>> {
        let ecx_file = self
            .ecx_file
            .as_ref()
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "ecx file not open"))?;

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        if entry_count == 0 {
            return Ok(None);
        }

        // Binary search over the fixed-size index entries.
        let mut lo: usize = 0;
        let mut hi: usize = entry_count;
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let file_offset = (mid * NEEDLE_MAP_ENTRY_SIZE) as u64;

            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                ecx_file.read_exact_at(&mut entry_buf, file_offset)?;
            }

            let (key, offset, size) = idx_entry_from_bytes(&entry_buf);
            if key == needle_id {
                return Ok(Some((offset, size)));
            } else if key < needle_id {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }

        Ok(None)
    }

    /// Locate the EC shard intervals needed to read a needle.
    pub fn locate_needle(
        &self,
        needle_id: NeedleId,
    ) -> io::Result<Option<(Offset, Size, Vec<ec_locate::Interval>)>> {
        let (offset, size) = match self.find_needle_from_ecx(needle_id)? {
            Some((o, s)) => (o, s),
            None => return Ok(None),
        };

        if size.is_deleted() || offset.is_zero() {
            return Ok(None);
        }

        // Match Go's LocateEcShardNeedleInterval: shardSize = shard.ecdFileSize - 1.
        // Shards are usually padded to ErasureCodingSmallBlockSize, so subtract 1
        // to avoid an off-by-one in the large-block row count calculation.
        // If datFileSize is known, use datFileSize / DataShards instead.
        let shard_size = if self.dat_file_size > 0 {
            self.dat_file_size / self.data_shards as i64
        } else {
            self.shard_file_size() - 1
        };
        // Pass the actual on-disk size (header+body+checksum+timestamp+padding)
        // to locate_data, matching Go: types.Size(needle.GetActualSize(size, version)).
        let actual = get_actual_size(size, self.version);
        let intervals = ec_locate::locate_data(
            offset.to_actual_offset(),
            Size(actual as i32),
            shard_size,
            self.data_shards,
        );

        Ok(Some((offset, size, intervals)))
    }

    /// Read a full needle from locally available EC shards.
    ///
    /// Locates the needle in the .ecx index, determines which shard intervals
    /// contain its data, reads from local shards, and parses the result into
    /// a fully populated Needle (including last_modified, checksum, ttl).
    ///
    /// Returns `Ok(None)` if the needle is not found or is deleted.
    /// Returns an error if a required shard is not available locally.
    pub fn read_ec_shard_needle(&self, needle_id: NeedleId) -> io::Result<Option<Needle>> {
        let (offset, size, intervals) = match self.locate_needle(needle_id)? {
            Some(v) => v,
            None => return Ok(None),
        };

        if intervals.is_empty() {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "no intervals for needle",
            ));
        }

        // Compute the total bytes we need to read (full needle on disk).
        let actual_size = get_actual_size(size, self.version) as usize;
        let mut bytes = Vec::with_capacity(actual_size);

        for interval in &intervals {
            let (shard_id, shard_offset) = interval.to_shard_id_and_offset(self.data_shards);
            let shard = self
                .shards
                .get(shard_id as usize)
                .and_then(|s| s.as_ref())
                .ok_or_else(|| {
                    io::Error::new(
                        io::ErrorKind::NotFound,
                        format!("ec shard {} not available locally", shard_id),
                    )
                })?;

            let mut buf = vec![0u8; interval.size as usize];
            shard.read_at(&mut buf, shard_offset as u64)?;
            bytes.extend_from_slice(&buf);
        }

        // Verify we read enough, then truncate to the exact actual_size
        // (intervals may span more bytes than needed).
        if bytes.len() < actual_size {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                format!(
                    "read {} bytes but need {} for needle {}",
                    bytes.len(),
                    actual_size,
                    needle_id
                ),
            ));
        }
        bytes.truncate(actual_size);

        let mut n = Needle::default();
        n.read_bytes(&bytes, offset.to_actual_offset(), size, self.version)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, format!("{}", e)))?;

        Ok(Some(n))
    }

    /// Get the size of a single shard (all shards are the same size).
    fn shard_file_size(&self) -> i64 {
        self.shards
            .iter()
            .flatten()
            .next()
            .map(|s| s.file_size())
            .unwrap_or(0)
    }

    /// Walk the .ecx index and return (file_count, file_deleted_count, total_size).
    /// total_size sums size.Raw() for all entries (including deleted), matching Go's WalkIndex.
    pub fn walk_ecx_stats(&self) -> io::Result<(u64, u64, u64)> {
        let ecx_file = match self.ecx_file.as_ref() {
            Some(f) => f,
            None => return Ok((0, 0, 0)),
        };

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        let mut files: u64 = 0;
        let mut files_deleted: u64 = 0;
        let mut total_size: u64 = 0;
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        for i in 0..entry_count {
            let file_offset = (i * NEEDLE_MAP_ENTRY_SIZE) as u64;
            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                ecx_file.read_exact_at(&mut entry_buf, file_offset)?;
            }
            let (_key, _offset, size) = idx_entry_from_bytes(&entry_buf);
            // Match Go's Size.Raw(): the tombstone (-1) contributes 0,
            // other negative sizes contribute their absolute value.
            if !size.is_tombstone() {
                total_size += size.0.unsigned_abs() as u64;
            }
            if size.is_deleted() {
                files_deleted += 1;
            } else {
                files += 1;
            }
        }

        Ok((files, files_deleted, total_size))
    }

    /// ScrubIndex verifies index integrity of an EC volume.
    /// Matches Go's `(ev *EcVolume) ScrubIndex()` → `idx.CheckIndexFile()`.
    /// Returns (entry_count, errors).
    pub fn scrub_index(&self) -> (u64, Vec<String>) {
        let ecx_file = match self.ecx_file.as_ref() {
            Some(f) => f,
            None => {
                return (
                    0,
                    vec![format!(
                        "no ECX file associated with EC volume {}",
                        self.volume_id.0
                    )],
                )
            }
        };

        if self.ecx_file_size == 0 {
            return (
                0,
                vec![format!(
                    "zero-size ECX file for EC volume {}",
                    self.volume_id.0
                )],
            );
        }

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        let mut entries: Vec<(usize, NeedleId, i64, Size)> = Vec::with_capacity(entry_count);
        let mut errs: Vec<String> = Vec::new();
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        // Walk all entries.
        for i in 0..entry_count {
            let file_offset = (i * NEEDLE_MAP_ENTRY_SIZE) as u64;
            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                if let Err(e) = ecx_file.read_exact_at(&mut entry_buf, file_offset) {
                    errs.push(format!("read ecx entry {}: {}", i, e));
                    continue;
                }
            }
            let (key, offset, size) = idx_entry_from_bytes(&entry_buf);
            entries.push((i, key, offset.to_actual_offset(), size));
        }

        // Sort by offset, then size.
        entries.sort_by(|a, b| a.2.cmp(&b.2).then(a.3 .0.cmp(&b.3 .0)));

        // Check for overlapping needles.
        for i in 1..entries.len() {
            let (idx, id, offset, size) = entries[i];
            let (_, last_id, last_offset, last_size) = entries[i - 1];

            let actual_size =
                crate::storage::needle::needle::get_actual_size(size, self.version);
            let end = if actual_size != 0 {
                offset + actual_size - 1
            } else {
                offset
            };

            let last_actual_size =
                crate::storage::needle::needle::get_actual_size(last_size, self.version);
            let last_end = if last_actual_size != 0 {
                last_offset + last_actual_size - 1
            } else {
                last_offset
            };

            if offset <= last_end {
                errs.push(format!(
                    "needle {} (#{}) at [{}-{}] overlaps needle {} at [{}-{}]",
                    id.0,
                    idx + 1,
                    offset,
                    end,
                    last_id.0,
                    last_offset,
                    last_end
                ));
            }
        }

        // Verify the file size matches the entry count.
        let expected_size = entry_count as i64 * NEEDLE_MAP_ENTRY_SIZE as i64;
        if expected_size != self.ecx_file_size {
            errs.push(format!(
                "expected an index file of size {}, got {}",
                expected_size, self.ecx_file_size
            ));
        }

        (entries.len() as u64, errs)
    }

    // ---- Deletion ----

    /// Mark a needle as deleted in the .ecx file in-place.
    /// Matches Go's MarkNeedleDeleted: binary search the .ecx, then overwrite
    /// the size field with TOMBSTONE_FILE_SIZE.
    fn mark_needle_deleted_in_ecx(&self, needle_id: NeedleId) -> io::Result<bool> {
        let ecx_file = match self.ecx_file.as_ref() {
            Some(f) => f,
            None => return Ok(false),
        };

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        if entry_count == 0 {
            return Ok(false);
        }

        // Binary search for the needle.
        let mut lo: usize = 0;
        let mut hi: usize = entry_count;
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let file_offset = (mid * NEEDLE_MAP_ENTRY_SIZE) as u64;

            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                ecx_file.read_exact_at(&mut entry_buf, file_offset)?;
            }

            let (key, _offset, _size) = idx_entry_from_bytes(&entry_buf);
            if key == needle_id {
                // Found: overwrite the size field with TOMBSTONE_FILE_SIZE.
                let size_offset = file_offset + NEEDLE_ID_SIZE as u64 + OFFSET_SIZE as u64;
                let mut size_buf = [0u8; SIZE_SIZE];
                TOMBSTONE_FILE_SIZE.to_bytes(&mut size_buf);
                #[cfg(unix)]
                {
                    use std::os::unix::fs::FileExt;
                    ecx_file.write_all_at(&size_buf, size_offset)?;
                }
                return Ok(true);
            } else if key < needle_id {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }

        Ok(false) // not found
    }

    /// Replay .ecj journal entries into .ecx on startup.
    /// Matches Go's RebuildEcxFile: for each needle ID in .ecj, marks it
    /// deleted in .ecx, then removes the .ecj file.
    fn rebuild_ecx_from_journal(&mut self) -> io::Result<()> {
        let ecj_path = self.ecj_file_name();
        if !std::path::Path::new(&ecj_path).exists() {
            return Ok(());
        }

        let data = fs::read(&ecj_path)?;
        if data.is_empty() {
            return Ok(());
        }

        for chunk in data.chunks_exact(NEEDLE_ID_SIZE) {
            let needle_id = NeedleId::from_bytes(chunk);
            // Errors for individual entries are non-fatal (the needle may not exist in .ecx).
            let _ = self.mark_needle_deleted_in_ecx(needle_id);
        }

        // Remove the .ecj file after replay (matches Go).
        let _ = fs::remove_file(&ecj_path);

        // Re-create .ecj for future deletions.
        let ecj_file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .append(true)
            .open(&ecj_path)?;
        self.ecj_file = Some(ecj_file);

        Ok(())
    }

    // ---- Deletion journal ----

    /// Append a deleted needle ID to the .ecj journal and mark it in .ecx.
    /// Matches Go's DeleteNeedleFromEcx: marks in .ecx first, then journals.
    pub fn journal_delete(&mut self, needle_id: NeedleId) -> io::Result<()> {
        // Mark deleted in .ecx in-place (matches Go's MarkNeedleDeleted).
        let _ = self.mark_needle_deleted_in_ecx(needle_id);
        let ecj_file = self
            .ecj_file
            .as_mut()
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "ecj file not open"))?;

        let mut buf = [0u8; NEEDLE_ID_SIZE];
        needle_id.to_bytes(&mut buf);
        ecj_file.write_all(&buf)?;
        ecj_file.sync_all()?;
        Ok(())
    }

    /// Append a deleted needle ID to the .ecj journal, validating the cookie first.
    /// Matches Go's DeleteEcShardNeedle, which validates the cookie before journaling.
    /// A cookie of 0 means skip the cookie check (e.g., orphan cleanup).
    pub fn journal_delete_with_cookie(
        &mut self,
        needle_id: NeedleId,
        cookie: crate::storage::types::Cookie,
    ) -> io::Result<()> {
        // cookie == 0 indicates SkipCookieCheck was requested.
        if cookie.0 != 0 {
            // Try to read the needle's cookie from the EC shards to validate:
            // look up the needle in the ecx index to find its offset, then
            // read the header from the shard that contains it.
            if let Ok(Some((offset, size))) = self.find_needle_from_ecx(needle_id) {
                if !size.is_deleted() && !offset.is_zero() {
                    let actual_offset = offset.to_actual_offset() as u64;
                    // Determine which shard contains this offset and read the cookie.
                    let shard_size = self
                        .shards
                        .iter()
                        .filter_map(|s| s.as_ref())
                        .map(|s| s.file_size())
                        .next()
                        .unwrap_or(0) as u64;
                    if shard_size > 0 {
                        let shard_id = (actual_offset / shard_size) as usize;
                        let shard_offset = actual_offset % shard_size;
                        if let Some(Some(shard)) = self.shards.get(shard_id) {
                            let mut header_buf = [0u8; 4]; // the cookie is the first 4 bytes of a needle
                            if shard.read_at(&mut header_buf, shard_offset).is_ok() {
                                let needle_cookie =
                                    crate::storage::types::Cookie(u32::from_be_bytes(header_buf));
                                if needle_cookie != cookie {
                                    return Err(io::Error::new(
                                        io::ErrorKind::InvalidData,
                                        format!("unexpected cookie {:x}", cookie.0),
                                    ));
                                }
                            }
                        }
                    }
                }
            }
        }
        self.journal_delete(needle_id)
    }

    /// Read all deleted needle IDs from the .ecj journal.
    pub fn read_deleted_needles(&self) -> io::Result<Vec<NeedleId>> {
        let ecj_path = self.ecj_file_name();
        if !std::path::Path::new(&ecj_path).exists() {
            return Ok(Vec::new());
        }

        let data = fs::read(&ecj_path)?;
        let mut needles = Vec::with_capacity(data.len() / NEEDLE_ID_SIZE);
        for chunk in data.chunks_exact(NEEDLE_ID_SIZE) {
            needles.push(NeedleId::from_bytes(chunk));
        }
        Ok(needles)
    }

    // ---- Lifecycle ----

    pub fn close(&mut self) {
        for shard in &mut self.shards {
            if let Some(mut s) = shard.take() {
                s.close();
            }
        }
        // Sync .ecx before closing to flush in-place deletion marks (matches Go's ev.ecxFile.Sync()).
        if let Some(ref ecx_file) = self.ecx_file {
            let _ = ecx_file.sync_all();
        }
        self.ecx_file = None;
        self.ecj_file = None;
    }

    pub fn destroy(&mut self) {
        for shard in &mut self.shards {
            if let Some(mut s) = shard.take() {
                s.destroy();
            }
        }
        // Remove .ecx/.ecj/.vif from ecx_actual_dir (where they were found).
        // Go's Destroy() removes the .ecx, .ecj, and .vif files.
        let actual_base = crate::storage::volume::volume_file_name(
            &self.ecx_actual_dir,
            &self.collection,
            self.volume_id,
        );
        let _ = fs::remove_file(format!("{}.ecx", actual_base));
        let _ = fs::remove_file(format!("{}.ecj", actual_base));
        let _ = fs::remove_file(format!("{}.vif", actual_base));
        // Also try the configured idx dir and data dir in case files exist in either.
        if self.ecx_actual_dir != self.dir_idx {
            let _ = fs::remove_file(self.ecx_file_name());
            let _ = fs::remove_file(self.ecj_file_name());
            let idx_base = crate::storage::volume::volume_file_name(
                &self.dir_idx,
                &self.collection,
                self.volume_id,
            );
            let _ = fs::remove_file(format!("{}.vif", idx_base));
        }
        if self.ecx_actual_dir != self.dir && self.dir_idx != self.dir {
            let data_base = crate::storage::volume::volume_file_name(
                &self.dir,
                &self.collection,
                self.volume_id,
            );
            let _ = fs::remove_file(format!("{}.ecx", data_base));
            let _ = fs::remove_file(format!("{}.ecj", data_base));
            let _ = fs::remove_file(format!("{}.vif", data_base));
        }
        self.ecx_file = None;
        self.ecj_file = None;
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use tempfile::TempDir;

    fn write_ecx_file(
        dir: &str,
        collection: &str,
        vid: VolumeId,
        entries: &[(NeedleId, Offset, Size)],
    ) {
        let base = crate::storage::volume::volume_file_name(dir, collection, vid);
        let ecx_path = format!("{}.ecx", base);
        let mut file = File::create(&ecx_path).unwrap();

        // Write sorted entries.
        for &(key, offset, size) in entries {
            let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
            idx_entry_to_bytes(&mut buf, key, offset, size);
            file.write_all(&buf).unwrap();
        }
    }

    #[test]
    fn test_ec_volume_find_needle() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Write sorted ecx entries.
        let entries = vec![
            (NeedleId(1), Offset::from_actual_offset(8), Size(100)),
            (NeedleId(5), Offset::from_actual_offset(200), Size(200)),
            (NeedleId(10), Offset::from_actual_offset(500), Size(300)),
        ];
        write_ecx_file(dir, "", VolumeId(1), &entries);

        let vol = EcVolume::new(dir, dir, "", VolumeId(1)).unwrap();

        // Found
        let result = vol.find_needle_from_ecx(NeedleId(5)).unwrap();
        assert!(result.is_some());
        let (offset, size) = result.unwrap();
        assert_eq!(offset.to_actual_offset(), 200);
        assert_eq!(size, Size(200));

        // Not found
        let result = vol.find_needle_from_ecx(NeedleId(7)).unwrap();
        assert!(result.is_none());
    }
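
    // Illustrative boundary test, not part of the original change-set: the
    // binary search should find the first and last indexed keys and miss
    // keys below and above the indexed range.
    #[test]
    fn test_ec_volume_find_needle_bounds() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let entries = vec![
            (NeedleId(1), Offset::from_actual_offset(8), Size(100)),
            (NeedleId(5), Offset::from_actual_offset(200), Size(200)),
            (NeedleId(10), Offset::from_actual_offset(500), Size(300)),
        ];
        write_ecx_file(dir, "", VolumeId(3), &entries);

        let vol = EcVolume::new(dir, dir, "", VolumeId(3)).unwrap();
        assert!(vol.find_needle_from_ecx(NeedleId(1)).unwrap().is_some());
        assert!(vol.find_needle_from_ecx(NeedleId(10)).unwrap().is_some());
        assert!(vol.find_needle_from_ecx(NeedleId(0)).unwrap().is_none());
        assert!(vol.find_needle_from_ecx(NeedleId(11)).unwrap().is_none());
    }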

    #[test]
    fn test_ec_volume_journal() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Need an ecx file for EcVolume::new to succeed.
        write_ecx_file(dir, "", VolumeId(1), &[]);

        let mut vol = EcVolume::new(dir, dir, "", VolumeId(1)).unwrap();

        vol.journal_delete(NeedleId(10)).unwrap();
        vol.journal_delete(NeedleId(20)).unwrap();

        let deleted = vol.read_deleted_needles().unwrap();
        assert_eq!(deleted, vec![NeedleId(10), NeedleId(20)]);
    }
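
    // Illustrative test, not part of the original change-set. It encodes the
    // assumption that journal_delete also tombstones the entry in-place in
    // .ecx (via mark_needle_deleted_in_ecx), so a subsequent index lookup
    // reports a deleted size.
    #[test]
    fn test_journal_delete_marks_ecx_tombstone() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let entries = vec![(NeedleId(5), Offset::from_actual_offset(8), Size(100))];
        write_ecx_file(dir, "", VolumeId(2), &entries);

        let mut vol = EcVolume::new(dir, dir, "", VolumeId(2)).unwrap();
        vol.journal_delete(NeedleId(5)).unwrap();

        let (_offset, size) = vol.find_needle_from_ecx(NeedleId(5)).unwrap().unwrap();
        assert!(size.is_deleted());
    }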

    #[test]
    fn test_ec_volume_shard_bits() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        write_ecx_file(dir, "", VolumeId(1), &[]);

        let mut vol = EcVolume::new(dir, dir, "", VolumeId(1)).unwrap();
        assert_eq!(vol.shard_count(), 0);

        // Create a shard file so we can add it.
        let mut shard = EcVolumeShard::new(dir, "", VolumeId(1), 3);
        shard.create().unwrap();
        shard.write_all(&[0u8; 100]).unwrap();
        shard.close();

        vol.add_shard(EcVolumeShard::new(dir, "", VolumeId(1), 3))
            .unwrap();
        assert_eq!(vol.shard_count(), 1);
        assert!(vol.shard_bits().has_shard_id(3));
    }

    #[test]
    fn test_ec_volume_uses_collection_prefixed_vif_config() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        write_ecx_file(dir, "pics", VolumeId(1), &[]);

        let vif = crate::storage::volume::VifVolumeInfo {
            ec_shard_config: Some(crate::storage::volume::VifEcShardConfig {
                data_shards: 6,
                parity_shards: 3,
            }),
            ..Default::default()
        };
        let base = crate::storage::volume::volume_file_name(dir, "pics", VolumeId(1));
        std::fs::write(
            format!("{}.vif", base),
            serde_json::to_string_pretty(&vif).unwrap(),
        )
        .unwrap();

        let vol = EcVolume::new(dir, dir, "pics", VolumeId(1)).unwrap();
        assert_eq!(vol.data_shards, 6);
        assert_eq!(vol.parity_shards, 3);
    }

    #[test]
    fn test_ec_volume_invalid_vif_config_falls_back_to_defaults() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        write_ecx_file(dir, "pics", VolumeId(1), &[]);

        let vif = crate::storage::volume::VifVolumeInfo {
            ec_shard_config: Some(crate::storage::volume::VifEcShardConfig {
                data_shards: 10,
                parity_shards: 10,
            }),
            ..Default::default()
        };
        let base = crate::storage::volume::volume_file_name(dir, "pics", VolumeId(1));
        std::fs::write(
            format!("{}.vif", base),
            serde_json::to_string_pretty(&vif).unwrap(),
        )
        .unwrap();

        let vol = EcVolume::new(dir, dir, "pics", VolumeId(1)).unwrap();
        assert_eq!(vol.data_shards, DATA_SHARDS_COUNT as u32);
        assert_eq!(vol.parity_shards, PARITY_SHARDS_COUNT as u32);
    }
}
seaweed-volume/src/storage/erasure_coding/mod.rs (new file, 16 lines)
@@ -0,0 +1,16 @@
//! Erasure coding module for volume data protection.
//!
//! Encodes a volume's .dat file into 10 data + 4 parity shards using
//! Reed-Solomon erasure coding. Can reconstruct from any 10 of 14 shards.

pub mod ec_decoder;
pub mod ec_encoder;
pub mod ec_locate;
pub mod ec_shard;
pub mod ec_volume;

pub use ec_shard::{
    EcVolumeShard, ShardId, DATA_SHARDS_COUNT, MAX_SHARD_COUNT, MIN_TOTAL_DISKS,
    PARITY_SHARDS_COUNT, TOTAL_SHARDS_COUNT,
};
pub use ec_volume::EcVolume;
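Editor's note: the shard bookkeeping exercised by `shard_count()` / `shard_bits().has_shard_id()` in the tests above is, at its core, a per-shard availability bitmask over the 14 (10 data + 4 parity) shard slots. The following is a hypothetical, std-only sketch of that idea; `ShardBits` here is an illustration, not the PR's actual type.

```rust
// Hypothetical sketch of a shard-availability bitmask over 14 shard slots
// (10 data + 4 parity). Not the PR's implementation; names are illustrative.
#[derive(Default, Clone, Copy)]
struct ShardBits(u32);

impl ShardBits {
    fn add_shard_id(&mut self, id: u8) {
        self.0 |= 1u32 << id; // mark shard `id` as locally present
    }
    fn has_shard_id(&self, id: u8) -> bool {
        self.0 & (1u32 << id) != 0
    }
    fn shard_count(&self) -> u32 {
        self.0.count_ones() // population count = number of shards held
    }
}

fn main() {
    let mut bits = ShardBits::default();
    assert_eq!(bits.shard_count(), 0);
    bits.add_shard_id(3);
    assert!(bits.has_shard_id(3));
    assert_eq!(bits.shard_count(), 1);
    // A volume becomes reconstructable once any 10 of the 14 shards exist.
    for id in 0..10 {
        bits.add_shard_id(id);
    }
    assert!(bits.shard_count() >= 10);
}
```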
seaweed-volume/src/storage/idx/mod.rs (new file, 116 lines)
@@ -0,0 +1,116 @@
//! Index file (.idx) format: sequential 17-byte entries.
//!
//! Each entry: NeedleId(8) + Offset(5) + Size(4) = 17 bytes.

use crate::storage::types::*;
use std::io::{self, Read, Seek, SeekFrom};

const ROWS_TO_READ: usize = 1024;

/// Walk all entries in an .idx file, calling `f` for each.
/// Mirrors Go's `WalkIndexFile()`.
pub fn walk_index_file<R, F>(reader: &mut R, start_from: u64, mut f: F) -> io::Result<()>
where
    R: Read + Seek,
    F: FnMut(NeedleId, Offset, Size) -> io::Result<()>,
{
    let reader_offset = start_from * NEEDLE_MAP_ENTRY_SIZE as u64;
    reader.seek(SeekFrom::Start(reader_offset))?;

    let mut buf = vec![0u8; NEEDLE_MAP_ENTRY_SIZE * ROWS_TO_READ];

    loop {
        let count = match reader.read(&mut buf) {
            Ok(0) => return Ok(()),
            Ok(n) => n,
            Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(()),
            Err(e) => return Err(e),
        };

        let mut i = 0;
        while i + NEEDLE_MAP_ENTRY_SIZE <= count {
            let (key, offset, size) = idx_entry_from_bytes(&buf[i..i + NEEDLE_MAP_ENTRY_SIZE]);
            f(key, offset, size)?;
            i += NEEDLE_MAP_ENTRY_SIZE;
        }
    }
}

/// Write a single index entry to a writer.
pub fn write_index_entry<W: io::Write>(
    writer: &mut W,
    key: NeedleId,
    offset: Offset,
    size: Size,
) -> io::Result<()> {
    let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
    idx_entry_to_bytes(&mut buf, key, offset, size);
    writer.write_all(&buf)
}

#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;

    #[test]
    fn test_walk_index_file() {
        // Create a small index with 3 entries
        let mut data = Vec::new();
        let entries = vec![
            (NeedleId(1), Offset::from_actual_offset(0), Size(100)),
            (NeedleId(2), Offset::from_actual_offset(128), Size(200)),
            (NeedleId(3), Offset::from_actual_offset(384), Size(300)),
        ];
        for (key, offset, size) in &entries {
            let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
            idx_entry_to_bytes(&mut buf, *key, *offset, *size);
            data.extend_from_slice(&buf);
        }

        let mut cursor = Cursor::new(data);
        let mut collected = Vec::new();
        walk_index_file(&mut cursor, 0, |key, offset, size| {
            collected.push((key, offset.to_actual_offset(), size));
            Ok(())
        })
        .unwrap();

        assert_eq!(collected.len(), 3);
        assert_eq!(collected[0].0, NeedleId(1));
        assert_eq!(collected[0].1, 0);
        assert_eq!(collected[0].2, Size(100));
        assert_eq!(collected[1].0, NeedleId(2));
        assert_eq!(collected[2].0, NeedleId(3));
    }

    #[test]
    fn test_walk_empty() {
        let mut cursor = Cursor::new(Vec::new());
        let mut count = 0;
        walk_index_file(&mut cursor, 0, |_, _, _| {
            count += 1;
            Ok(())
        })
        .unwrap();
        assert_eq!(count, 0);
    }

    #[test]
    fn test_write_index_entry() {
        let mut buf = Vec::new();
        write_index_entry(
            &mut buf,
            NeedleId(42),
            Offset::from_actual_offset(8 * 10),
            Size(512),
        )
        .unwrap();
        assert_eq!(buf.len(), NEEDLE_MAP_ENTRY_SIZE);

        let (key, offset, size) = idx_entry_from_bytes(&buf);
        assert_eq!(key, NeedleId(42));
        assert_eq!(offset.to_actual_offset(), 80);
        assert_eq!(size, Size(512));
    }
}
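Editor's note: the 17-byte entry layout documented above (NeedleId(8) + Offset(5) + Size(4), big-endian) can be sketched in isolation. This is a hypothetical illustration, not the PR's `idx_entry_to_bytes`/`idx_entry_from_bytes`; it treats the 5-byte offset as raw units, whereas the real code goes through `Offset::from_actual_offset`, which may apply its own scaling.

```rust
// Hypothetical sketch of the 17-byte .idx entry layout described above:
// NeedleId(8) + Offset(5) + Size(4), all big-endian. Illustrative names only.
const ENTRY_SIZE: usize = 17;

fn pack_entry(key: u64, offset: u64, size: u32) -> [u8; ENTRY_SIZE] {
    let mut buf = [0u8; ENTRY_SIZE];
    buf[0..8].copy_from_slice(&key.to_be_bytes());
    // The 5-byte offset is the low 5 bytes of a big-endian u64.
    buf[8..13].copy_from_slice(&offset.to_be_bytes()[3..8]);
    buf[13..17].copy_from_slice(&size.to_be_bytes());
    buf
}

fn unpack_entry(buf: &[u8; ENTRY_SIZE]) -> (u64, u64, u32) {
    let key = u64::from_be_bytes(buf[0..8].try_into().unwrap());
    let mut off = [0u8; 8];
    off[3..8].copy_from_slice(&buf[8..13]); // zero-pad the high 3 bytes
    let offset = u64::from_be_bytes(off);
    let size = u32::from_be_bytes(buf[13..17].try_into().unwrap());
    (key, offset, size)
}

fn main() {
    let entry = pack_entry(42, 8 * 10, 512);
    assert_eq!(unpack_entry(&entry), (42, 80, 512));
}
```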
seaweed-volume/src/storage/mod.rs (new file, 9 lines)
@@ -0,0 +1,9 @@
pub mod disk_location;
pub mod erasure_coding;
pub mod idx;
pub mod needle;
pub mod needle_map;
pub mod store;
pub mod super_block;
pub mod types;
pub mod volume;
seaweed-volume/src/storage/needle/crc.rs (new file, 73 lines)
@@ -0,0 +1,73 @@
//! CRC32-Castagnoli checksum for needle data integrity.
//!
//! Matches Go's `crc32.MakeTable(crc32.Castagnoli)` exactly.
//! The CRC is stored as raw u32 (not the `.Value()` legacy transform).

/// CRC32-Castagnoli checksum wrapper.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct CRC(pub u32);

impl CRC {
    /// Compute CRC from a byte slice (starting from 0).
    pub fn new(data: &[u8]) -> Self {
        CRC(0).update(data)
    }

    /// Update the CRC with additional bytes.
    pub fn update(self, data: &[u8]) -> Self {
        CRC(crc32c::crc32c_append(self.0, data))
    }

    /// Legacy `.Value()` function — deprecated in Go but needed for backward compat check.
    /// Formula: (crc >> 15 | crc << 17) + 0xa282ead8
    pub fn legacy_value(&self) -> u32 {
        (self.0 >> 15 | self.0 << 17).wrapping_add(0xa282ead8)
    }
}

impl From<u32> for CRC {
    fn from(v: u32) -> Self {
        CRC(v)
    }
}

impl From<CRC> for u32 {
    fn from(c: CRC) -> Self {
        c.0
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_crc_empty() {
        let crc = CRC::new(&[]);
        assert_eq!(crc.0, 0);
    }

    #[test]
    fn test_crc_known_value() {
        // CRC32-C of "hello" — verify it produces a non-zero deterministic value
        let crc = CRC::new(b"hello");
        assert_ne!(crc.0, 0);
        // Same input produces same output
        assert_eq!(crc, CRC::new(b"hello"));
    }

    #[test]
    fn test_crc_incremental() {
        let crc1 = CRC::new(b"hello world");
        let crc2 = CRC::new(b"hello").update(b" world");
        assert_eq!(crc1, crc2);
    }

    #[test]
    fn test_crc_legacy_value() {
        let crc = CRC(0x12345678);
        let v = crc.legacy_value();
        let expected = (0x12345678u32 >> 15 | 0x12345678u32 << 17).wrapping_add(0xa282ead8);
        assert_eq!(v, expected);
    }
}
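Editor's note: the legacy `.Value()` formula above, `(crc >> 15 | crc << 17) + 0xa282ead8`, is just a 17-bit left rotation of the 32-bit CRC plus a constant. A pure-arithmetic, std-only check (not part of the PR):

```rust
// Std-only demonstration that the legacy transform is rotate_left(17) plus
// a constant. This mirrors the formula in `legacy_value` above.
fn legacy_value(crc: u32) -> u32 {
    (crc >> 15 | crc << 17).wrapping_add(0xa282ead8)
}

fn main() {
    for crc in [0u32, 1, 0x12345678, u32::MAX] {
        // (x >> 15) | (x << 17) on a u32 is exactly a left rotation by 17.
        assert_eq!(legacy_value(crc), crc.rotate_left(17).wrapping_add(0xa282ead8));
    }
}
```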
seaweed-volume/src/storage/needle/mod.rs (new file, 7 lines)
@@ -0,0 +1,7 @@
pub mod crc;
pub mod needle;
pub mod ttl;

pub use crc::CRC;
pub use needle::Needle;
pub use ttl::TTL;
seaweed-volume/src/storage/needle/needle.rs (new file, 944 lines)
@@ -0,0 +1,944 @@
//! Needle: the individual file object stored in a volume.
//!
//! Binary format (Version 2/3):
//!   Header (16 bytes): Cookie(4) + NeedleId(8) + Size(4)
//!   Body (Size bytes):
//!     DataSize(4) + Data(DataSize) + Flags(1)
//!     [if HasName]: NameSize(1) + Name(NameSize)
//!     [if HasMime]: MimeSize(1) + Mime(MimeSize)
//!     [if HasLastMod]: LastModified(5)
//!     [if HasTtl]: TTL(2)
//!     [if HasPairs]: PairsSize(2) + Pairs(PairsSize)
//!   Tail:
//!     Checksum(4) + [if V3: AppendAtNs(8)] + Padding(0-7)

use super::crc::CRC;
use super::ttl::TTL;
use crate::storage::types::*;

// Flag bits (matching Go constants)
pub const FLAG_IS_COMPRESSED: u8 = 0x01;
pub const FLAG_HAS_NAME: u8 = 0x02;
pub const FLAG_HAS_MIME: u8 = 0x04;
pub const FLAG_HAS_LAST_MODIFIED_DATE: u8 = 0x08;
pub const FLAG_HAS_TTL: u8 = 0x10;
pub const FLAG_HAS_PAIRS: u8 = 0x20;
pub const FLAG_IS_CHUNK_MANIFEST: u8 = 0x80;

pub const LAST_MODIFIED_BYTES_LENGTH: usize = 5;
pub const TTL_BYTES_LENGTH: usize = 2;

#[derive(Debug, Clone, Default)]
pub struct Needle {
    pub cookie: Cookie,
    pub id: NeedleId,
    pub size: Size, // sum of body content fields

    // Version 2+ fields
    pub data_size: u32,
    pub data: Vec<u8>,
    pub flags: u8,
    pub name_size: u8,
    pub name: Vec<u8>, // max 255 bytes
    pub mime_size: u8,
    pub mime: Vec<u8>, // max 255 bytes
    pub pairs_size: u16,
    pub pairs: Vec<u8>, // max 64KB, JSON
    pub last_modified: u64, // stored as 5 bytes on disk
    pub ttl: Option<TTL>,

    // Tail fields
    pub checksum: CRC,
    pub append_at_ns: u64, // Version 3 only
    pub padding: Vec<u8>,
}

impl Needle {
    // ---- Flag accessors (matching Go) ----

    pub fn is_compressed(&self) -> bool {
        self.flags & FLAG_IS_COMPRESSED != 0
    }
    pub fn set_is_compressed(&mut self) {
        self.flags |= FLAG_IS_COMPRESSED;
    }

    pub fn has_name(&self) -> bool {
        self.flags & FLAG_HAS_NAME != 0
    }
    pub fn set_has_name(&mut self) {
        self.flags |= FLAG_HAS_NAME;
    }

    pub fn has_mime(&self) -> bool {
        self.flags & FLAG_HAS_MIME != 0
    }
    pub fn set_has_mime(&mut self) {
        self.flags |= FLAG_HAS_MIME;
    }

    pub fn has_last_modified_date(&self) -> bool {
        self.flags & FLAG_HAS_LAST_MODIFIED_DATE != 0
    }
    pub fn set_has_last_modified_date(&mut self) {
        self.flags |= FLAG_HAS_LAST_MODIFIED_DATE;
    }

    pub fn has_ttl(&self) -> bool {
        self.flags & FLAG_HAS_TTL != 0
    }
    pub fn set_has_ttl(&mut self) {
        self.flags |= FLAG_HAS_TTL;
    }

    pub fn has_pairs(&self) -> bool {
        self.flags & FLAG_HAS_PAIRS != 0
    }
    pub fn set_has_pairs(&mut self) {
        self.flags |= FLAG_HAS_PAIRS;
    }

    pub fn is_chunk_manifest(&self) -> bool {
        self.flags & FLAG_IS_CHUNK_MANIFEST != 0
    }
    pub fn set_is_chunk_manifest(&mut self) {
        self.flags |= FLAG_IS_CHUNK_MANIFEST;
    }

    // ---- Header parsing ----

    /// Parse the 16-byte needle header.
    pub fn parse_header(bytes: &[u8]) -> (Cookie, NeedleId, Size) {
        assert!(bytes.len() >= NEEDLE_HEADER_SIZE);
        let cookie = Cookie::from_bytes(&bytes[0..COOKIE_SIZE]);
        let id = NeedleId::from_bytes(&bytes[COOKIE_SIZE..COOKIE_SIZE + NEEDLE_ID_SIZE]);
        let size = Size::from_bytes(&bytes[COOKIE_SIZE + NEEDLE_ID_SIZE..NEEDLE_HEADER_SIZE]);
        (cookie, id, size)
    }

    /// Parse needle header into self.
    pub fn read_header(&mut self, bytes: &[u8]) {
        let (cookie, id, size) = Self::parse_header(bytes);
        self.cookie = cookie;
        self.id = id;
        self.size = size;
    }
    // ---- Body reading (Version 2/3) ----

    /// Read version 2/3 body metadata only — skips copying the data payload.
    /// Sets `data_size` and all metadata fields but leaves `data` empty.
    pub fn read_body_v2_meta_only(&mut self, bytes: &[u8]) -> Result<(), NeedleError> {
        let len_bytes = bytes.len();
        let mut index = 0;

        // DataSize (4 bytes)
        if index + 4 > len_bytes {
            return Err(NeedleError::IndexOutOfRange(1));
        }
        self.data_size = u32::from_be_bytes([
            bytes[index],
            bytes[index + 1],
            bytes[index + 2],
            bytes[index + 3],
        ]);
        index += 4;

        // Skip data bytes (do NOT copy them)
        if index + self.data_size as usize > len_bytes {
            return Err(NeedleError::IndexOutOfRange(1));
        }
        index += self.data_size as usize;

        // Read non-data metadata
        self.read_body_v2_non_data(&bytes[index..])?;
        Ok(())
    }

    /// Read full needle from bytes but skip copying the data payload.
    /// Sets all metadata fields, checksum, etc. but leaves `data` empty.
    pub fn read_bytes_meta_only(
        &mut self,
        bytes: &[u8],
        offset: i64,
        expected_size: Size,
        version: Version,
    ) -> Result<(), NeedleError> {
        self.read_header(bytes);

        if self.size != expected_size {
            return Err(NeedleError::SizeMismatch {
                offset,
                id: self.id,
                found: self.size,
                expected: expected_size,
            });
        }

        let body_start = NEEDLE_HEADER_SIZE;
        let body_end = body_start + self.size.0 as usize;

        if version == VERSION_1 {
            // V1 has no metadata — data is the entire body
            self.data_size = self.size.0 as u32;
        } else if self.size.0 == 0 {
            // Tombstones have no DataSize/body section; metadata starts at the tail.
            self.data_size = 0;
        } else {
            self.read_body_v2_meta_only(&bytes[body_start..body_end])?;
        }

        // Read tail but skip CRC validation (no data to check against)
        self.read_tail_meta_only(&bytes[body_end..], version)?;
        Ok(())
    }

    /// Paged meta-only parse: accepts the 20-byte header+DataSize prefix and the
    /// meta tail bytes (everything after the data payload). This avoids reading
    /// the data payload from disk at all, matching Go's `ReadNeedleMeta`.
    pub fn read_paged_meta(
        &mut self,
        header_bytes: &[u8], // first 20 bytes: NEEDLE_HEADER_SIZE + DATA_SIZE_SIZE
        meta_bytes: &[u8],   // tail: non-data body metadata + checksum + timestamp + padding
        offset: i64,
        expected_size: Size,
        version: Version,
    ) -> Result<(), NeedleError> {
        // Parse the 16-byte header
        self.read_header(header_bytes);

        if self.size != expected_size {
            return Err(NeedleError::SizeMismatch {
                offset,
                id: self.id,
                found: self.size,
                expected: expected_size,
            });
        }

        if version == VERSION_1 {
            self.data_size = self.size.0 as u32;
        } else if self.size.0 == 0 {
            // Tombstone
            self.data_size = 0;
        } else {
            // Extract DataSize from bytes 16..20
            self.data_size = u32::from_be_bytes([
                header_bytes[NEEDLE_HEADER_SIZE],
                header_bytes[NEEDLE_HEADER_SIZE + 1],
                header_bytes[NEEDLE_HEADER_SIZE + 2],
                header_bytes[NEEDLE_HEADER_SIZE + 3],
            ]);

            // meta_bytes starts with the non-data body metadata (flags, name, mime, etc.)
            // followed by the tail (checksum + timestamp + padding).
            // readNeedleDataVersion2NonData returns the index where it stopped.
            let index = self.read_body_v2_non_data(meta_bytes)?;
            self.read_tail_meta_only(&meta_bytes[index..], version)?;
            return Ok(());
        }

        // For VERSION_1 or tombstones, meta_bytes IS the tail
        self.read_tail_meta_only(meta_bytes, version)?;
        Ok(())
    }

    /// Read tail without CRC validation (used when data was not read).
    fn read_tail_meta_only(
        &mut self,
        tail_bytes: &[u8],
        version: Version,
    ) -> Result<(), NeedleError> {
        if tail_bytes.len() < NEEDLE_CHECKSUM_SIZE {
            return Err(NeedleError::TailTooShort);
        }

        self.checksum = CRC(u32::from_be_bytes([
            tail_bytes[0],
            tail_bytes[1],
            tail_bytes[2],
            tail_bytes[3],
        ]));

        if version == VERSION_3 {
            let ts_offset = NEEDLE_CHECKSUM_SIZE;
            if tail_bytes.len() < ts_offset + TIMESTAMP_SIZE {
                return Err(NeedleError::TailTooShort);
            }
            self.append_at_ns = u64::from_be_bytes([
                tail_bytes[ts_offset],
                tail_bytes[ts_offset + 1],
                tail_bytes[ts_offset + 2],
                tail_bytes[ts_offset + 3],
                tail_bytes[ts_offset + 4],
                tail_bytes[ts_offset + 5],
                tail_bytes[ts_offset + 6],
                tail_bytes[ts_offset + 7],
            ]);
        }

        Ok(())
    }
    /// Read the version 2/3 body data from bytes (size bytes starting after header).
    /// Returns IndexOutOfRange errors for truncated data (matching Go's readNeedleDataVersion2).
    pub fn read_body_v2(&mut self, bytes: &[u8]) -> Result<(), NeedleError> {
        let len_bytes = bytes.len();
        let mut index = 0;

        // DataSize (4 bytes)
        if index + 4 > len_bytes {
            return Ok(()); // tolerate EOF
        }
        self.data_size = u32::from_be_bytes([
            bytes[index],
            bytes[index + 1],
            bytes[index + 2],
            bytes[index + 3],
        ]);
        index += 4;

        // Data
        if index + self.data_size as usize > len_bytes {
            return Err(NeedleError::IndexOutOfRange(1));
        }
        self.data = bytes[index..index + self.data_size as usize].to_vec();
        index += self.data_size as usize;

        // Read non-data metadata
        self.read_body_v2_non_data(&bytes[index..])?;
        Ok(())
    }

    /// Read version 2/3 metadata fields (everything after Data).
    /// Returns IndexOutOfRange errors for truncated data (matching Go's readNeedleDataVersion2).
    fn read_body_v2_non_data(&mut self, bytes: &[u8]) -> Result<usize, NeedleError> {
        let len_bytes = bytes.len();
        let mut index = 0;

        // Flags (1 byte)
        if index < len_bytes {
            self.flags = bytes[index];
            index += 1;
        } else {
            return Ok(index);
        }

        // Name
        if index < len_bytes && self.has_name() {
            self.name_size = bytes[index];
            index += 1;
            if index + self.name_size as usize > len_bytes {
                return Err(NeedleError::IndexOutOfRange(2));
            }
            self.name = bytes[index..index + self.name_size as usize].to_vec();
            index += self.name_size as usize;
        }

        // Mime
        if index < len_bytes && self.has_mime() {
            self.mime_size = bytes[index];
            index += 1;
            if index + self.mime_size as usize > len_bytes {
                return Err(NeedleError::IndexOutOfRange(3));
            }
            self.mime = bytes[index..index + self.mime_size as usize].to_vec();
            index += self.mime_size as usize;
        }

        // LastModified (5 bytes)
        if index < len_bytes && self.has_last_modified_date() {
            if index + LAST_MODIFIED_BYTES_LENGTH > len_bytes {
                return Err(NeedleError::IndexOutOfRange(4));
            }
            self.last_modified = bytes_to_u64_5(&bytes[index..index + LAST_MODIFIED_BYTES_LENGTH]);
            index += LAST_MODIFIED_BYTES_LENGTH;
        }

        // TTL (2 bytes)
        if index < len_bytes && self.has_ttl() {
            if index + TTL_BYTES_LENGTH > len_bytes {
                return Err(NeedleError::IndexOutOfRange(5));
            }
            self.ttl = Some(TTL::from_bytes(&bytes[index..index + TTL_BYTES_LENGTH]));
            index += TTL_BYTES_LENGTH;
        }

        // Pairs
        if index < len_bytes && self.has_pairs() {
            if index + 2 > len_bytes {
                return Err(NeedleError::IndexOutOfRange(6));
            }
            self.pairs_size = u16::from_be_bytes([bytes[index], bytes[index + 1]]);
            index += 2;
            if index + self.pairs_size as usize > len_bytes {
                return Err(NeedleError::IndexOutOfRange(7));
            }
            self.pairs = bytes[index..index + self.pairs_size as usize].to_vec();
            index += self.pairs_size as usize;
        }

        Ok(index)
    }

    // ---- Tail reading ----

    /// Read the needle tail (checksum + optional timestamp + padding).
    pub fn read_tail(&mut self, tail_bytes: &[u8], version: Version) -> Result<(), NeedleError> {
        if tail_bytes.len() < NEEDLE_CHECKSUM_SIZE {
            return Err(NeedleError::TailTooShort);
        }

        let expected_checksum = CRC(u32::from_be_bytes([
            tail_bytes[0],
            tail_bytes[1],
            tail_bytes[2],
            tail_bytes[3],
        ]));

        if !self.data.is_empty() {
            let data_checksum = CRC::new(&self.data);
            // Go double-checks: n.Checksum != crc && uint32(n.Checksum) != crc.Value()
            // The crc.Value() path is a deprecated legacy transform for backward compat
            // with seaweed versions prior to commit 056c480eb.
            if expected_checksum != data_checksum
                && expected_checksum.0 != data_checksum.legacy_value()
            {
                return Err(NeedleError::CrcMismatch {
                    needle_id: self.id,
                    got: data_checksum.0,
                    want: expected_checksum.0,
                });
            }
            self.checksum = data_checksum;
        } else {
            self.checksum = expected_checksum;
        }

        if version == VERSION_3 {
            let ts_offset = NEEDLE_CHECKSUM_SIZE;
            if tail_bytes.len() < ts_offset + TIMESTAMP_SIZE {
                return Err(NeedleError::TailTooShort);
            }
            self.append_at_ns = u64::from_be_bytes([
                tail_bytes[ts_offset],
                tail_bytes[ts_offset + 1],
                tail_bytes[ts_offset + 2],
                tail_bytes[ts_offset + 3],
                tail_bytes[ts_offset + 4],
                tail_bytes[ts_offset + 5],
                tail_bytes[ts_offset + 6],
                tail_bytes[ts_offset + 7],
            ]);
        }

        Ok(())
    }
    // ---- Full read from bytes ----

    /// Read a complete needle from its raw bytes (header + body + tail).
    pub fn read_bytes(
        &mut self,
        bytes: &[u8],
        offset: i64,
        expected_size: Size,
        version: Version,
    ) -> Result<(), NeedleError> {
        self.read_header(bytes);

        if self.size != expected_size {
            return Err(NeedleError::SizeMismatch {
                offset,
                id: self.id,
                found: self.size,
                expected: expected_size,
            });
        }

        let body_start = NEEDLE_HEADER_SIZE;
        let body_end = body_start + self.size.0 as usize;

        if version == VERSION_1 {
            self.data = bytes[body_start..body_end].to_vec();
        } else {
            self.read_body_v2(&bytes[body_start..body_end])?;
        }

        self.read_tail(&bytes[body_end..], version)?;
        Ok(())
    }

    // ---- Write (serialize) ----

    /// Serialize the needle to bytes for writing to a .dat file (Version 2/3).
    pub fn write_bytes(&mut self, version: Version) -> Vec<u8> {
        let mut buf = Vec::with_capacity(256);

        // Compute sizes (matching Go writeNeedleCommon)
        if self.name.len() >= 255 {
            self.name_size = 255;
        } else {
            self.name_size = self.name.len() as u8;
        }
        self.data_size = self.data.len() as u32;
        self.mime_size = self.mime.len() as u8;

        // Compute n.Size (body size, excluding header)
        if self.data_size > 0 {
            let mut s: i32 = 4 + self.data_size as i32 + 1; // DataSize + Data + Flags
            if self.has_name() {
                s += 1 + self.name_size as i32;
            }
            if self.has_mime() {
                s += 1 + self.mime_size as i32;
            }
            if self.has_last_modified_date() {
                s += LAST_MODIFIED_BYTES_LENGTH as i32;
            }
            if self.has_ttl() {
                s += TTL_BYTES_LENGTH as i32;
            }
            if self.has_pairs() {
                s += 2 + self.pairs_size as i32;
            }
            self.size = Size(s);
        } else {
            self.size = Size(0);
        }

        // Header: Cookie(4) + NeedleId(8) + Size(4) = 16 bytes
        let mut header = [0u8; NEEDLE_HEADER_SIZE];
        self.cookie.to_bytes(&mut header[0..COOKIE_SIZE]);
        self.id
            .to_bytes(&mut header[COOKIE_SIZE..COOKIE_SIZE + NEEDLE_ID_SIZE]);
        self.size
            .to_bytes(&mut header[COOKIE_SIZE + NEEDLE_ID_SIZE..NEEDLE_HEADER_SIZE]);
        buf.extend_from_slice(&header);

        // Body
        if self.data_size > 0 {
            buf.extend_from_slice(&self.data_size.to_be_bytes());
            buf.extend_from_slice(&self.data);
            buf.push(self.flags);
            if self.has_name() {
                buf.push(self.name_size);
                buf.extend_from_slice(&self.name[..self.name_size as usize]);
            }
            if self.has_mime() {
                buf.push(self.mime_size);
                buf.extend_from_slice(&self.mime);
            }
            if self.has_last_modified_date() {
                // Write 5 bytes of last_modified (lower 5 bytes of u64 big-endian)
                let lm_bytes = self.last_modified.to_be_bytes();
                buf.extend_from_slice(&lm_bytes[8 - LAST_MODIFIED_BYTES_LENGTH..8]);
            }
            if self.has_ttl() {
                if let Some(ref ttl) = self.ttl {
                    let mut ttl_buf = [0u8; 2];
                    ttl.to_bytes(&mut ttl_buf);
                    buf.extend_from_slice(&ttl_buf);
                } else {
                    buf.extend_from_slice(&[0u8; 2]);
                }
            }
            if self.has_pairs() {
                buf.extend_from_slice(&self.pairs_size.to_be_bytes());
                buf.extend_from_slice(&self.pairs);
            }
        }

        // Compute checksum
        self.checksum = CRC::new(&self.data);

        // Tail: Checksum + [V3: AppendAtNs] + Padding
        buf.extend_from_slice(&self.checksum.0.to_be_bytes());
        if version == VERSION_3 {
            buf.extend_from_slice(&self.append_at_ns.to_be_bytes());
        }

        // Padding to 8-byte alignment
        let padding = padding_length(self.size, version).0 as usize;
        buf.extend(std::iter::repeat(0u8).take(padding));

        buf
    }

    /// Total disk size of this needle including header, body, checksum, timestamp, and padding.
    pub fn disk_size(&self, version: Version) -> i64 {
        get_actual_size(self.size, version)
    }

    /// Compute ETag string from checksum (matching Go).
    pub fn etag(&self) -> String {
        etag_from_checksum(self.checksum.0)
    }
}
// ============================================================================
// Helper functions (matching Go)
// ============================================================================

/// Compute padding to align needle to NEEDLE_PADDING_SIZE (8 bytes).
pub fn padding_length(needle_size: Size, version: Version) -> Size {
    if version == VERSION_3 {
        Size(
            NEEDLE_PADDING_SIZE as i32
                - ((NEEDLE_HEADER_SIZE as i32
                    + needle_size.0
                    + NEEDLE_CHECKSUM_SIZE as i32
                    + TIMESTAMP_SIZE as i32)
                    % NEEDLE_PADDING_SIZE as i32),
        )
    } else {
        Size(
            NEEDLE_PADDING_SIZE as i32
                - ((NEEDLE_HEADER_SIZE as i32 + needle_size.0 + NEEDLE_CHECKSUM_SIZE as i32)
                    % NEEDLE_PADDING_SIZE as i32),
        )
    }
}

/// Body length = Size + Checksum + [Timestamp] + Padding.
pub fn needle_body_length(needle_size: Size, version: Version) -> i64 {
    if version == VERSION_3 {
        needle_size.0 as i64
            + NEEDLE_CHECKSUM_SIZE as i64
            + TIMESTAMP_SIZE as i64
            + padding_length(needle_size, version).0 as i64
    } else {
        needle_size.0 as i64
            + NEEDLE_CHECKSUM_SIZE as i64
            + padding_length(needle_size, version).0 as i64
    }
}

/// Total actual size on disk: Header + Body.
pub fn get_actual_size(size: Size, version: Version) -> i64 {
    NEEDLE_HEADER_SIZE as i64 + needle_body_length(size, version)
}

/// Read 5 bytes as a u64 (big-endian, zero-padded high bytes).
fn bytes_to_u64_5(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 5);
    // The 5 bytes are the LOWER 5 bytes of a big-endian u64.
    // In Go: util.BytesToUint64(bytes[index : index+5]) reads into a uint64
    // Go's BytesToUint64 copies into the LAST 5 bytes of an 8-byte array (big-endian).
    let mut buf = [0u8; 8];
    buf[3..8].copy_from_slice(&bytes[..5]);
    u64::from_be_bytes(buf)
}

/// ETag formatted as Go: hex of big-endian u32 bytes.
pub fn etag_from_checksum(checksum: u32) -> String {
    let bits = checksum.to_be_bytes();
    format!(
        "{:02x}{:02x}{:02x}{:02x}",
        bits[0], bits[1], bits[2], bits[3]
    )
}
// ============================================================================
// FileId
// ============================================================================

/// FileId = VolumeId + NeedleId + Cookie.
/// String format: "<volume_id>,<needle_id_hex><cookie_hex>"
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileId {
    pub volume_id: VolumeId,
    pub key: NeedleId,
    pub cookie: Cookie,
}

impl FileId {
    pub fn new(volume_id: VolumeId, key: NeedleId, cookie: Cookie) -> Self {
        FileId {
            volume_id,
            key,
            cookie,
        }
    }

    /// Parse "volume_id,needle_id_cookie" or "volume_id/needle_id_cookie".
    pub fn parse(s: &str) -> Result<Self, String> {
        let (vid_str, rest) = if let Some(pos) = s.find(',') {
            (&s[..pos], &s[pos + 1..])
        } else if let Some(pos) = s.find('/') {
            (&s[..pos], &s[pos + 1..])
        } else {
            return Err(format!("invalid file id: {}", s));
        };

        let volume_id =
            VolumeId::parse(vid_str).map_err(|e| format!("invalid volume id: {}", e))?;
        let (key, cookie) = parse_needle_id_cookie(rest)?;
        Ok(FileId {
            volume_id,
            key,
            cookie,
        })
    }

    /// Format the needle_id + cookie part as a hex string (stripping leading zeros).
    pub fn needle_id_cookie_string(&self) -> String {
        format_needle_id_cookie(self.key, self.cookie)
    }
}

impl std::fmt::Display for FileId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{},{}", self.volume_id, self.needle_id_cookie_string())
    }
}

/// Format NeedleId + Cookie as hex, stripping leading zero bytes from NeedleId only.
/// Matches Go: strips leading zero bytes up to NeedleIdSize (8), so cookie is always present.
fn format_needle_id_cookie(key: NeedleId, cookie: Cookie) -> String {
    // Encode 12 bytes: 8 for NeedleId + 4 for Cookie
    let mut bytes = [0u8; 12];
    key.to_bytes(&mut bytes[0..8]);
    cookie.to_bytes(&mut bytes[8..12]);

    // Strip leading zero bytes, but only within NeedleId portion (first 8 bytes)
    let mut nonzero_index = 0;
    while nonzero_index < NEEDLE_ID_SIZE && bytes[nonzero_index] == 0 {
        nonzero_index += 1;
    }
    hex::encode(&bytes[nonzero_index..])
}

/// Parse "needle_id_cookie_hex" or "needle_id_cookie_hex_delta" into (NeedleId, Cookie).
/// Matches Go's ParsePath + ParseNeedleIdCookie: supports an optional `_delta` suffix
/// where delta is a decimal number added to the NeedleId (used for sub-file addressing).
/// Rejects strings that are too short or too long.
pub fn parse_needle_id_cookie(s: &str) -> Result<(NeedleId, Cookie), String> {
    // Go ParsePath: check for "_" suffix containing a decimal delta
    let (hex_part, delta) = if let Some(underscore_pos) = s.rfind('_') {
        if underscore_pos > 0 {
            let delta_str = &s[underscore_pos + 1..];
let d: u64 = delta_str
|
||||
.parse()
|
||||
.map_err(|e| format!("Parse delta error: {}", e))?;
|
||||
(&s[..underscore_pos], Some(d))
|
||||
} else {
|
||||
(s, None)
|
||||
}
|
||||
} else {
|
||||
(s, None)
|
||||
};
|
||||
|
||||
// Go: len(key_hash_string) <= CookieSize*2 => error (must be > 8 hex chars)
|
||||
if hex_part.len() <= COOKIE_SIZE * 2 {
|
||||
return Err("KeyHash is too short.".to_string());
|
||||
}
|
||||
// Go: len(key_hash_string) > (NeedleIdSize+CookieSize)*2 => error (must be <= 24 hex chars)
|
||||
if hex_part.len() > (NEEDLE_ID_SIZE + COOKIE_SIZE) * 2 {
|
||||
return Err("KeyHash is too long.".to_string());
|
||||
}
|
||||
|
||||
// Split: last CookieSize*2 hex chars are cookie, rest is needle id
|
||||
let split = hex_part.len() - COOKIE_SIZE * 2;
|
||||
let needle_id_hex = &hex_part[..split];
|
||||
let cookie_hex = &hex_part[split..];
|
||||
|
||||
let needle_id_bytes = hex::decode(needle_id_hex).map_err(|e| format!("Parse needleId error: {}", e))?;
|
||||
let cookie_bytes = hex::decode(cookie_hex).map_err(|e| format!("Parse cookie error: {}", e))?;
|
||||
|
||||
// Pad needle id to 8 bytes
|
||||
let mut nid_buf = [0u8; 8];
|
||||
    if needle_id_bytes.len() > 8 {
        return Err("KeyHash is too long.".to_string());
    }
    let start = 8 - needle_id_bytes.len();
    nid_buf[start..].copy_from_slice(&needle_id_bytes);

    let mut key = NeedleId::from_bytes(&nid_buf[0..8]);
    let cookie = Cookie::from_bytes(&cookie_bytes[0..4]);

    // Apply the delta if present (Go: n.Id += Uint64ToNeedleId(d)).
    if let Some(d) = delta {
        key = NeedleId(key.0.wrapping_add(d));
    }

    Ok((key, cookie))
}

// ============================================================================
// Errors
// ============================================================================

#[derive(Debug, thiserror::Error)]
pub enum NeedleError {
    #[error("size mismatch at offset {offset}: found id={id} size={found:?}, expected size={expected:?}")]
    SizeMismatch {
        offset: i64,
        id: NeedleId,
        found: Size,
        expected: Size,
    },

    #[error("CRC mismatch for needle {needle_id}: got {got:08x}, want {want:08x}")]
    CrcMismatch {
        needle_id: NeedleId,
        got: u32,
        want: u32,
    },

    #[error("index out of range ({0})")]
    IndexOutOfRange(u32),

    #[error("needle tail too short")]
    TailTooShort,

    #[error("unsupported version: {0}")]
    UnsupportedVersion(u8),

    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}

// ============================================================================
// Tests
// ============================================================================

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_parse_header() {
        let mut buf = [0u8; NEEDLE_HEADER_SIZE];
        let cookie = Cookie(0xdeadbeef);
        let id = NeedleId(0x123456789abcdef0);
        let size = Size(1024);
        cookie.to_bytes(&mut buf[0..4]);
        id.to_bytes(&mut buf[4..12]);
        size.to_bytes(&mut buf[12..16]);

        let (c, i, s) = Needle::parse_header(&buf);
        assert_eq!(c, cookie);
        assert_eq!(i, id);
        assert_eq!(s, size);
    }

    #[test]
    fn test_needle_write_read_round_trip_v3() {
        let mut n = Needle::default();
        n.cookie = Cookie(42);
        n.id = NeedleId(100);
        n.data = b"hello world".to_vec();
        n.flags = 0;
        n.set_has_name();
        n.name = b"test.txt".to_vec();
        n.name_size = 8;
        n.set_has_mime();
        n.mime = b"text/plain".to_vec();
        n.mime_size = 10;
        n.set_has_last_modified_date();
        n.last_modified = 1234567890;
        n.set_has_ttl();
        n.ttl = Some(TTL {
            count: 5,
            unit: super::super::ttl::TTL_UNIT_DAY,
        });
        n.append_at_ns = 999999999;

        let bytes = n.write_bytes(VERSION_3);

        // Verify the total size matches the expected actual size.
        let expected_size = get_actual_size(n.size, VERSION_3);
        assert_eq!(bytes.len() as i64, expected_size);

        // Read it back.
        let mut n2 = Needle::default();
        n2.read_bytes(&bytes, 0, n.size, VERSION_3).unwrap();

        assert_eq!(n2.cookie, n.cookie);
        assert_eq!(n2.id, n.id);
        assert_eq!(n2.data, n.data);
        assert_eq!(n2.name, n.name);
        assert_eq!(n2.mime, n.mime);
        assert_eq!(n2.last_modified, n.last_modified);
        assert_eq!(n2.ttl, n.ttl);
        assert_eq!(n2.checksum, n.checksum);
        assert_eq!(n2.append_at_ns, n.append_at_ns);
    }

    #[test]
    fn test_needle_write_read_round_trip_v2() {
        let mut n = Needle::default();
        n.cookie = Cookie(77);
        n.id = NeedleId(200);
        n.data = b"data v2".to_vec();
        n.flags = 0;

        let bytes = n.write_bytes(VERSION_2);
        let expected_size = get_actual_size(n.size, VERSION_2);
        assert_eq!(bytes.len() as i64, expected_size);

        let mut n2 = Needle::default();
        n2.read_bytes(&bytes, 0, n.size, VERSION_2).unwrap();

        assert_eq!(n2.data, n.data);
        assert_eq!(n2.checksum, n.checksum);
    }

    #[test]
    fn test_read_bytes_meta_only_handles_tombstone_v3() {
        let mut tombstone = Needle::default();
        tombstone.cookie = Cookie(0x1234abcd);
        tombstone.id = NeedleId(300);
        tombstone.append_at_ns = 999_999;

        let bytes = tombstone.write_bytes(VERSION_3);

        let mut meta = Needle::default();
        meta.read_bytes_meta_only(&bytes, 0, Size(0), VERSION_3)
            .unwrap();

        assert_eq!(meta.cookie, tombstone.cookie);
        assert_eq!(meta.id, tombstone.id);
        assert_eq!(meta.size, Size(0));
        assert_eq!(meta.data_size, 0);
        assert_eq!(meta.append_at_ns, tombstone.append_at_ns);
        assert_eq!(meta.checksum, tombstone.checksum);
    }

    #[test]
    fn test_padding_alignment() {
        // All actual sizes should be multiples of 8.
        for size_val in 0..50 {
            let s = Size(size_val);
            let actual_v2 = get_actual_size(s, VERSION_2);
            let actual_v3 = get_actual_size(s, VERSION_3);
            assert_eq!(actual_v2 % 8, 0, "V2 size {} not aligned", size_val);
            assert_eq!(actual_v3 % 8, 0, "V3 size {} not aligned", size_val);
        }
    }

    #[test]
    fn test_file_id_parse() {
        let fid = FileId::parse("3,01637037d6").unwrap();
        assert_eq!(fid.volume_id, VolumeId(3));
        // "01637037d6" is 10 hex chars: the last 8 are the cookie, the rest the needle id.
        assert_eq!(fid.key, NeedleId(0x01));
        assert_eq!(fid.cookie, Cookie(0x637037d6));
    }

    #[test]
    fn test_file_id_round_trip() {
        let fid = FileId::new(VolumeId(5), NeedleId(0x123456), Cookie(0xabcd));
        let s = fid.to_string();
        let fid2 = FileId::parse(&s).unwrap();
        assert_eq!(fid, fid2);
    }

    #[test]
    fn test_needle_id_cookie_format() {
        let s = format_needle_id_cookie(NeedleId(1), Cookie(0x12345678));
        let (key, cookie) = parse_needle_id_cookie(&s).unwrap();
        assert_eq!(key, NeedleId(1));
        assert_eq!(cookie, Cookie(0x12345678));
    }
}
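The formatting rule used by `format_needle_id_cookie` (strip leading zero bytes of the 8-byte needle id, always keep the 4-byte cookie) can be shown standalone; `format_fid` below is a hypothetical stand-in using plain integers instead of the crate's NeedleId/Cookie newtypes:

```rust
// Sketch of the needle-id/cookie hex formatting rule: leading zero bytes are
// stripped only from the 8-byte needle-id prefix, so the 4-byte cookie always
// contributes its full 8 hex chars.
fn format_fid(key: u64, cookie: u32) -> String {
    let mut bytes = [0u8; 12];
    bytes[0..8].copy_from_slice(&key.to_be_bytes());
    bytes[8..12].copy_from_slice(&cookie.to_be_bytes());
    let start = bytes[..8].iter().position(|&b| b != 0).unwrap_or(8);
    bytes[start..].iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    // NeedleId(1) keeps a single "01" byte before the cookie.
    assert_eq!(format_fid(1, 0x637037d6), "01637037d6");
    // A zero needle id collapses to just the cookie.
    assert_eq!(format_fid(0, 0xdeadbeef), "deadbeef");
}
```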

seaweed-volume/src/storage/needle/ttl.rs (new file, 302 lines)

//! Time-to-live encoding for needles.
//!
//! TTL is stored as 2 bytes: Count(1) + Unit(1).
//! Supported units: minute(m), hour(h), day(d), week(w), month(M), year(y).

use std::fmt;

/// TTL unit constants (matching Go).
pub const TTL_UNIT_EMPTY: u8 = 0;
pub const TTL_UNIT_MINUTE: u8 = 1;
pub const TTL_UNIT_HOUR: u8 = 2;
pub const TTL_UNIT_DAY: u8 = 3;
pub const TTL_UNIT_WEEK: u8 = 4;
pub const TTL_UNIT_MONTH: u8 = 5;
pub const TTL_UNIT_YEAR: u8 = 6;

pub const TTL_BYTES_LENGTH: usize = 2;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct TTL {
    pub count: u8,
    pub unit: u8,
}

impl TTL {
    pub const EMPTY: TTL = TTL { count: 0, unit: 0 };

    pub fn is_empty(&self) -> bool {
        self.count == 0 && self.unit == 0
    }

    /// Load from 2 bytes.
    pub fn from_bytes(input: &[u8]) -> Self {
        if input.len() < TTL_BYTES_LENGTH {
            return TTL::EMPTY;
        }
        TTL {
            count: input[0],
            unit: input[1],
        }
    }

    /// Serialize to 2 bytes.
    pub fn to_bytes(&self, output: &mut [u8]) {
        assert!(output.len() >= TTL_BYTES_LENGTH);
        output[0] = self.count;
        output[1] = self.unit;
    }

    /// Encode as u32: (count << 8) | unit.
    pub fn to_u32(&self) -> u32 {
        if self.count == 0 {
            return 0;
        }
        ((self.count as u32) << 8) + (self.unit as u32)
    }

    /// Decode from u32.
    pub fn from_u32(v: u32) -> Self {
        if v == 0 {
            return TTL::EMPTY;
        }
        TTL {
            count: (v >> 8) as u8,
            unit: (v & 0xFF) as u8,
        }
    }

    /// Convert to total seconds.
    pub fn to_seconds(&self) -> u64 {
        unit_to_seconds(self.count as u64, self.unit)
    }

    /// Parse from a string like "3m", "4h", "5d", "6w", "7M", "8y".
    /// If the string is all digits (no unit suffix), it defaults to minutes.
    /// Matches Go's ReadTTL, which calls fitTtlCount to normalize:
    /// e.g. "120m" -> 2h, "7d" -> 1w, "24h" -> 1d.
    pub fn read(s: &str) -> Result<Self, String> {
        let s = s.trim();
        if s.is_empty() {
            return Ok(TTL::EMPTY);
        }
        let last_byte = s.as_bytes()[s.len() - 1];
        let (num_str, unit_byte) = if last_byte.is_ascii_digit() {
            // All digits: default to minutes (matching Go).
            (s, b'm')
        } else {
            (&s[..s.len() - 1], last_byte)
        };
        let count: u32 = num_str
            .parse()
            .map_err(|e| format!("invalid TTL count: {}", e))?;
        let unit = match unit_byte {
            b'm' => TTL_UNIT_MINUTE,
            b'h' => TTL_UNIT_HOUR,
            b'd' => TTL_UNIT_DAY,
            b'w' => TTL_UNIT_WEEK,
            b'M' => TTL_UNIT_MONTH,
            b'y' => TTL_UNIT_YEAR,
            _ => return Err(format!("unknown TTL unit: {}", unit_byte as char)),
        };
        // Match Go's ReadTTL: normalize via fitTtlCount.
        Ok(fit_ttl_count(count, unit))
    }

    /// Minutes representation.
    pub fn minutes(&self) -> u32 {
        (self.to_seconds() / 60) as u32
    }
}

fn unit_to_seconds(count: u64, unit: u8) -> u64 {
    match unit {
        TTL_UNIT_EMPTY => 0,
        TTL_UNIT_MINUTE => count * 60,
        TTL_UNIT_HOUR => count * 60 * 60,
        TTL_UNIT_DAY => count * 60 * 60 * 24,
        TTL_UNIT_WEEK => count * 60 * 60 * 24 * 7,
        TTL_UNIT_MONTH => count * 60 * 60 * 24 * 30,
        TTL_UNIT_YEAR => count * 60 * 60 * 24 * 365,
        _ => 0,
    }
}

/// Fit a count+unit into a TTL whose count fits in a single byte.
/// Converts to seconds first, then finds the coarsest unit that fits.
/// Matches Go's fitTtlCount as called from ReadTTL.
fn fit_ttl_count(count: u32, unit: u8) -> TTL {
    if count == 0 || unit == TTL_UNIT_EMPTY {
        return TTL::EMPTY;
    }

    // Always convert to seconds and normalize (matches Go).
    let seconds = unit_to_seconds(count as u64, unit);
    if seconds == 0 {
        return TTL::EMPTY;
    }

    const YEAR_SECS: u64 = 3600 * 24 * 365;
    const MONTH_SECS: u64 = 3600 * 24 * 30;
    const WEEK_SECS: u64 = 3600 * 24 * 7;
    const DAY_SECS: u64 = 3600 * 24;
    const HOUR_SECS: u64 = 3600;
    const MINUTE_SECS: u64 = 60;

    // First pass: try exact fits from the largest unit to the smallest.
    if seconds % YEAR_SECS == 0 && seconds / YEAR_SECS < 256 {
        return TTL { count: (seconds / YEAR_SECS) as u8, unit: TTL_UNIT_YEAR };
    }
    if seconds % MONTH_SECS == 0 && seconds / MONTH_SECS < 256 {
        return TTL { count: (seconds / MONTH_SECS) as u8, unit: TTL_UNIT_MONTH };
    }
    if seconds % WEEK_SECS == 0 && seconds / WEEK_SECS < 256 {
        return TTL { count: (seconds / WEEK_SECS) as u8, unit: TTL_UNIT_WEEK };
    }
    if seconds % DAY_SECS == 0 && seconds / DAY_SECS < 256 {
        return TTL { count: (seconds / DAY_SECS) as u8, unit: TTL_UNIT_DAY };
    }
    if seconds % HOUR_SECS == 0 && seconds / HOUR_SECS < 256 {
        return TTL { count: (seconds / HOUR_SECS) as u8, unit: TTL_UNIT_HOUR };
    }
    // Minutes: truncating division.
    if seconds / MINUTE_SECS < 256 {
        return TTL { count: (seconds / MINUTE_SECS) as u8, unit: TTL_UNIT_MINUTE };
    }
    // Second pass: truncating division from the smallest unit to the largest.
    if seconds / HOUR_SECS < 256 {
        return TTL { count: (seconds / HOUR_SECS) as u8, unit: TTL_UNIT_HOUR };
    }
    if seconds / DAY_SECS < 256 {
        return TTL { count: (seconds / DAY_SECS) as u8, unit: TTL_UNIT_DAY };
    }
    if seconds / WEEK_SECS < 256 {
        return TTL { count: (seconds / WEEK_SECS) as u8, unit: TTL_UNIT_WEEK };
    }
    if seconds / MONTH_SECS < 256 {
        return TTL { count: (seconds / MONTH_SECS) as u8, unit: TTL_UNIT_MONTH };
    }
    if seconds / YEAR_SECS < 256 {
        return TTL { count: (seconds / YEAR_SECS) as u8, unit: TTL_UNIT_YEAR };
    }
    TTL::EMPTY
}

fn unit_to_char(unit: u8) -> char {
    match unit {
        TTL_UNIT_MINUTE => 'm',
        TTL_UNIT_HOUR => 'h',
        TTL_UNIT_DAY => 'd',
        TTL_UNIT_WEEK => 'w',
        TTL_UNIT_MONTH => 'M',
        TTL_UNIT_YEAR => 'y',
        _ => ' ',
    }
}

impl fmt::Display for TTL {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.count == 0 || self.unit == TTL_UNIT_EMPTY {
            return Ok(());
        }
        write!(f, "{}{}", self.count, unit_to_char(self.unit))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_ttl_parse() {
        let ttl = TTL::read("3m").unwrap();
        assert_eq!(
            ttl,
            TTL {
                count: 3,
                unit: TTL_UNIT_MINUTE
            }
        );
        assert_eq!(ttl.to_seconds(), 180);
    }

    #[test]
    fn test_ttl_parse_hours() {
        // 24h normalizes to 1d via fitTtlCount.
        let ttl = TTL::read("24h").unwrap();
        assert_eq!(ttl.to_seconds(), 86400);
        assert_eq!(ttl, TTL { count: 1, unit: TTL_UNIT_DAY });
    }

    #[test]
    fn test_ttl_display() {
        let ttl = TTL {
            count: 5,
            unit: TTL_UNIT_DAY,
        };
        assert_eq!(ttl.to_string(), "5d");
    }

    #[test]
    fn test_ttl_bytes_round_trip() {
        let ttl = TTL {
            count: 10,
            unit: TTL_UNIT_WEEK,
        };
        let mut buf = [0u8; 2];
        ttl.to_bytes(&mut buf);
        let ttl2 = TTL::from_bytes(&buf);
        assert_eq!(ttl, ttl2);
    }

    #[test]
    fn test_ttl_u32_round_trip() {
        let ttl = TTL {
            count: 42,
            unit: TTL_UNIT_HOUR,
        };
        let v = ttl.to_u32();
        let ttl2 = TTL::from_u32(v);
        assert_eq!(ttl, ttl2);
    }

    #[test]
    fn test_ttl_empty() {
        assert!(TTL::EMPTY.is_empty());
        assert_eq!(TTL::EMPTY.to_seconds(), 0);
        assert_eq!(TTL::EMPTY.to_u32(), 0);
    }

    #[test]
    fn test_ttl_overflow_normalizes() {
        // Go's ReadTTL calls fitTtlCount: 300m = 18000s = 5h (exact fit).
        let ttl = TTL::read("300m").unwrap();
        assert_eq!(ttl, TTL { count: 5, unit: TTL_UNIT_HOUR });

        // 256h = 921600s: doesn't fit in hours (256 is not < 256) and has no
        // exact fit in days. Second pass: 921600 / 86400 = 10 (truncated) < 256 -> 10d.
        let ttl = TTL::read("256h").unwrap();
        assert_eq!(ttl, TTL { count: 10, unit: TTL_UNIT_DAY });
    }

    #[test]
    fn test_ttl_normalizes_unit() {
        // Go's ReadTTL calls fitTtlCount, which normalizes to the coarsest unit:
        // 120m -> 2h, 7d -> 1w, 24h -> 1d.
        let ttl = TTL::read("120m").unwrap();
        assert_eq!(ttl, TTL { count: 2, unit: TTL_UNIT_HOUR });

        let ttl = TTL::read("7d").unwrap();
        assert_eq!(ttl, TTL { count: 1, unit: TTL_UNIT_WEEK });

        let ttl = TTL::read("24h").unwrap();
        assert_eq!(ttl, TTL { count: 1, unit: TTL_UNIT_DAY });

        // Values that don't simplify stay as-is.
        let ttl = TTL::read("5d").unwrap();
        assert_eq!(ttl, TTL { count: 5, unit: TTL_UNIT_DAY });

        let ttl = TTL::read("3m").unwrap();
        assert_eq!(ttl, TTL { count: 3, unit: TTL_UNIT_MINUTE });
    }
}
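The two-pass fit algorithm can be sketched standalone over plain seconds; `fit_ttl` below is a hypothetical simplification (returning a count plus a unit character rather than the crate's TTL struct) of the same exact-then-truncating strategy:

```rust
// Two-pass TTL normalization sketch: pass 1 tries exact fits from the largest
// unit down (minutes allow truncation); pass 2 falls back to a truncating fit
// from the smallest unit up. Returns (count, unit_char).
fn fit_ttl(seconds: u64) -> (u8, char) {
    const UNITS: [(u64, char); 6] = [
        (3600 * 24 * 365, 'y'),
        (3600 * 24 * 30, 'M'),
        (3600 * 24 * 7, 'w'),
        (3600 * 24, 'd'),
        (3600, 'h'),
        (60, 'm'),
    ];
    for (i, &(secs, ch)) in UNITS.iter().enumerate() {
        let exact = seconds % secs == 0 || i == UNITS.len() - 1;
        if exact && seconds / secs < 256 {
            return ((seconds / secs) as u8, ch);
        }
    }
    for &(secs, ch) in UNITS.iter().rev() {
        if seconds / secs < 256 {
            return ((seconds / secs) as u8, ch);
        }
    }
    (0, ' ')
}

fn main() {
    assert_eq!(fit_ttl(120 * 60), (2, 'h'));      // 120m -> 2h (exact)
    assert_eq!(fit_ttl(7 * 24 * 3600), (1, 'w')); // 7d -> 1w (exact)
    assert_eq!(fit_ttl(256 * 3600), (10, 'd'));   // 256h truncates to 10d
}
```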

seaweed-volume/src/storage/needle_map.rs (new file, 1438 lines; diff suppressed as too large)

seaweed-volume/src/storage/needle_map/compact_map.rs (new file, 375 lines)

//! CompactMap: memory-efficient in-memory map of NeedleId -> (Offset, Size).
//!
//! Port of Go's CompactMap from weed/storage/needle_map/compact_map.go.
//! Uses segmented sorted arrays with compressed keys (u16 instead of u64)
//! to achieve ~10 bytes per entry instead of ~40-48 bytes with a HashMap.
//!
//! A NeedleId is split into: chunk = id / SEGMENT_CHUNK_SIZE and
//! compact_key = id % SEGMENT_CHUNK_SIZE. Each segment stores up to
//! SEGMENT_CHUNK_SIZE entries in a sorted Vec, searched via binary search.
//! Best case (ordered inserts): O(1). Worst case: O(log n) per segment.

use std::collections::HashMap;

use super::NeedleValue;
use crate::storage::types::*;

/// Maximum entries per segment. Must be <= u16::MAX (65535).
const SEGMENT_CHUNK_SIZE: u64 = 50_000;

/// Compact key: only the low bits of a NeedleId within a segment.
type CompactKey = u16;

/// Segment chunk identifier: NeedleId / SEGMENT_CHUNK_SIZE.
type Chunk = u64;

/// Compact entry: 10 bytes (2 + 4 + 4) vs 16 bytes for a full NeedleId + NeedleValue.
#[derive(Clone, Copy)]
struct CompactEntry {
    key: CompactKey,           // 2 bytes
    offset: [u8; OFFSET_SIZE], // 4 bytes
    size: Size,                // 4 bytes
}

impl CompactEntry {
    fn to_needle_value(&self) -> NeedleValue {
        NeedleValue {
            offset: Offset::from_bytes(&self.offset),
            size: self.size,
        }
    }
}

/// A sorted segment of compact entries for a given chunk.
struct Segment {
    list: Vec<CompactEntry>,
    chunk: Chunk,
    first_key: CompactKey,
    last_key: CompactKey,
}

impl Segment {
    fn new(chunk: Chunk) -> Self {
        Segment {
            list: Vec::new(),
            chunk,
            first_key: u16::MAX,
            last_key: 0,
        }
    }

    fn compact_key(&self, id: NeedleId) -> CompactKey {
        (id.0 - SEGMENT_CHUNK_SIZE * self.chunk) as CompactKey
    }

    /// Binary search for a compact key. Returns (index, found).
    /// If not found, index is the insertion point.
    fn bsearch(&self, id: NeedleId) -> (usize, bool) {
        let ck = self.compact_key(id);

        if self.list.is_empty() {
            return (0, false);
        }
        if ck == self.first_key {
            return (0, true);
        }
        if ck < self.first_key {
            return (0, false);
        }
        if ck == self.last_key {
            return (self.list.len() - 1, true);
        }
        if ck > self.last_key {
            return (self.list.len(), false);
        }

        let i = self.list.partition_point(|e| e.key < ck);
        if i < self.list.len() && self.list[i].key == ck {
            (i, true)
        } else {
            (i, false)
        }
    }

    /// Insert or update. Returns the old NeedleValue if updating.
    fn set(&mut self, id: NeedleId, offset: Offset, size: Size) -> Option<NeedleValue> {
        let (i, found) = self.bsearch(id);

        if found {
            let old = self.list[i].to_needle_value();
            let mut offset_bytes = [0u8; OFFSET_SIZE];
            offset.to_bytes(&mut offset_bytes);
            self.list[i].offset = offset_bytes;
            self.list[i].size = size;
            return Some(old);
        }

        // Insert at the sorted position.
        let ck = self.compact_key(id);
        let mut offset_bytes = [0u8; OFFSET_SIZE];
        offset.to_bytes(&mut offset_bytes);

        let entry = CompactEntry {
            key: ck,
            offset: offset_bytes,
            size,
        };

        // Match the Go panic: don't exceed segment capacity.
        if self.list.len() >= SEGMENT_CHUNK_SIZE as usize {
            panic!(
                "attempted to write more than {} entries on CompactMapSegment",
                SEGMENT_CHUNK_SIZE
            );
        }

        if self.list.len() == SEGMENT_CHUNK_SIZE as usize - 1 {
            // Pin capacity to the exact size when maxing out.
            let mut new_list = Vec::with_capacity(SEGMENT_CHUNK_SIZE as usize);
            new_list.extend_from_slice(&self.list[..i]);
            new_list.push(entry);
            new_list.extend_from_slice(&self.list[i..]);
            self.list = new_list;
        } else {
            self.list.insert(i, entry);
        }

        if ck < self.first_key {
            self.first_key = ck;
        }
        if ck > self.last_key {
            self.last_key = ck;
        }

        None
    }

    fn get(&self, id: NeedleId) -> Option<NeedleValue> {
        let (i, found) = self.bsearch(id);
        if found {
            Some(self.list[i].to_needle_value())
        } else {
            None
        }
    }

    /// Mark as deleted by negating the size. Returns the previous size if not already deleted.
    /// Matches Go behavior: checks !IsDeleted() (i.e., size >= 0).
    fn delete(&mut self, id: NeedleId) -> Option<Size> {
        let (i, found) = self.bsearch(id);
        if found && !self.list[i].size.is_deleted() {
            let old_size = self.list[i].size;
            if self.list[i].size.0 == 0 {
                self.list[i].size = TOMBSTONE_FILE_SIZE;
            } else {
                self.list[i].size = Size(-self.list[i].size.0);
            }
            Some(old_size)
        } else {
            None
        }
    }
}

/// Memory-efficient map of NeedleId -> (Offset, Size).
/// Segments NeedleIds into chunks of 50,000 and stores compact 10-byte entries
/// in sorted arrays, using only 2 bytes for the key within each segment.
pub struct CompactMap {
    segments: HashMap<Chunk, Segment>,
}

impl CompactMap {
    pub fn new() -> Self {
        CompactMap {
            segments: HashMap::new(),
        }
    }

    fn segment_for_key(&mut self, id: NeedleId) -> &mut Segment {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        self.segments
            .entry(chunk)
            .or_insert_with(|| Segment::new(chunk))
    }

    /// Insert or update. Returns the old NeedleValue if updating.
    pub fn set(&mut self, id: NeedleId, offset: Offset, size: Size) -> Option<NeedleValue> {
        self.segment_for_key(id).set(id, offset, size)
    }

    pub fn get(&self, id: NeedleId) -> Option<NeedleValue> {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        self.segments.get(&chunk)?.get(id)
    }

    /// Mark as deleted. Returns the previous size if the entry was valid.
    pub fn delete(&mut self, id: NeedleId) -> Option<Size> {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        self.segments.get_mut(&chunk)?.delete(id)
    }

    /// Remove an entry entirely (used during idx loading).
    pub fn remove(&mut self, id: NeedleId) -> Option<NeedleValue> {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        let segment = self.segments.get_mut(&chunk)?;
        let (i, found) = segment.bsearch(id);
        if found {
            let entry = segment.list.remove(i);
            // Update the first/last keys.
            if segment.list.is_empty() {
                segment.first_key = u16::MAX;
                segment.last_key = 0;
            } else {
                segment.first_key = segment.list[0].key;
                segment.last_key = segment.list[segment.list.len() - 1].key;
            }
            Some(entry.to_needle_value())
        } else {
            None
        }
    }

    /// Iterate all entries in ascending NeedleId order.
    pub fn ascending_visit<F, E>(&self, mut f: F) -> Result<(), E>
    where
        F: FnMut(NeedleId, &NeedleValue) -> Result<(), E>,
    {
        let mut chunks: Vec<Chunk> = self.segments.keys().copied().collect();
        chunks.sort_unstable();

        for chunk in chunks {
            let segment = &self.segments[&chunk];
            for entry in &segment.list {
                let id = NeedleId(SEGMENT_CHUNK_SIZE * segment.chunk + entry.key as u64);
                let nv = entry.to_needle_value();
                f(id, &nv)?;
            }
        }
        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    fn offset(v: u32) -> Offset {
        Offset::from_actual_offset(v as i64 * NEEDLE_PADDING_SIZE as i64)
    }

    #[test]
    fn test_compact_map_basic() {
        let mut m = CompactMap::new();

        // Insert.
        assert!(m.set(NeedleId(1), offset(100), Size(50)).is_none());
        assert!(m.set(NeedleId(2), offset(200), Size(60)).is_none());

        // Get.
        let nv = m.get(NeedleId(1)).unwrap();
        assert_eq!(nv.size, Size(50));

        // Update returns the old value.
        let old = m.set(NeedleId(1), offset(300), Size(70)).unwrap();
        assert_eq!(old.size, Size(50));

        // Get the updated value.
        let nv = m.get(NeedleId(1)).unwrap();
        assert_eq!(nv.size, Size(70));

        // Miss.
        assert!(m.get(NeedleId(999)).is_none());
    }

    #[test]
    fn test_compact_map_delete() {
        let mut m = CompactMap::new();
        m.set(NeedleId(1), offset(100), Size(50));

        // Delete returns the old size.
        let old = m.delete(NeedleId(1)).unwrap();
        assert_eq!(old, Size(50));

        // Get returns the deleted (negative) size.
        let nv = m.get(NeedleId(1)).unwrap();
        assert!(nv.size.is_deleted());

        // Deleting again returns None (already deleted).
        assert!(m.delete(NeedleId(1)).is_none());
    }

    #[test]
    fn test_compact_map_zero_size_delete() {
        let mut m = CompactMap::new();
        m.set(NeedleId(1), offset(100), Size(0));

        let old = m.delete(NeedleId(1)).unwrap();
        assert_eq!(old, Size(0));

        let nv = m.get(NeedleId(1)).unwrap();
        assert_eq!(nv.size, TOMBSTONE_FILE_SIZE);
    }

    #[test]
    fn test_compact_map_cross_segment() {
        let mut m = CompactMap::new();

        // Insert across multiple segments.
        m.set(NeedleId(1), offset(1), Size(1));
        m.set(NeedleId(50_000), offset(2), Size(2));
        m.set(NeedleId(100_000), offset(3), Size(3));

        assert_eq!(m.get(NeedleId(1)).unwrap().size, Size(1));
        assert_eq!(m.get(NeedleId(50_000)).unwrap().size, Size(2));
        assert_eq!(m.get(NeedleId(100_000)).unwrap().size, Size(3));
    }

    #[test]
    fn test_compact_map_ascending_visit() {
        let mut m = CompactMap::new();
        m.set(NeedleId(100_005), offset(3), Size(3));
        m.set(NeedleId(5), offset(1), Size(1));
        m.set(NeedleId(50_005), offset(2), Size(2));

        let mut visited = Vec::new();
        m.ascending_visit(|id, nv| {
            visited.push((id, nv.size));
            Ok::<_, String>(())
        })
        .unwrap();

        assert_eq!(visited.len(), 3);
        assert_eq!(visited[0].0, NeedleId(5));
        assert_eq!(visited[1].0, NeedleId(50_005));
        assert_eq!(visited[2].0, NeedleId(100_005));
    }

    #[test]
    fn test_compact_map_remove() {
        let mut m = CompactMap::new();
        m.set(NeedleId(1), offset(100), Size(50));
        m.set(NeedleId(2), offset(200), Size(60));

        let removed = m.remove(NeedleId(1)).unwrap();
        assert_eq!(removed.size, Size(50));

        assert!(m.get(NeedleId(1)).is_none());
        assert_eq!(m.get(NeedleId(2)).unwrap().size, Size(60));
    }

    #[test]
    fn test_compact_map_reverse_insert_order() {
        let mut m = CompactMap::new();
        // Insert in reverse order to exercise the sorted insert path.
        for i in (0..100).rev() {
            m.set(NeedleId(i), offset(i as u32), Size(i as i32));
        }
        for i in 0..100 {
            assert_eq!(m.get(NeedleId(i)).unwrap().size, Size(i as i32));
        }
    }
}
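The segmented layout above can be distilled into a minimal standalone sketch. `TinyCompactMap` below is a hypothetical toy (plain u64 ids and u32 values instead of the crate's newtypes) that keeps only the core idea: a HashMap keyed by chunk, with a sorted Vec of (u16 compact key, value) pairs per segment:

```rust
use std::collections::HashMap;

// Toy segmented map: chunk = id / CHUNK keys the outer HashMap, and the
// 16-bit remainder is stored in a per-segment sorted Vec.
const CHUNK: u64 = 50_000;

#[derive(Default)]
struct TinyCompactMap {
    segments: HashMap<u64, Vec<(u16, u32)>>, // chunk -> sorted (compact_key, value)
}

impl TinyCompactMap {
    fn set(&mut self, id: u64, value: u32) {
        let seg = self.segments.entry(id / CHUNK).or_default();
        let ck = (id % CHUNK) as u16;
        match seg.binary_search_by_key(&ck, |e| e.0) {
            Ok(i) => seg[i].1 = value,          // update in place
            Err(i) => seg.insert(i, (ck, value)), // sorted insert
        }
    }

    fn get(&self, id: u64) -> Option<u32> {
        let seg = self.segments.get(&(id / CHUNK))?;
        let ck = (id % CHUNK) as u16;
        seg.binary_search_by_key(&ck, |e| e.0).ok().map(|i| seg[i].1)
    }
}

fn main() {
    let mut m = TinyCompactMap::default();
    m.set(5, 1);
    m.set(50_005, 2); // lands in a different segment (chunk 1)
    assert_eq!(m.get(5), Some(1));
    assert_eq!(m.get(50_005), Some(2));
    assert_eq!(m.get(99), None);
}
```

The real CompactMap adds the first/last-key fast paths, deletion via negated sizes, and the 10-byte packed entry; the chunk/compact-key arithmetic is the same.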

seaweed-volume/src/storage/store.rs (new file, 1297 lines; diff suppressed as too large)

seaweed-volume/src/storage/super_block.rs (new file, 289 lines)

//! SuperBlock: the 8-byte (+ optional extra) header at the start of every .dat file.
//!
//! Byte layout:
//!   [0]    Version
//!   [1]    ReplicaPlacement byte
//!   [2..4] TTL (2 bytes)
//!   [4..6] CompactionRevision (u16 big-endian)
//!   [6..8] ExtraSize (u16 big-endian)
//!   [8..]  Extra data (protobuf, ExtraSize bytes; Version 2/3 only)

use crate::storage::needle::ttl::TTL;
|
||||
use crate::storage::types::Version;
|
||||
|
||||
pub const SUPER_BLOCK_SIZE: usize = 8;
|
||||
|
||||
/// SuperBlock metadata at the start of a volume .dat file.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct SuperBlock {
|
||||
pub version: Version,
|
||||
pub replica_placement: ReplicaPlacement,
|
||||
pub ttl: TTL,
|
||||
pub compaction_revision: u16,
|
||||
pub extra_size: u16,
|
||||
pub extra_data: Vec<u8>, // raw protobuf bytes (SuperBlockExtra)
|
||||
}
|
||||
|
||||
impl SuperBlock {
|
||||
/// Total block size on disk (base 8 + extra).
|
||||
pub fn block_size(&self) -> usize {
|
||||
match self.version.0 {
|
||||
2 | 3 => SUPER_BLOCK_SIZE + self.extra_size as usize,
|
||||
_ => SUPER_BLOCK_SIZE,
|
||||
}
|
||||
}
|
||||
|
||||
/// Serialize to bytes.
|
||||
pub fn to_bytes(&self) -> Vec<u8> {
|
||||
let mut header = vec![0u8; SUPER_BLOCK_SIZE];
|
||||
header[0] = self.version.0;
|
||||
header[1] = self.replica_placement.to_byte();
|
||||
self.ttl.to_bytes(&mut header[2..4]);
|
||||
header[4..6].copy_from_slice(&self.compaction_revision.to_be_bytes());
|
||||
|
||||
if !self.extra_data.is_empty() {
|
||||
// Go checks extraSize > 256*256-2 and calls glog.Fatalf; guard against u16 overflow.
|
||||
assert!(
|
||||
self.extra_data.len() <= 65534,
|
||||
"super block extra data too large: {} > 65534",
|
||||
self.extra_data.len()
|
||||
);
|
||||
let extra_size = self.extra_data.len() as u16;
|
||||
header[6..8].copy_from_slice(&extra_size.to_be_bytes());
|
||||
header.extend_from_slice(&self.extra_data);
|
||||
}
|
||||
|
||||
header
|
||||
}
|
||||
|
||||
/// Parse from bytes (must be at least SUPER_BLOCK_SIZE bytes).
|
||||
pub fn from_bytes(bytes: &[u8]) -> Result<Self, SuperBlockError> {
|
||||
if bytes.len() < SUPER_BLOCK_SIZE {
|
||||
return Err(SuperBlockError::TooShort(bytes.len()));
|
||||
}
|
||||
|
||||
let version = Version(bytes[0]);
|
||||
let replica_placement = ReplicaPlacement::from_byte(bytes[1])?;
|
||||
let ttl = TTL::from_bytes(&bytes[2..4]);
|
||||
let compaction_revision = u16::from_be_bytes([bytes[4], bytes[5]]);
|
||||
let extra_size = u16::from_be_bytes([bytes[6], bytes[7]]);
|
||||
|
||||
let extra_data = if extra_size > 0 && bytes.len() >= SUPER_BLOCK_SIZE + extra_size as usize
|
||||
{
|
||||
bytes[SUPER_BLOCK_SIZE..SUPER_BLOCK_SIZE + extra_size as usize].to_vec()
|
||||
} else {
|
||||
vec![]
|
||||
};
|
||||
|
||||
Ok(SuperBlock {
|
||||
version,
|
||||
replica_placement,
|
||||
ttl,
|
||||
compaction_revision,
|
||||
extra_size,
|
||||
extra_data,
|
||||
})
|
||||
}
|
||||
|
||||
pub fn initialized(&self) -> bool {
|
||||
true // ReplicaPlacement and TTL are always valid after construction
|
||||
}
|
||||
}
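The byte layout from the module comment can be checked with a standalone sketch (`encode_header` is a hypothetical helper mirroring `to_bytes` for the no-extra-data case; the TTL bytes here are opaque placeholders):

```rust
// Standalone sketch of the 8-byte super block header described above.
// encode_header is an illustrative helper, not part of the crate.
fn encode_header(version: u8, replica_placement: u8, ttl: [u8; 2], revision: u16) -> [u8; 8] {
    let mut h = [0u8; 8];
    h[0] = version; // [0]    Version
    h[1] = replica_placement; // [1]    ReplicaPlacement byte
    h[2..4].copy_from_slice(&ttl); // [2..4] TTL
    h[4..6].copy_from_slice(&revision.to_be_bytes()); // [4..6] CompactionRevision
    // [6..8] stays zero: ExtraSize is 0 when there is no extra data
    h
}

fn main() {
    let h = encode_header(3, 10, [5, 3], 42);
    assert_eq!(h[0], 3);
    assert_eq!(u16::from_be_bytes([h[4], h[5]]), 42);
    assert_eq!(u16::from_be_bytes([h[6], h[7]]), 0);
    println!("{:?}", h);
}
```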
impl Default for SuperBlock {
    fn default() -> Self {
        SuperBlock {
            version: Version::current(),
            replica_placement: ReplicaPlacement::default(),
            ttl: TTL::EMPTY,
            compaction_revision: 0,
            extra_size: 0,
            extra_data: vec![],
        }
    }
}

// ============================================================================
// ReplicaPlacement
// ============================================================================

/// Replication strategy encoded as a single byte.
///
/// Byte value = DiffDataCenterCount * 100 + DiffRackCount * 10 + SameRackCount
///
/// Examples:
///   "000" → no replication (1 copy total)
///   "010" → 1 copy in a different rack (2 copies total)
///   "100" → 1 copy in a different data center
///   "200" → 2 copies in different data centers
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct ReplicaPlacement {
    pub same_rack_count: u8,
    pub diff_rack_count: u8,
    pub diff_data_center_count: u8,
}

impl ReplicaPlacement {
    /// Parse from a string like "000", "010", "100".
    /// Accepts 0-3 character strings, padding with leading zeros to match Go behavior.
    /// E.g. "" -> "000", "1" -> "001", "01" -> "001", "010" -> "010"
    pub fn from_string(s: &str) -> Result<Self, SuperBlockError> {
        let s = s.trim();
        if s.is_empty() {
            return Ok(ReplicaPlacement::default());
        }
        // Pad with leading zeros to 3 chars, matching Go's NewReplicaPlacementFromString
        let padded = match s.len() {
            1 => format!("00{}", s),
            2 => format!("0{}", s),
            3 => s.to_string(),
            _ => return Err(SuperBlockError::InvalidReplicaPlacement(s.to_string())),
        };
        let chars: Vec<char> = padded.chars().collect();
        let dc = chars[0]
            .to_digit(10)
            .ok_or_else(|| SuperBlockError::InvalidReplicaPlacement(s.to_string()))?
            as u8;
        let rack = chars[1]
            .to_digit(10)
            .ok_or_else(|| SuperBlockError::InvalidReplicaPlacement(s.to_string()))?
            as u8;
        let same = chars[2]
            .to_digit(10)
            .ok_or_else(|| SuperBlockError::InvalidReplicaPlacement(s.to_string()))?
            as u8;
        // Go validates: value = dc*100 + rack*10 + same must fit in a byte
        let value = dc as u16 * 100 + rack as u16 * 10 + same as u16;
        if value > 255 {
            return Err(SuperBlockError::InvalidReplicaPlacement(s.to_string()));
        }
        Ok(ReplicaPlacement {
            diff_data_center_count: dc,
            diff_rack_count: rack,
            same_rack_count: same,
        })
    }

    /// Parse from a single byte.
    pub fn from_byte(b: u8) -> Result<Self, SuperBlockError> {
        Ok(ReplicaPlacement {
            diff_data_center_count: b / 100,
            diff_rack_count: (b % 100) / 10,
            same_rack_count: b % 10,
        })
    }

    /// Encode as a single byte.
    pub fn to_byte(&self) -> u8 {
        self.diff_data_center_count * 100 + self.diff_rack_count * 10 + self.same_rack_count
    }

    /// Total number of copies (including the original).
    pub fn get_copy_count(&self) -> u8 {
        self.diff_data_center_count + self.diff_rack_count + self.same_rack_count + 1
    }

    /// Whether this placement requires replication (more than 1 copy).
    pub fn has_replication(&self) -> bool {
        self.get_copy_count() > 1
    }
}

impl std::fmt::Display for ReplicaPlacement {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(
            f,
            "{}{}{}",
            self.diff_data_center_count, self.diff_rack_count, self.same_rack_count
        )
    }
}
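The byte encoding above reduces to simple decimal arithmetic, reproduced here standalone (`rp_to_byte`/`rp_from_byte` are illustrative helpers, not the crate's API):

```rust
// Sketch of the ReplicaPlacement byte arithmetic: dc*100 + rack*10 + same.
// rp_to_byte / rp_from_byte are illustrative helpers, not crate API.
fn rp_to_byte(dc: u8, rack: u8, same: u8) -> Option<u8> {
    let v = dc as u16 * 100 + rack as u16 * 10 + same as u16;
    // Go rejects combinations that do not fit in a byte (e.g. "999" = 999)
    if v > 255 { None } else { Some(v as u8) }
}

fn rp_from_byte(b: u8) -> (u8, u8, u8) {
    (b / 100, (b % 100) / 10, b % 10) // (dc, rack, same)
}

fn main() {
    assert_eq!(rp_to_byte(0, 1, 0), Some(10)); // "010": one copy in a different rack
    assert_eq!(rp_from_byte(123), (1, 2, 3)); // round-trips "123"
    assert_eq!(rp_to_byte(9, 9, 9), None); // 999 does not fit in a byte
    println!("ok");
}
```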
// ============================================================================
// Errors
// ============================================================================

#[derive(Debug, thiserror::Error)]
pub enum SuperBlockError {
    #[error("super block too short: {0} bytes")]
    TooShort(usize),

    #[error("invalid replica placement: {0}")]
    InvalidReplicaPlacement(String),
}

// ============================================================================
// Tests
// ============================================================================

#[cfg(test)]
mod tests {
    use super::*;
    use crate::storage::types::*;

    #[test]
    fn test_super_block_round_trip() {
        let sb = SuperBlock {
            version: VERSION_3,
            replica_placement: ReplicaPlacement::from_string("010").unwrap(),
            ttl: TTL { count: 5, unit: 3 },
            compaction_revision: 42,
            extra_size: 0,
            extra_data: vec![],
        };

        let bytes = sb.to_bytes();
        assert_eq!(bytes.len(), SUPER_BLOCK_SIZE);

        let sb2 = SuperBlock::from_bytes(&bytes).unwrap();
        assert_eq!(sb2.version, sb.version);
        assert_eq!(sb2.replica_placement, sb.replica_placement);
        assert_eq!(sb2.ttl, sb.ttl);
        assert_eq!(sb2.compaction_revision, sb.compaction_revision);
    }

    #[test]
    fn test_super_block_with_extra() {
        let sb = SuperBlock {
            version: VERSION_3,
            replica_placement: ReplicaPlacement::default(),
            ttl: TTL::EMPTY,
            compaction_revision: 0,
            extra_size: 3,
            extra_data: vec![1, 2, 3],
        };

        let bytes = sb.to_bytes();
        assert_eq!(bytes.len(), SUPER_BLOCK_SIZE + 3);

        let sb2 = SuperBlock::from_bytes(&bytes).unwrap();
        assert_eq!(sb2.extra_data, vec![1, 2, 3]);
    }

    #[test]
    fn test_replica_placement_byte_round_trip() {
        let rp = ReplicaPlacement::from_string("123").unwrap();
        assert_eq!(rp.diff_data_center_count, 1);
        assert_eq!(rp.diff_rack_count, 2);
        assert_eq!(rp.same_rack_count, 3);
        assert_eq!(rp.to_byte(), 123);
        assert_eq!(rp.get_copy_count(), 7); // 1+2+3+1

        let rp2 = ReplicaPlacement::from_byte(123).unwrap();
        assert_eq!(rp, rp2);
    }

    #[test]
    fn test_replica_placement_no_replication() {
        let rp = ReplicaPlacement::from_string("000").unwrap();
        assert!(!rp.has_replication());
        assert_eq!(rp.get_copy_count(), 1);
    }

    #[test]
    fn test_replica_placement_display() {
        let rp = ReplicaPlacement::from_string("010").unwrap();
        assert_eq!(rp.to_string(), "010");
        assert!(rp.has_replication());
    }
}
seaweed-volume/src/storage/types.rs (new file, 679 lines)
@@ -0,0 +1,679 @@
//! Core storage types: NeedleId, Offset, Size, Cookie, DiskType.
//!
//! These types define the binary-compatible on-disk format matching the Go implementation.
//! CRITICAL: Byte layout must match exactly for cross-compatibility.

use std::fmt;

// ============================================================================
// Constants
// ============================================================================

pub const NEEDLE_ID_SIZE: usize = 8;
pub const NEEDLE_ID_EMPTY: u64 = 0;
pub const COOKIE_SIZE: usize = 4;
pub const SIZE_SIZE: usize = 4;
pub const NEEDLE_HEADER_SIZE: usize = COOKIE_SIZE + NEEDLE_ID_SIZE + SIZE_SIZE; // 16
pub const DATA_SIZE_SIZE: usize = 4;
pub const TIMESTAMP_SIZE: usize = 8;
pub const NEEDLE_PADDING_SIZE: usize = 8;
pub const NEEDLE_CHECKSUM_SIZE: usize = 4;

/// 5-byte offset mode (matching Go production builds with `-tags 5BytesOffset`).
/// Max volume size: 8TB. Index entry: 17 bytes (8 + 5 + 4).
#[cfg(feature = "5bytes")]
pub const OFFSET_SIZE: usize = 5;
#[cfg(feature = "5bytes")]
pub const MAX_POSSIBLE_VOLUME_SIZE: u64 = 4 * 1024 * 1024 * 1024 * 8 * 256; // 8TB

/// 4-byte offset mode (matching Go default build without `5BytesOffset`).
/// Max volume size: 32GB. Index entry: 16 bytes (8 + 4 + 4).
#[cfg(not(feature = "5bytes"))]
pub const OFFSET_SIZE: usize = 4;
#[cfg(not(feature = "5bytes"))]
pub const MAX_POSSIBLE_VOLUME_SIZE: u64 = 4 * 1024 * 1024 * 1024 * 8; // 32GB

pub const NEEDLE_MAP_ENTRY_SIZE: usize = NEEDLE_ID_SIZE + OFFSET_SIZE + SIZE_SIZE;

// ============================================================================
// NeedleId
// ============================================================================

/// 64-bit unique identifier for a needle within a volume.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord, Default)]
pub struct NeedleId(pub u64);

impl NeedleId {
    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= NEEDLE_ID_SIZE);
        bytes[0..8].copy_from_slice(&self.0.to_be_bytes());
    }

    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= NEEDLE_ID_SIZE);
        NeedleId(u64::from_be_bytes([
            bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7],
        ]))
    }

    pub fn is_empty(&self) -> bool {
        self.0 == 0
    }

    /// Parse a hex string into a NeedleId.
    pub fn parse(s: &str) -> Result<Self, std::num::ParseIntError> {
        u64::from_str_radix(s, 16).map(NeedleId)
    }
}

impl fmt::Display for NeedleId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:x}", self.0)
    }
}

impl From<u64> for NeedleId {
    fn from(v: u64) -> Self {
        NeedleId(v)
    }
}

impl From<NeedleId> for u64 {
    fn from(v: NeedleId) -> Self {
        v.0
    }
}

// ============================================================================
// Cookie
// ============================================================================

/// Random 32-bit value to mitigate brute-force lookups.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub struct Cookie(pub u32);

impl Cookie {
    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= COOKIE_SIZE);
        bytes[0..4].copy_from_slice(&self.0.to_be_bytes());
    }

    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= COOKIE_SIZE);
        Cookie(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
    }

    /// Parse a hex string into a Cookie.
    pub fn parse(s: &str) -> Result<Self, std::num::ParseIntError> {
        u32::from_str_radix(s, 16).map(Cookie)
    }
}

impl fmt::Display for Cookie {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:x}", self.0)
    }
}

impl From<u32> for Cookie {
    fn from(v: u32) -> Self {
        Cookie(v)
    }
}

// ============================================================================
// Size
// ============================================================================

/// Needle size as stored in the index. Negative = deleted.
///
/// - Positive: valid needle with that many bytes of body content
/// - TombstoneFileSize (-1): tombstone marker
/// - Other negative: deleted, absolute value was the original size
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub struct Size(pub i32);

/// Special marker for a tombstone (deletion marker) entry.
pub const TOMBSTONE_FILE_SIZE: Size = Size(-1);

impl Size {
    pub fn is_tombstone(&self) -> bool {
        self.0 == TOMBSTONE_FILE_SIZE.0
    }

    pub fn is_deleted(&self) -> bool {
        self.0 < 0 || self.0 == TOMBSTONE_FILE_SIZE.0
    }

    pub fn is_valid(&self) -> bool {
        self.0 > 0 && !self.is_tombstone()
    }

    /// Raw storage size. For tombstones returns 0; for negative returns abs value.
    pub fn raw(&self) -> u32 {
        if self.is_tombstone() {
            return 0;
        }
        if self.0 < 0 {
            return (self.0 * -1) as u32;
        }
        self.0 as u32
    }

    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= SIZE_SIZE);
        bytes[0..4].copy_from_slice(&(self.0 as u32).to_be_bytes());
    }

    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= SIZE_SIZE);
        let v = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        Size(v as i32)
    }
}

impl From<i32> for Size {
    fn from(v: i32) -> Self {
        Size(v)
    }
}

impl From<Size> for i32 {
    fn from(v: Size) -> Self {
        v.0
    }
}
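The deleted/tombstone semantics above can be sketched standalone with a plain i32 in place of the `Size` newtype (`raw_size` is an illustrative stand-in for `Size::raw`, not crate API):

```rust
// Sketch of the Size semantics: -1 is a tombstone, other negatives are
// deleted entries that remember their original size.
const TOMBSTONE: i32 = -1;

fn raw_size(s: i32) -> u32 {
    if s == TOMBSTONE {
        0 // tombstones carry no payload
    } else {
        s.unsigned_abs() // deleted entries keep the original size as a negative value
    }
}

fn main() {
    assert_eq!(raw_size(100), 100); // valid needle
    assert_eq!(raw_size(-50), 50); // deleted, original size 50
    assert_eq!(raw_size(TOMBSTONE), 0);

    // On disk the i32 is stored via its u32 bit pattern, so negatives round-trip:
    let buf = (-50i32 as u32).to_be_bytes();
    assert_eq!(u32::from_be_bytes(buf) as i32, -50);
    println!("ok");
}
```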
// ============================================================================
// Offset
// ============================================================================

/// Offset encoding for needle positions in .dat files.
///
/// The offset is stored divided by NEEDLE_PADDING_SIZE (8).
///
/// With the `5bytes` feature (the crate default, matching Go production builds):
///   5 bytes can address up to 8TB.
///   On-disk layout: [b3][b2][b1][b0][b4] (big-endian 4 bytes + 1 high byte)
///
/// Without the `5bytes` feature (matching the Go default build):
///   4 bytes can address up to 32GB.
///   On-disk layout: [b3][b2][b1][b0] (big-endian 4 bytes)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Default)]
pub struct Offset {
    pub b0: u8,
    pub b1: u8,
    pub b2: u8,
    pub b3: u8,
    #[cfg(feature = "5bytes")]
    pub b4: u8,
}

impl Offset {
    /// Convert to the actual byte offset in the .dat file.
    pub fn to_actual_offset(&self) -> i64 {
        let stored = self.b0 as i64
            + (self.b1 as i64) * 256
            + (self.b2 as i64) * 65536
            + (self.b3 as i64) * 16777216;
        #[cfg(feature = "5bytes")]
        let stored = stored + (self.b4 as i64) * 4294967296; // 1 << 32
        stored * NEEDLE_PADDING_SIZE as i64
    }

    /// Create an Offset from an actual byte offset.
    pub fn from_actual_offset(offset: i64) -> Self {
        let smaller = offset / NEEDLE_PADDING_SIZE as i64;
        Offset {
            b0: smaller as u8,
            b1: (smaller >> 8) as u8,
            b2: (smaller >> 16) as u8,
            b3: (smaller >> 24) as u8,
            #[cfg(feature = "5bytes")]
            b4: (smaller >> 32) as u8,
        }
    }

    /// Serialize to bytes in the .idx file format.
    /// 5-byte layout: [b3][b2][b1][b0][b4]
    /// 4-byte layout: [b3][b2][b1][b0]
    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= OFFSET_SIZE);
        bytes[0] = self.b3;
        bytes[1] = self.b2;
        bytes[2] = self.b1;
        bytes[3] = self.b0;
        #[cfg(feature = "5bytes")]
        {
            bytes[4] = self.b4;
        }
    }

    /// Deserialize from bytes in the .idx file format.
    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= OFFSET_SIZE);
        Offset {
            b3: bytes[0],
            b2: bytes[1],
            b1: bytes[2],
            b0: bytes[3],
            #[cfg(feature = "5bytes")]
            b4: bytes[4],
        }
    }

    pub fn is_zero(&self) -> bool {
        #[cfg(feature = "5bytes")]
        {
            self.b0 == 0 && self.b1 == 0 && self.b2 == 0 && self.b3 == 0 && self.b4 == 0
        }
        #[cfg(not(feature = "5bytes"))]
        {
            self.b0 == 0 && self.b1 == 0 && self.b2 == 0 && self.b3 == 0
        }
    }
}
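The size limits in the doc comment fall directly out of the divide-by-8 encoding; a quick standalone check of the arithmetic:

```rust
// Sketch of the offset math: stored offsets are the actual byte offset
// divided by NEEDLE_PADDING_SIZE (8), which is what lets 4 or 5 stored
// bytes address roughly 32GB or 8TB respectively.
const PADDING: i64 = 8;

fn main() {
    // 4-byte mode: max stored value 2^32 - 1, so (2^32 - 1) * 8 bytes ≈ 32GB
    let max4 = ((1i64 << 32) - 1) * PADDING;
    assert_eq!(max4 / PADDING, (1i64 << 32) - 1);

    // 5-byte mode: max stored value 2^40 - 1, so ≈ 8TB
    let max5 = ((1i64 << 40) - 1) * PADDING;
    assert_eq!(max5 / PADDING, (1i64 << 40) - 1);

    // Offsets that are not multiples of 8 cannot be represented exactly:
    assert_eq!((12 / PADDING) * PADDING, 8);
    println!("ok");
}
```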
impl fmt::Display for Offset {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.to_actual_offset())
    }
}

// ============================================================================
// DiskType
// ============================================================================

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DiskType {
    HardDrive,
    Ssd,
    Custom(String),
}

impl DiskType {
    pub fn from_string(s: &str) -> Self {
        match s.to_lowercase().as_str() {
            "" | "hdd" => DiskType::HardDrive,
            "ssd" => DiskType::Ssd,
            other => DiskType::Custom(other.to_string()),
        }
    }

    pub fn readable_string(&self) -> &str {
        match self {
            DiskType::HardDrive => "hdd",
            DiskType::Ssd => "ssd",
            DiskType::Custom(s) => s,
        }
    }
}

impl fmt::Display for DiskType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DiskType::HardDrive => write!(f, ""),
            DiskType::Ssd => write!(f, "ssd"),
            DiskType::Custom(s) => write!(f, "{}", s),
        }
    }
}

impl Default for DiskType {
    fn default() -> Self {
        DiskType::HardDrive
    }
}

// ============================================================================
// VolumeId
// ============================================================================

/// Volume identifier, stored as u32.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord, Default)]
pub struct VolumeId(pub u32);

impl VolumeId {
    pub fn parse(s: &str) -> Result<Self, std::num::ParseIntError> {
        s.parse::<u32>().map(VolumeId)
    }

    pub fn next(&self) -> VolumeId {
        VolumeId(self.0 + 1)
    }
}

impl fmt::Display for VolumeId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl From<u32> for VolumeId {
    fn from(v: u32) -> Self {
        VolumeId(v)
    }
}

// ============================================================================
// Version
// ============================================================================

/// Needle storage format version.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Version(pub u8);

pub const VERSION_1: Version = Version(1);
pub const VERSION_2: Version = Version(2);
pub const VERSION_3: Version = Version(3);

impl Version {
    pub fn current() -> Self {
        VERSION_3
    }

    pub fn is_supported(&self) -> bool {
        self.0 >= 1 && self.0 <= 3
    }
}

impl Default for Version {
    fn default() -> Self {
        VERSION_3
    }
}

impl From<u8> for Version {
    fn from(v: u8) -> Self {
        Version(v)
    }
}

// ============================================================================
// ReadOption
// ============================================================================

/// Options controlling needle read behavior, matching Go's `ReadOption` in store.go.
///
/// Fields are split into request-side options (set by the caller) and response-side
/// flags (set during the read to communicate status back).
#[derive(Debug, Clone)]
pub struct ReadOption {
    // -- request --
    /// If true, allow reading needles that have been soft-deleted.
    pub read_deleted: bool,
    /// If true, attempt to read only metadata for large needles (> PagedReadLimit).
    pub attempt_meta_only: bool,
    /// If true, the caller requires metadata only (no data payload).
    pub must_meta_only: bool,

    // -- response --
    /// Set to true when the read actually returned metadata only.
    pub is_meta_only: bool,
    /// Compaction revision at the time of the read (for consistency during streaming).
    pub volume_revision: u16,
    /// Set to true when the offset exceeded MaxPossibleVolumeSize (4-byte offset wrap).
    pub is_out_of_range: bool,

    // -- slow-read / streaming --
    /// When true, the read lock is acquired and released per chunk instead of held
    /// for the entire read, reducing write latency at the cost of higher read P99.
    pub has_slow_read: bool,
    /// Buffer size for chunked streaming reads (used with `has_slow_read`).
    pub read_buffer_size: i32,
}

impl Default for ReadOption {
    fn default() -> Self {
        ReadOption {
            read_deleted: false,
            attempt_meta_only: false,
            must_meta_only: false,
            is_meta_only: false,
            volume_revision: 0,
            is_out_of_range: false,
            has_slow_read: false,
            read_buffer_size: 0,
        }
    }
}

// ============================================================================
// NeedleMapEntry helpers (for .idx file)
// ============================================================================

/// Parse a single .idx file entry (NEEDLE_MAP_ENTRY_SIZE bytes: 17 with 5-byte
/// offsets, 16 without) into (NeedleId, Offset, Size).
pub fn idx_entry_from_bytes(bytes: &[u8]) -> (NeedleId, Offset, Size) {
    assert!(bytes.len() >= NEEDLE_MAP_ENTRY_SIZE);
    let key = NeedleId::from_bytes(&bytes[..NEEDLE_ID_SIZE]);
    let offset = Offset::from_bytes(&bytes[NEEDLE_ID_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE]);
    let size = Size::from_bytes(
        &bytes[NEEDLE_ID_SIZE + OFFSET_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE + SIZE_SIZE],
    );
    (key, offset, size)
}

/// Write a single .idx file entry (NEEDLE_MAP_ENTRY_SIZE bytes).
pub fn idx_entry_to_bytes(bytes: &mut [u8], key: NeedleId, offset: Offset, size: Size) {
    assert!(bytes.len() >= NEEDLE_MAP_ENTRY_SIZE);
    key.to_bytes(&mut bytes[..NEEDLE_ID_SIZE]);
    offset.to_bytes(&mut bytes[NEEDLE_ID_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE]);
    size.to_bytes(
        &mut bytes[NEEDLE_ID_SIZE + OFFSET_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE + SIZE_SIZE],
    );
}
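The full 5-byte-mode entry layout those helpers produce can be sketched standalone (`pack`/`unpack` are illustrative helpers built from the layout described above, not the crate's API):

```rust
// Standalone sketch of the 17-byte idx entry in 5-byte offset mode:
// [0..8] needle id (BE) | [8..13] offset as [b3][b2][b1][b0][b4] | [13..17] size (BE).
fn pack(key: u64, stored_offset: u64, size: u32) -> [u8; 17] {
    let mut e = [0u8; 17];
    e[0..8].copy_from_slice(&key.to_be_bytes());
    e[8..12].copy_from_slice(&(stored_offset as u32).to_be_bytes()); // low 32 bits, BE
    e[12] = (stored_offset >> 32) as u8; // high byte b4 goes last
    e[13..17].copy_from_slice(&size.to_be_bytes());
    e
}

fn unpack(e: &[u8; 17]) -> (u64, u64, u32) {
    let key = u64::from_be_bytes(e[0..8].try_into().unwrap());
    let low = u32::from_be_bytes(e[8..12].try_into().unwrap()) as u64;
    let offset = low | ((e[12] as u64) << 32);
    let size = u32::from_be_bytes(e[13..17].try_into().unwrap());
    (key, offset, size)
}

fn main() {
    let e = pack(0xdead_beef_1234_5678, 999, 4096);
    assert_eq!(unpack(&e), (0xdead_beef_1234_5678, 999, 4096));
    println!("ok");
}
```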
|
||||
|
||||
// ============================================================================
|
||||
// Tests
|
||||
// ============================================================================
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_needle_id_round_trip() {
|
||||
let id = NeedleId(0x123456789abcdef0);
|
||||
let mut buf = [0u8; 8];
|
||||
id.to_bytes(&mut buf);
|
||||
let id2 = NeedleId::from_bytes(&buf);
|
||||
assert_eq!(id, id2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_needle_id_display() {
|
||||
let id = NeedleId(255);
|
||||
assert_eq!(id.to_string(), "ff");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_needle_id_parse() {
|
||||
let id = NeedleId::parse("ff").unwrap();
|
||||
assert_eq!(id, NeedleId(255));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_cookie_round_trip() {
|
||||
let cookie = Cookie(0xdeadbeef);
|
||||
let mut buf = [0u8; 4];
|
||||
cookie.to_bytes(&mut buf);
|
||||
let cookie2 = Cookie::from_bytes(&buf);
|
||||
assert_eq!(cookie, cookie2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_size_semantics() {
|
||||
assert!(Size(100).is_valid());
|
||||
assert!(!Size(100).is_deleted());
|
||||
assert!(!Size(100).is_tombstone());
|
||||
assert_eq!(Size(100).raw(), 100);
|
||||
|
||||
assert!(Size(-50).is_deleted());
|
||||
assert!(!Size(-50).is_tombstone());
|
||||
assert_eq!(Size(-50).raw(), 50);
|
||||
|
||||
assert!(TOMBSTONE_FILE_SIZE.is_deleted());
|
||||
assert!(TOMBSTONE_FILE_SIZE.is_tombstone());
|
||||
assert_eq!(TOMBSTONE_FILE_SIZE.raw(), 0);
|
||||
|
||||
assert!(!Size(0).is_valid());
|
||||
assert!(!Size(0).is_deleted());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_size_round_trip() {
|
||||
let size = Size(12345);
|
||||
let mut buf = [0u8; 4];
|
||||
size.to_bytes(&mut buf);
|
||||
let size2 = Size::from_bytes(&buf);
|
||||
assert_eq!(size, size2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_size_negative_round_trip() {
|
||||
// Negative sizes round-trip through u32 bit pattern
|
||||
let size = Size(-50);
|
||||
let mut buf = [0u8; 4];
|
||||
size.to_bytes(&mut buf);
|
||||
let size2 = Size::from_bytes(&buf);
|
||||
assert_eq!(size, size2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_offset_round_trip() {
|
||||
// Test with a known actual offset
|
||||
let actual_offset: i64 = 8 * 1000000; // must be multiple of 8
|
||||
let offset = Offset::from_actual_offset(actual_offset);
|
||||
assert_eq!(offset.to_actual_offset(), actual_offset);
|
||||
|
||||
// Test byte serialization
|
||||
let mut buf = [0u8; 5];
|
||||
offset.to_bytes(&mut buf);
|
||||
let offset2 = Offset::from_bytes(&buf);
|
||||
assert_eq!(offset.to_actual_offset(), offset2.to_actual_offset());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_offset_zero() {
|
||||
let offset = Offset::default();
|
||||
assert!(offset.is_zero());
|
||||
assert_eq!(offset.to_actual_offset(), 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_offset_max() {
|
||||
// Max stored value depends on offset size
|
||||
#[cfg(feature = "5bytes")]
|
||||
let max_stored: i64 = (1i64 << 40) - 1; // 5-byte max
|
||||
#[cfg(not(feature = "5bytes"))]
|
||||
let max_stored: i64 = (1i64 << 32) - 1; // 4-byte max
|
||||
let max_actual = max_stored * NEEDLE_PADDING_SIZE as i64;
|
||||
let offset = Offset::from_actual_offset(max_actual);
|
||||
assert_eq!(offset.to_actual_offset(), max_actual);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_offset_size_constants() {
|
||||
#[cfg(feature = "5bytes")]
|
||||
{
|
||||
assert_eq!(OFFSET_SIZE, 5);
|
||||
assert_eq!(NEEDLE_MAP_ENTRY_SIZE, 17); // 8 + 5 + 4
|
||||
assert_eq!(MAX_POSSIBLE_VOLUME_SIZE, 4 * 1024 * 1024 * 1024 * 8 * 256);
|
||||
// 8TB
|
||||
}
|
||||
#[cfg(not(feature = "5bytes"))]
|
||||
{
|
||||
assert_eq!(OFFSET_SIZE, 4);
|
||||
assert_eq!(NEEDLE_MAP_ENTRY_SIZE, 16); // 8 + 4 + 4
|
||||
assert_eq!(MAX_POSSIBLE_VOLUME_SIZE, 4 * 1024 * 1024 * 1024 * 8); // 32GB
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_idx_entry_round_trip() {
|
||||
let key = NeedleId(0xdeadbeef12345678);
|
||||
let offset = Offset::from_actual_offset(8 * 999);
|
||||
let size = Size(4096);
|
||||
|
||||
let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
|
||||
idx_entry_to_bytes(&mut buf, key, offset, size);
|
||||
|
||||
let (key2, offset2, size2) = idx_entry_from_bytes(&buf);
|
||||
assert_eq!(key, key2);
|
||||
        assert_eq!(offset.to_actual_offset(), offset2.to_actual_offset());
        assert_eq!(size, size2);
    }

    #[test]
    fn test_volume_id() {
        let vid = VolumeId::parse("42").unwrap();
        assert_eq!(vid, VolumeId(42));
        assert_eq!(vid.to_string(), "42");
        assert_eq!(vid.next(), VolumeId(43));
    }

    #[test]
    fn test_version() {
        assert!(VERSION_1.is_supported());
        assert!(VERSION_2.is_supported());
        assert!(VERSION_3.is_supported());
        assert!(!Version(0).is_supported());
        assert!(!Version(4).is_supported());
        assert_eq!(Version::current(), VERSION_3);
    }

    #[test]
    fn test_disk_type() {
        assert_eq!(DiskType::from_string(""), DiskType::HardDrive);
        assert_eq!(DiskType::from_string("hdd"), DiskType::HardDrive);
        assert_eq!(DiskType::from_string("SSD"), DiskType::Ssd);
        assert_eq!(
            DiskType::from_string("nvme"),
            DiskType::Custom("nvme".to_string())
        );
        assert_eq!(DiskType::HardDrive.readable_string(), "hdd");
        assert_eq!(DiskType::Ssd.readable_string(), "ssd");
    }

    #[test]
    fn test_read_option_default() {
        let ro = ReadOption::default();
        assert!(!ro.read_deleted);
        assert!(!ro.attempt_meta_only);
        assert!(!ro.must_meta_only);
        assert!(!ro.is_meta_only);
        assert_eq!(ro.volume_revision, 0);
        assert!(!ro.is_out_of_range);
        assert!(!ro.has_slow_read);
        assert_eq!(ro.read_buffer_size, 0);
    }

    #[test]
    fn test_read_option_custom() {
        let ro = ReadOption {
            read_deleted: true,
            attempt_meta_only: true,
            has_slow_read: true,
            read_buffer_size: 1024 * 1024,
            ..ReadOption::default()
        };
        assert!(ro.read_deleted);
        assert!(ro.attempt_meta_only);
        assert!(!ro.must_meta_only);
        assert!(!ro.is_meta_only);
        assert!(ro.has_slow_read);
        assert_eq!(ro.read_buffer_size, 1024 * 1024);
    }

    #[test]
    fn test_read_option_clone() {
        let ro = ReadOption {
            is_out_of_range: true,
            volume_revision: 42,
            ..ReadOption::default()
        };
        let ro2 = ro.clone();
        assert!(ro2.is_out_of_range);
        assert_eq!(ro2.volume_revision, 42);
    }
}
4246  seaweed-volume/src/storage/volume.rs  (new file)
File diff suppressed because it is too large

79  seaweed-volume/src/version.rs  (new file)
@@ -0,0 +1,79 @@
//! Version helpers aligned with Go's util/version package.

use std::sync::OnceLock;

#[cfg(feature = "5bytes")]
const SIZE_LIMIT: &str = "8000GB"; // Matches Go production builds (5BytesOffset)
#[cfg(not(feature = "5bytes"))]
const SIZE_LIMIT: &str = "30GB"; // Matches Go default build (!5BytesOffset)

pub fn size_limit() -> &'static str {
    SIZE_LIMIT
}

pub fn commit() -> &'static str {
    option_env!("SEAWEEDFS_COMMIT")
        .or(option_env!("GIT_COMMIT"))
        .or(option_env!("GIT_SHA"))
        .unwrap_or("")
}

pub fn version_number() -> &'static str {
    static VERSION_NUMBER: OnceLock<String> = OnceLock::new();
    VERSION_NUMBER
        .get_or_init(|| {
            parse_go_version_number().unwrap_or_else(|| env!("CARGO_PKG_VERSION").to_string())
        })
        .as_str()
}

pub fn version() -> &'static str {
    static VERSION: OnceLock<String> = OnceLock::new();
    VERSION
        .get_or_init(|| format!("{} {}", size_limit(), version_number()))
        .as_str()
}

pub fn full_version() -> &'static str {
    static FULL: OnceLock<String> = OnceLock::new();
    FULL.get_or_init(|| format!("{} {}", version(), commit()))
        .as_str()
}

pub fn server_header() -> &'static str {
    static HEADER: OnceLock<String> = OnceLock::new();
    HEADER
        .get_or_init(|| format!("SeaweedFS Volume {}", version()))
        .as_str()
}

fn parse_go_version_number() -> Option<String> {
    let src = include_str!(concat!(
        env!("CARGO_MANIFEST_DIR"),
        "/../weed/util/version/constants.go"
    ));
    let mut major: Option<u32> = None;
    let mut minor: Option<u32> = None;
    for line in src.lines() {
        let l = line.trim();
        if l.starts_with("MAJOR_VERSION") {
            major = parse_int32_line(l);
        } else if l.starts_with("MINOR_VERSION") {
            minor = parse_int32_line(l);
        }
        if major.is_some() && minor.is_some() {
            break;
        }
    }
    match (major, minor) {
        (Some(maj), Some(min)) => Some(format!("{}.{:02}", maj, min)),
        _ => None,
    }
}

fn parse_int32_line(line: &str) -> Option<u32> {
    let start = line.find("int32(")? + "int32(".len();
    let rest = &line[start..];
    let end = rest.find(')')?;
    rest[..end].trim().parse::<u32>().ok()
}
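For context, the `int32(...)` constant extraction at the bottom of version.rs can be exercised on its own. This is a minimal standalone sketch of that helper; the sample constant lines below are hypothetical stand-ins, not taken from the real constants.go:

```rust
// Standalone sketch of the int32(...) constant parser used by
// parse_go_version_number: find the "int32(" marker, take everything up to
// the next ')', and parse it as a u32.
fn parse_int32_line(line: &str) -> Option<u32> {
    let start = line.find("int32(")? + "int32(".len();
    let rest = &line[start..];
    let end = rest.find(')')?;
    rest[..end].trim().parse::<u32>().ok()
}

fn main() {
    // A Go constant line in the expected shape yields the wrapped value.
    assert_eq!(parse_int32_line("MAJOR_VERSION = int32(3)"), Some(3));
    // Lines without the int32(...) wrapper yield None.
    assert_eq!(parse_int32_line("MAJOR_VERSION = 3"), None);
    // The minor version is then zero-padded to two digits, e.g. "3.05".
    println!("{}.{:02}", 3, 5);
}
```

The zero-padding mirrors the Go version string format, where a minor version of 5 renders as "05".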