Rust volume server implementation with CI (#8539)
* Match Go gRPC client transport defaults
* Honor Go HTTP idle timeout
* Honor maintenanceMBps during volume copy
* Honor images.fix.orientation on uploads
* Honor cpuprofile when pprof is disabled
* Match Go memory status payloads
* Propagate request IDs across gRPC calls
* Format pending Rust source updates
* Match Go stats endpoint payloads
* Serve Go volume server UI assets
* Enforce Go HTTP whitelist guards
* Align Rust metrics admin-port test with Go behavior
* Format pending Rust server updates
* Honor access.ui without per-request JWT checks
* Honor keepLocalDatFile in tier upload shortcut
* Honor Go remote volume write mode
* Load tier backends from master config
* Check master config before loading volumes
* Remove vif files on volume destroy
* Delete remote tier data on volume destroy
* Honor vif version defaults and overrides
* Reject mismatched vif bytes offsets
* Load remote-only tiered volumes
* Report Go tail offsets in sync status
* Stream remote dat in incremental copy
* Honor collection vif for EC shard config
* Persist EC expireAtSec in vif metadata
* Stream remote volume reads through HTTP
* Serve HTTP ranges from backend source
* Match Go ReadAllNeedles scan order
* Match Go CopyFile zero-stop metadata
* Delete EC volumes with collection cleanup
* Drop deleted collection metrics
* Match Go tombstone ReadNeedleMeta
* Match Go TTL parsing: all-digit values default to minutes, two-pass fit algorithm
* Match Go needle ID/cookie formatting and name size computation
* Match Go image ext checks: webp resize only, no crop; empty healthz body
* Match Go Prometheus metric names and add missing handler counter constants
* Match Go ReplicaPlacement short string parsing with zero-padding
* Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS
* Add walk_ecx_stats for accurate EC volume file counts and size
* Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision
* Match Go needle map: unconditional delete counter, fix redb idx walk offset
* Add CompactMapSegment overflow panic guard matching Go
* Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback
* Match Go 304 Not Modified: return bare status with no headers
* Match Go JWT error message: use "wrong jwt" instead of detailed error
* Match Go read handler bare 400, delete error prefix, download throttle timeout
* Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix
* Match Go heartbeat: keep is_heartbeating on error, add EC shard identification
* Match Go needle ReadBytes V2: tolerate EOF on truncated body
* Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard
* Match Go DeleteCollection: propagate destroy errors
* Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code
* Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding
* Match Go Guard isWriteActive: combine whitelist and signing key check
* Match Go DeleteCollectionMetrics: use partial label matching
* Match Go heartbeat: send state-only delta on volume state changes
* Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking
* Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs
* Add read_ec_shard_needle for full needle reconstruction from local EC shards
* Make heartbeat master config helpers pub for VolumeCopy preallocation
* Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording
* Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error
* Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling
* Add rebuild_ecx_file for EC index reconstruction from data shards
* Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation
* Add EC volume read and delete support in HTTP handlers
* Add per-shard EC mount/unmount, location predicate search, idx directory for EC
* Add CheckVolumeDataIntegrity on volume load matching Go
* Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type
* Match Go heartbeat metric accounting
* Match Go duplicate UUID heartbeat retries
* Delete expired EC volumes during heartbeat
* Match Go volume heartbeat pruning
* Honor master preallocate in volume max
* Report remote storage info in heartbeats
* Emit EC heartbeat deltas on shard changes
* Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space
* Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns
* Match Go VolumeUnmount: idempotent success when volume not found
* Match Go TTL Display: return empty string when unit is Empty
Go checks `t.Unit == Empty` separately and returns "" for TTLs
with nonzero count but Empty unit. Rust only checked is_empty()
(count==0 && unit==0), so count>0 with unit=0 would format as
"5 " instead of "".
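The fix amounts to checking the unit on its own, before looking at the count. A minimal sketch (type and field names here are illustrative, not the actual crate types):

```rust
// Sketch of the corrected Display logic: the unit is checked on its
// own, so count > 0 with an Empty unit still formats as "".
#[derive(Clone, Copy, PartialEq)]
enum TtlUnit { Empty, Minute, Hour, Day, Week }

struct Ttl { count: u8, unit: TtlUnit }

impl std::fmt::Display for Ttl {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Match Go: bail out on an Empty unit regardless of count.
        if self.unit == TtlUnit::Empty {
            return Ok(());
        }
        let suffix = match self.unit {
            TtlUnit::Empty => unreachable!(),
            TtlUnit::Minute => "m",
            TtlUnit::Hour => "h",
            TtlUnit::Day => "d",
            TtlUnit::Week => "w",
        };
        write!(f, "{}{}", self.count, suffix)
    }
}
```

With this, `Ttl { count: 5, unit: TtlUnit::Empty }` renders as `""` rather than `"5 "`.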
* Match Go error behavior for truncated needle data in read_body_v2
Go's readNeedleDataVersion2 returns "index out of range %d" errors
(indices 1-7) when needle body or metadata fields are truncated.
Rust was silently tolerating truncation and returning Ok. Now returns
NeedleError::IndexOutOfRange with the matching index for each field.
* Match Go download throttle: return JSON error instead of plain text
* Match Go crop params: default x1/y1 to 0 when not provided
* Match Go ScrubEcVolume: accumulate total_files from EC shards
* Match Go ScrubVolume: count total_files even on scrub error
* Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif
* Match Go VolumeTailSender: send needle_header on every chunk
* Match Go read_super_block: apply replication override from .vif
* Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption
* Match Go WriteNeedleBlob: dedup check before writing during replication
* handlers: use meta-only reads for HEAD
* handlers: align range parsing and responses with Go
* handlers: align upload parsing with Go
* deps: enable webp support
* Make 5bytes the default feature for idx entry compatibility
* Match Go TTL: preserve original unit when count fits in byte
* Fix EC locate_needle: use get_actual_size for full needle size
* Fix raw body POST: only parse multipart when Content-Type contains form-data
* Match Go ReceiveFile: return protocol errors in response body, not gRPC status
* add docs
* Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating
* Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing
* Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection
* Match Go privateStoreHandler: use proper JSON error for unsupported methods
* Match Go Destroy: add only_empty parameter to reject non-empty volume deletion
* Fix compilation: set_read_only_persist and set_writable return ()
These methods fire-and-forget save_vif internally, so gRPC callers
should not try to chain .map_err() on the unit return type.
* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif
* Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth
The gRPC VolumeDelete handler had a pre-check for only_empty but then
passed false to store.delete_volume(), bypassing the store-level check.
Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same
for defense in depth against TOCTOU races (though the store write lock
makes this unlikely).
* Match Go ProcessRangeRequest: return full content for empty/oversized ranges
Go returns nil from ProcessRangeRequest when ranges are empty or total
range size exceeds content length, causing the caller to serve the full
content as a normal 200 response. Rust was returning an empty 200 body.
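The decision can be reduced to a small predicate. A hedged sketch (names and types assumed, not the server's actual signatures), mirroring Go's nil return so the caller falls back to a plain 200:

```rust
// Returns the ranges to serve, or None when the caller should fall
// back to a full-content 200 response (matching Go's nil return).
struct ByteRange { start: u64, length: u64 }

fn process_range_request(ranges: Vec<ByteRange>, content_length: u64) -> Option<Vec<ByteRange>> {
    if ranges.is_empty() {
        return None; // no usable range: serve full content
    }
    let total: u64 = ranges.iter().map(|r| r.length).sum();
    if total > content_length {
        return None; // combined ranges exceed content: serve full content
    }
    Some(ranges)
}
```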
* Match Go Query: quote JSON keys in output records
Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}.
Rust was producing invalid JSON with unquoted keys like {name:"Alice"}.
* Match Go VolumeCopy: reject when no suitable disk location exists
Go returns ErrVolumeNoSpaceLeft when no location matches the disk type
and has sufficient space. Rust had an unsafe fallback that silently
picked the first location regardless of type or available space.
* Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete
Go checks v.noWriteOrDelete before proceeding with needle deletion,
returning "volume is read only" if true. Rust was skipping this check.
* Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors
Two fixes: (1) Go prefers HardDriveType disk location for EC volumes,
falling back to first location. Returns "no storage location available"
when no locations exist. (2) Write failures are now response-level
errors (in response body) instead of gRPC status errors, matching Go.
* Match Go CopyFile: sync EC volume journal to disk before copying
Go calls ecVolume.Sync() before copying EC volume files to ensure the
.ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and
call it in the CopyFile EC branch.
* Match Go readSuperBlock: propagate replication parse errors
Go returns an error when parsing the replication string from the .vif
file fails. Rust was silently ignoring the parse failure and using the
super block's replication as-is.
* Match Go TTL expiry: remove append_at_ns > 0 guard
Go computes TTL expiry from AppendAtNs without guarding against zero.
When append_at_ns is 0, the expiry is epoch + TTL which is in the past,
correctly returning NotFound. Rust's extra guard skipped the check,
incorrectly returning success for such needles.
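The arithmetic behind this is worth spelling out. A simplified sketch (field and parameter names assumed) showing why `append_at_ns == 0` correctly reads as expired:

```rust
// Expiry is computed directly from append_at_ns with no zero guard,
// so append_at_ns == 0 yields an epoch-relative expiry far in the
// past and the needle reads as NotFound.
fn is_expired(append_at_ns: u64, ttl_minutes: u64, now_unix: u64) -> bool {
    if ttl_minutes == 0 {
        return false; // no TTL configured
    }
    let written_at_secs = append_at_ns / 1_000_000_000;
    now_unix >= written_at_secs + ttl_minutes * 60
}
```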
* Match Go delete_collection: skip volumes with compaction in progress
Go checks !v.isCompactionInProgress.Load() before destroying a volume
during collection deletion, skipping compacting volumes. Also changed
destroy errors to log instead of aborting the entire collection delete.
* Match Go MarkReadonly/MarkWritable: always notify master even on local error
Go always notifies the master regardless of whether the local
set_read_only_persist or set_writable step fails. The Rust code was
using `?` which short-circuited on error, skipping the final master
notification. Save the result and defer the `?` until after the
notify call.
* Match Go PostHandler: return 500 for all write errors
Go returns 500 (InternalServerError) for all write failures. Rust was
returning 404 for volume-not-found and 403 for read-only volumes.
* Match Go makeupDiff: validate .cpd compaction revision is old + 1
Go reads the new .cpd file's super block and verifies the compaction
revision is exactly old + 1. Rust only validated the old revision.
* Match Go VolumeStatus: check data backend before returning status
Go checks v.DataBackend != nil before building the status response,
returning an error if missing. Rust was silently returning size 0.
* Match Go PostHandler: always include mime field in upload response JSON
Go always serializes the mime field even when empty ("mime":""). Rust was
omitting it when empty due to Option<String> with skip_serializing_if.
* Match Go FindFreeLocation: account for EC shards in free slot calculation
Go subtracts EC shard equivalents when computing available volume slots.
Rust was only comparing volume count, potentially over-counting free
slots on locations with many EC shards.
* Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods
Go records the method as INVALID in metrics for unsupported HTTP methods.
Rust was using the actual method name.
* Match Go volume: add commit_compact guard and scrub data size validation
Two fixes: (1) commit_compact now checks/sets is_compacting flag to
prevent concurrent commits, matching Go's CompareAndSwap guard.
(2) scrub now validates total needle sizes against .dat file size.
* Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check
Three fixes: (1) VolumeTailSender now propagates binary search errors
instead of silently falling back to start. (2) VolumeEcShardsInfo
returns entries for all shard slots including unmounted. (3)
VolumeEcShardRead checks .ecx index for deletions instead of .ecj.
* Match Go metrics: add BuildInfo gauge and connection tracking functions
Go exposes a BuildInfo Prometheus metric with version labels, and tracks
open connections via stats.ConnectionOpen/Close. Added both to Rust.
* Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid()
Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles
with size==0 (live but anomalous) can still be deleted. The Rust code
was using is_valid() which returns false for size==0, preventing
deletion of such needles.
* Match Go fitTtlCount: always normalize TTL to coarsest unit
Go's fitTtlCount always converts to seconds first, then finds the
coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an
early return for count<=255 that skipped normalization, producing
different binary encodings for the same duration.
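The normalization described above can be sketched as follows. This is a reduced illustration (unit table and names are assumptions; Go's TTL also supports month/year units not shown here): convert to seconds, then pick the coarsest unit that divides evenly with a count that fits in one byte.

```rust
// Coarsest-unit-first table: week, day, hour, minute (in seconds).
const UNITS: &[(u32, char)] = &[
    (7 * 24 * 3600, 'w'),
    (24 * 3600, 'd'),
    (3600, 'h'),
    (60, 'm'),
];

// Normalize a duration in seconds to (count, unit); None means EMPTY TTL.
fn fit_ttl_count(seconds: u32) -> Option<(u8, char)> {
    if seconds == 0 {
        return None; // zero duration: EMPTY TTL
    }
    for &(unit_secs, unit) in UNITS {
        let count = seconds / unit_secs;
        if count > 0 && count <= 255 && seconds % unit_secs == 0 {
            return Some((count as u8, unit));
        }
    }
    None
}
```

This reproduces the examples in the text: 120m (7200 s) becomes 2h, 7d becomes 1w, 24h becomes 1d, while 90m stays 90m because 5400 s does not divide into whole hours.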
* Match Go BuildInfo metric: correct name and add missing labels
Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build,
Name=info) with labels [version, commit, sizelimit, goos, goarch].
Rust had SeaweedFS_volumeServer_buildInfo with only [version].
* Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag
- UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty)
- UploadResult.contentMd5: only include when request provided Content-MD5 header
- Content-MD5 response header: only set when request provided it
- DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType)
to match Go's protobuf JSON marshaling
- Chunk manifest: preserve needle ETag in expanded response headers
* Match Go volume: fix version(), integrity check, scrub, and commit_compact
- version(): use self.version() instead of self.super_block.version in
read_all_needles, check_volume_data_integrity, scan_raw_needles_from
to respect volumeInfo.version override
- check_volume_data_integrity: initialize healthy_index_size to idx_size
(matching Go) and continue on EOF instead of returning error
- scrub(): count deleted needles in total_read since they still occupy
space in the .dat file (matches Go's totalRead += actualSize for deleted)
- commit_compact: clean up .cpd/.cpx files on makeup_diff failure
(matches Go's error path cleanup)
* Match Go write queue: add 4MB batch byte limit
Go's startWorker breaks the batch at either 128 requests or 4MB of
accumulated write data. Rust only had the 128-request limit, allowing
large writes to accumulate unbounded latency.
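The batch boundary can be expressed as a pure function. An illustrative sketch (limits from the text, helper name assumed); the real worker drains a channel, but the cut-off logic is the same:

```rust
const MAX_BATCH_REQUESTS: usize = 128;
const MAX_BATCH_BYTES: usize = 4 * 1024 * 1024;

// Given the payload sizes of pending writes, return how many requests
// the next batch takes: stop at 128 requests or 4 MB of accumulated
// data, whichever comes first, but always ship at least one request.
fn next_batch_len(pending_sizes: &[usize]) -> usize {
    let mut bytes = 0;
    let mut n = 0;
    for &size in pending_sizes {
        if n >= MAX_BATCH_REQUESTS || bytes + size > MAX_BATCH_BYTES {
            break;
        }
        bytes += size;
        n += 1;
    }
    // A single oversized write still forms a batch of one.
    n.max(1).min(pending_sizes.len())
}
```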
* Add TTL normalization tests for Go parity verification
Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even
when count fits in a byte, matching Go's fitTtlCount behavior.
* Match Go FindFreeLocation: account for EC shards in free slot calculation
Go's free volume count subtracts both regular volumes and EC volumes
from max_volume_count. Rust was only counting regular volumes, which
could over-report available slots when EC shards are mounted.
* Match Go EC volume: mark deletions in .ecx and replay .ecj at startup
Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index
in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition
to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj
entries into .ecx on startup, then removes the .ecj file.
Rust was only appending to .ecj without marking .ecx, which meant
deleted EC needles remained readable via .ecx binary search. This
fix:
- Opens .ecx in read/write mode (was read-only)
- Adds mark_needle_deleted_in_ecx: binary search + in-place write
- Calls it from journal_delete before appending to .ecj
- Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup
* Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14
Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by
default, but custom EC configs via .vif can have more shards (up to
MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files
are checked regardless of EC configuration.
* Match Go EC locate: subtract 1 from shard size and use datFileSize override
Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to
LocateData (shards are padded, -1 avoids overcounting large block
rows). When datFileSize is known, Go uses datFileSize/DataShards
instead. Rust was passing the raw shard file size without adjustment.
* Fix TTL parsing and DiskStatus field names to match Go exactly
TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d,
not 1w) and errors on count > 255. The previous normalization change
was incorrect — Go only normalizes internally via fitTtlCount, not
during string parsing.
DiskStatus: Go uses encoding/json on protobuf structs, which reads
the json struct tags (snake_case: percent_free, percent_used,
disk_type), not the protobuf JSON names (camelCase). Revert to
snake_case to match Go's actual output.
* Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first
Match Go's volume_grpc_client_to_master.go behavior:
1. Only trigger leader redirect when the leader address differs from the
current master (prevents unnecessary reconnect loops when master confirms
its own address).
2. Process duplicated_uuids before leader redirect check, matching Go's
ordering where duplicate UUID detection takes priority.
* Remove SetState version check to match Go behavior
Go's SetState unconditionally applies the state without any version
mismatch check. The Rust version had an extra optimistic concurrency
check that would reject valid requests from Go clients that don't
track versions.
* Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL
Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes
to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w,
24h->1d). The Rust version was preserving the original unit, producing
different binary encodings on disk and in heartbeat messages.
* Always return Content-MD5 header and JSON field on successful writes
Go always sets Content-MD5 in the response regardless of whether the
request included it. The Rust version was conditionally including it
only when the request provided Content-MD5.
* Include name and size in UploadResult JSON even when empty/zero
Go's encoding/json always includes empty strings and zero values in
the upload response. The Rust version was using skip_serializing_if
to omit them, causing JSON structure differences.
* Include deleted needles in scan_raw_needles_from to match Go
Go's ScanVolumeFileFrom visits ALL needles including deleted ones.
Skipping deleted entries during incremental copy would cause tombstones
to not be propagated, making deleted files reappear on the receiving side.
* Match Go NeedleMap.Delete: always write tombstone to idx file
Go's NeedleMap.Delete unconditionally writes a tombstone entry to the
idx file and updates metrics, even if the needle doesn't exist or is
already deleted. This is important for replication where every delete
operation must produce an idx write. The Rust version was skipping the
tombstone write for non-existent or already-deleted needles.
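A behavioral sketch of the unconditional path (types simplified; the tombstone size value and field names here are placeholders, not the on-disk encoding):

```rust
use std::collections::HashMap;

struct NeedleMap {
    entries: HashMap<u64, u32>,   // needle id -> size (0 stands in for deleted)
    idx_writes: Vec<(u64, u32)>,  // appended idx records; tombstones use size 0 here
    deletion_counter: u64,
}

impl NeedleMap {
    fn new() -> Self {
        NeedleMap { entries: HashMap::new(), idx_writes: Vec::new(), deletion_counter: 0 }
    }

    // Matching Go: the tombstone append and counter update happen even
    // when the needle is unknown or already deleted, so replication
    // always observes one idx write per delete operation.
    fn delete(&mut self, id: u64) {
        self.idx_writes.push((id, 0));
        self.deletion_counter += 1;
        self.entries.insert(id, 0);
    }
}
```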
* Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest
* Title-case Seaweed-* pair keys to match Go HTTP header canonicalization
* Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType
* Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw()
* Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount
* Include disk-space-low in Volume.is_read_only() matching Go
* Log error on CIDR parse failure in whitelist matching Go's glog.Errorf
* Log cookie mismatch in gRPC Query matching Go's V(0).Infof
* Fix is_expired volume_size comparison to use < matching Go
Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than),
but Rust used `<=`. This meant Rust would fail to expire a volume that
is exactly SUPER_BLOCK_SIZE bytes.
* Apply Go's JWT expiry defaults: 10s write, 60s read
Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and
v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust
defaulted to 0 for both, which meant tokens would never expire when
security.toml has a signing key but omits expires_after_seconds.
* Stop [grpc.volume].ca from overriding [grpc].ca matching Go
Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e.
the [grpc] section. The [grpc.volume] section only provides cert and
key. Rust was also reading ca from [grpc.volume] which would silently
override the [grpc].ca value when both were present.
* Fix free_volume_count to use EC shard count matching Go
Was counting EC volumes instead of EC shards, which underestimates EC
space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1.
Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount.
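A direct transcription of that formula (10 is Go's default DataShardsCount):

```rust
const DATA_SHARDS: i64 = 10;

// Free volume slots, counting each EC shard as 1/DATA_SHARDS of a slot.
fn free_volume_count(max_volume_count: i64, volume_count: i64, ec_shard_count: i64) -> i64 {
    ((max_volume_count - volume_count) * DATA_SHARDS - ec_shard_count) / DATA_SHARDS
}
```

For example, max 10 with 3 volumes and one 14-shard EC volume gives (7*10 - 14)/10 = 5 free slots, not 6.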
* Include preallocate in compaction space check matching Go
Go uses max(preallocate, estimatedCompactSize) for the free space check.
Rust was only using the estimated volume size, which could start a
compaction that fails mid-way if preallocate exceeds the volume size.
* Check gzip magic bytes before setting Content-Encoding matching Go
Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent
(data starts with 0x1f 0x8b) before setting Content-Encoding: gzip.
Rust only checked Accept-Encoding, which could incorrectly declare
gzip encoding for non-gzip compressed data.
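The combined check is small. A sketch (helper names assumed) of both conditions; 0x1f 0x8b is the gzip magic number from RFC 1952:

```rust
// True only when the stored bytes actually begin with the gzip magic.
fn is_gzipped_content(data: &[u8]) -> bool {
    data.len() >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

// Declare Content-Encoding: gzip only when the client accepts gzip
// AND the payload really is gzip, matching Go's double check.
fn should_set_gzip_encoding(accept_encoding: &str, data: &[u8]) -> bool {
    accept_encoding.contains("gzip") && is_gzipped_content(data)
}
```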
* Only set upload response name when needle HasName matching Go
Go checks reqNeedle.HasName() before setting ret.Name. Rust always set
the name from the filename variable, which could return the fid portion
of the path as the name for raw PUT requests without a filename.
* Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation
Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount
is 0, treating it as unlimited. Rust was computing effective_free as
<= 0 for max==0, rejecting the location. This could fail volume
creation during early startup before the first heartbeat adjusts max.
* Read lastAppendAtNs from deleted V3 entries in integrity check
Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries
(verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity).
Rust was skipping deleted entries, which could result in a stale
last_append_at_ns if the last index entry is a deletion.
* Return empty body for empty/oversized range requests matching Go
Go's ProcessRangeRequest returns nil (empty body, 200 OK) when
parsed ranges are empty or combined range size exceeds total content
size. The Rust buffered path incorrectly returned the full file data
for both cases. The streaming path already handled this correctly.
* Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL
Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex()
(ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls
vs.store.ScrubEcVolume(). Rust was ignoring the mode and always
running verify_ec_shards. Now INDEX mode checks ecx index integrity
(sorted overlap detection + file size validation) without shard I/O,
while LOCAL/FULL modes run the existing shard verification.
* Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount
Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit
that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week}
which displays as "1w". Both Go and Rust normalize identically.
* Add version mismatch check to SetState matching Go's State.Update
Go's State.Update compares the incoming version with the stored
version and returns "version mismatch" error if they differ. This
provides optimistic concurrency control. The Rust implementation
was accepting any version unconditionally.
* Use unquoted keys in Query JSON output matching Go's json.ToJson
Go's json.ToJson produces records with unquoted keys like
{score:12} not {"score":12}. This is a custom format used
internally by SeaweedFS for query results.
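A minimal sketch of that record format for string-valued fields (function name assumed; numeric values would be emitted without quotes): keys go out bare, values as escaped JSON strings.

```rust
// Produce a record like {name:"Alice"} with unquoted keys, matching
// the custom format of Go's json.ToJson for query results.
fn to_query_record(fields: &[(&str, &str)]) -> String {
    let body: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("{}:{:?}", k, v)) // {:?} quotes and escapes the value
        .collect();
    format!("{{{}}}", body.join(","))
}
```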
* Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w
Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount
which converts 7 days to 1 week.
* Include ETag header in 304 Not Modified responses matching Go behavior
Go sets ETag on the response writer (via SetEtag) before the
If-Modified-Since and If-None-Match conditional checks, so both
304 response paths include the ETag header. The Rust implementation
was only adding ETag to 200 responses.
* Remove needle-name fallback in chunk manifest filename resolution
Go's tryHandleChunkedFile only falls back from URL filename to
manifest name. Rust had an extra fallback to needle.name that
Go does not perform, which could produce different
Content-Disposition filenames for chunk manifests.
* Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5
Go's jwt.ParseWithClaims validates the nbf claim when present,
rejecting tokens whose nbf is in the future. The Rust jsonwebtoken
crate defaults validate_nbf to false, so tokens with future nbf
were incorrectly accepted.
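The rule itself is simple; the real server enables `validate_nbf` on the jsonwebtoken crate's `Validation`, but the logic it turns on reduces to a comparison (sketch, names assumed):

```rust
// A token whose Not-Before timestamp lies in the future must be
// rejected; an absent nbf claim passes.
fn nbf_claim_ok(nbf: Option<u64>, now_unix: u64) -> bool {
    match nbf {
        Some(not_before) => not_before <= now_unix,
        None => true, // claim absent: nothing to validate
    }
}
```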
* Set isHeartbeating to true at startup matching Go's VolumeServer init
Go unconditionally sets isHeartbeating: true in the VolumeServer
struct literal. Rust was starting with false when masters are
configured, causing /healthz to return 503 until the first
heartbeat succeeds.
* Call store.close() on shutdown matching Go's Shutdown()
Go's Shutdown() calls vs.store.Close() which closes all volumes
and flushes file handles. The Rust server was relying on process
exit for cleanup, which could leave data unflushed.
* Include server ID in maintenance mode error matching Go's format
Go returns "volume server %s is in maintenance mode" with the
store ID. Rust was returning a generic "maintenance mode" message.
* Fix DiskType test: use HardDrive variant matching Go's HddType=""
Go maps both "" and "hdd" to HardDriveType (empty string). The
Rust enum variant is HardDrive, not Hdd. The test referenced a
nonexistent Hdd variant causing compilation failure.
* Do not include ETag in 304 responses matching Go's GetOrHeadHandler
Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match
304 return paths, so Go's 304 responses do not include the ETag header.
The Rust code was incorrectly including ETag in both 304 response paths.
* Return 400 on malformed query strings in PostHandler matching Go's ParseForm
Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when
the query string is malformed. Rust was silently falling back to empty
query params via unwrap_or_default().
* Load EC volume version from .vif matching Go's NewEcVolume
Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif
file. Rust was always using Version::current() (V3), which would produce
wrong needle actual size calculations for volumes created with V1 or V2.
* Sync .ecx file before close matching Go's EcVolume.Close
Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion
marks are flushed to disk. Without this, deletion marks written via
MarkNeedleDeleted could be lost on crash.
* Validate SuperBlock extra data size matching Go's Bytes() guard
Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent
corrupt super block headers. Rust was silently truncating via u16 cast,
which would write an incorrect extra_size field.
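A sketch of the guard (function name assumed): the extra-size field is a u16, so anything above 256*256 - 2 bytes is rejected up front instead of being silently truncated by the cast.

```rust
const MAX_EXTRA_SIZE: usize = 256 * 256 - 2;

// Validate before casting: returns the u16 size or an error, rather
// than letting `as u16` wrap a too-large length.
fn validate_extra_size(extra: &[u8]) -> Result<u16, String> {
    if extra.len() > MAX_EXTRA_SIZE {
        return Err(format!("extra data too large: {} bytes", extra.len()));
    }
    Ok(extra.len() as u16)
}
```

Go calls glog.Fatalf here; returning a Result is the idiomatic Rust equivalent for a recoverable caller.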
* Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98
Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable
to unauthenticated remote DoS via panic in QUIC transport parameter
parsing.
* Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server
Go's r.FormValue() cannot read multipart text fields after
r.MultipartReader() consumes the body, so ts/ttl sent as multipart
form fields only work with the Rust volume server. Skip this test
when VOLUME_SERVER_IMPL != "rust" to fix CI failure.
* Flush .ecx in EC volume sync_to_disk matching Go's Sync()
Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index
to disk. The Rust version only flushed .ecj, leaving in-place deletion
marks in .ecx unpersisted until close(). This could cause data
inconsistency if the server crashes after marking a needle deleted in
.ecx but before close().
* Remove .vif file in EC volume destroy matching Go's Destroy()
Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust
version only removed .ecx and .ecj, leaving orphaned .vif files on
disk after EC volume destruction (e.g., after TTL expiry).
* Fix is_expired to use <= for SuperBlockSize check matching Go
Go checks contentSize <= SuperBlockSize to detect empty volumes (no
needles). Rust used < which would incorrectly allow a volume with
exactly SuperBlockSize bytes (header only, no data) to proceed to
the TTL expiry check and potentially be marked as expired.
* Fix read_append_at_ns to read timestamps from tombstone entries
Go reads the full needle body for all entries including tombstones
(deleted needles with size=0) to extract the actual AppendAtNs
timestamp. The Rust version returned 0 early for size <= 0 entries,
which would cause the binary search in incremental copy to produce
incorrect results for positions containing deleted needles.
Now uses get_actual_size to compute the on-disk size (which handles
tombstones correctly) and only returns 0 when the actual size is 0.
* Add X-Request-Id response header matching Go's requestIDMiddleware
Go sets both X-Request-Id and x-amz-request-id response headers.
The Rust server only set x-amz-request-id, missing X-Request-Id.
* Add skip_serializing_if for UploadResult name and size fields
Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty",
omitting these fields from JSON when they are zero values (empty
string / 0). The Rust struct always serialized them, producing
"name":"" and "size":0 where Go would omit them.
* Support JSONP/pretty-print for write success responses
Go's writeJsonQuiet checks for callback (JSONP) and pretty query
parameters on all JSON responses including write success. The Rust
write success path used axum::Json directly, bypassing JSONP and
pretty-print support. Now uses json_result_with_query to match Go.
* Include actual limit in file size limit error message
Go returns "file over the limited %d bytes" with the actual limit
value included. Rust returned a generic "file size limit exceeded"
without the limit value, making it harder to debug.
* Extract extension from 2-segment URL paths for image operations
Go's parseURLPath extracts the file extension from all URL formats
including 2-segment paths like /vid,fid.jpg. The Rust version only
handled 3-segment paths (/vid/fid/filename.ext), so extensions in
2-segment paths were lost. This caused image resize/crop operations
requested via query params to be silently skipped for those paths.
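Extracting the extension from the last path segment handles both layouts uniformly. A sketch (helper name and example fid are illustrative):

```rust
// Extension of the final path segment, for both /vid,fid.ext and
// /vid/fid/filename.ext layouts; empty string when there is no dot.
fn extension_of(path: &str) -> &str {
    let file = path.rsplit('/').next().unwrap_or(path);
    match file.rfind('.') {
        Some(dot) => &file[dot..],
        None => "",
    }
}
```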
* Add size_hint to TrackedBody so throttled downloads get Content-Length
TrackedBody (used for download throttling) did not implement
size_hint(), causing HTTP/1.1 to fall back to chunked transfer
encoding instead of setting Content-Length. Go always sets
Content-Length explicitly for non-range responses.
* Add Last-Modified, pairs, and S3 headers to chunk manifest responses
Go sets Last-Modified, needle pairs, and S3 pass-through headers on
the response writer BEFORE calling tryHandleChunkedFile. Since the
Rust chunk manifest handler created fresh response headers and
returned early, these headers were missing from chunk manifest
responses. Now passes last_modified_str into the chunk manifest
handler and applies pairs and S3 pass-through query params
(response-cache-control, response-content-encoding, etc.) to the
chunk manifest response headers.
* Fix multipart fallback to use first part data when no filename
Go reads the first part's data unconditionally, then looks for a
part with a filename. If none found, Go uses the first part's data
(with empty filename). Rust only captured parts with filenames, so
when no part had a filename it fell back to the raw multipart body
bytes (including boundary delimiters), producing corrupt needle data.
* Set HasName and HasMime flags for empty values matching Go
Go's CreateNeedleFromRequest sets HasName and HasMime flags even when
the filename or MIME type is empty (len < 256 is true for len 0).
Rust skipped empty values, causing the on-disk needle format to
differ: Go-written needles include extra bytes for the empty name/mime
size fields, changing the serialized needle size in the idx entry.
This ensures binary format compatibility between Go and Rust servers.
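A sketch of the flag logic described above; the flag constants are placeholder values for illustration, not the actual bits of the needle format:

```rust
// Illustrative flag bits (hypothetical values).
const FLAG_HAS_NAME: u8 = 0x02;
const FLAG_HAS_MIME: u8 = 0x04;

// Go sets each flag whenever len < 256 — which includes len == 0,
// so empty name/mime still serialize their (zero) size bytes.
fn needle_flags(name: &[u8], mime: &[u8]) -> u8 {
    let mut flags = 0;
    if name.len() < 256 {
        flags |= FLAG_HAS_NAME;
    }
    if mime.len() < 256 {
        flags |= FLAG_HAS_MIME;
    }
    flags
}
```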
* Add is_stopping guard to vacuum_volume_commit matching Go
Go's CommitCompactVolume (store_vacuum.go L53-54) checks
s.isStopping before committing compaction to prevent file
swaps during shutdown. The Rust handler was missing this
check, which could allow compaction commits while the
server is stopping.
* Remove disk_type from required status fields since Go omits it
Go's default DiskType is "" (HardDriveType), and protobuf's omitempty
tag causes empty strings to be dropped from JSON output.
* test: honor rust env in dual volume harness
* grpc: notify master after volume lifecycle changes
* http: proxy to replicas before download-limit timeout
* test: pass readMode to rust volume harnesses
* fix store free-location predicate selection
* fix volume copy disk placement and heartbeat notification
* fix chunk manifest delete replication
* fix write replication to survive client disconnects
* fix download limit proxy and wait flow
* fix crop gating for streamed reads
* fix upload limit wait counter behavior
* fix chunk manifest image transforms
* fix has_resize_ops to check width/height > 0 instead of is_some()
Go's shouldResizeImages condition is `width > 0 || height > 0`, so
`?width=0` correctly evaluates to false. Rust was using `is_some()`
which made `?width=0` evaluate to true, unnecessarily disabling
streaming reads for those requests.
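The corrected condition can be sketched as (parameter shapes are an assumption about how the query values arrive):

```rust
// Mirrors Go's shouldResizeImages: `width > 0 || height > 0`,
// so ?width=0 correctly evaluates to false.
fn has_resize_ops(width: Option<u32>, height: Option<u32>) -> bool {
    width.unwrap_or(0) > 0 || height.unwrap_or(0) > 0
}
```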
* fix Content-MD5 to only compute and return when provided by client
Go only computes the MD5 of uncompressed data when a Content-MD5
header or multipart field is provided. Rust was always computing and
returning it. Also fix the mismatch error message to include size,
matching Go's format.
* fix save_vif to compute ExpireAtSec from TTL
Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds
when the volume has a TTL. The save_vif path (used by set_read_only
and set_writable) was missing this computation, causing .vif files
to be written without the correct expiration timestamp for TTL volumes.
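The expiration computation amounts to (helper name and zero-means-no-TTL convention are illustrative):

```rust
// ExpireAtSec = now + ttlSeconds when the volume has a TTL,
// otherwise left unset (0).
fn expire_at_sec(now_unix: u64, ttl_seconds: u64) -> u64 {
    if ttl_seconds > 0 { now_unix + ttl_seconds } else { 0 }
}
```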
* fix set_writable to not modify no_write_can_delete
Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists.
Rust was additionally setting no_write_can_delete=has_remote_file,
which could incorrectly change the write mode for remote-file volumes
when the master explicitly asks to make the volume writable.
* fix write_needle_blob_and_index to error on too-small V3 blob
Go returns an error when the needle blob is too small for timestamp
patching. Rust was silently skipping the patch and writing the blob
with a stale/zero timestamp, which could cause data integrity issues
during incremental replication that relies on AppendAtNs ordering.
* fix VolumeEcShardsToVolume to validate dataShards range
Go validates that dataShards is > 0 and <= MaxShardCount before
proceeding with EC-to-volume reconstruction. Without this check,
a zero or excessively large data_shards value could cause confusing
downstream failures.
* fix destroy to use VolumeError::NotEmpty instead of generic Io error
The dedicated NotEmpty variant exists in the enum but was not being
used. This makes error matching consistent with Go's ErrVolumeNotEmpty.
* fix SetState to persist state to disk with rollback on failure
Go's State.Update saves VolumeServerState to a state.pb file after
each SetState call, and rolls back the in-memory state if persistence
fails. Rust was only updating in-memory atomics, so maintenance mode
would be lost on server restart. Now saves protobuf-encoded state.pb
and loads it on startup.
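The save-with-rollback pattern can be sketched as follows (struct and function names are hypothetical; the real code persists a protobuf-encoded state.pb):

```rust
use std::io;

struct ServerState {
    maintenance: bool,
}

// Update in-memory state, persist, and roll back on persist failure,
// mirroring Go's State.Update semantics.
fn set_state(
    state: &mut ServerState,
    maintenance: bool,
    persist: impl Fn(&ServerState) -> io::Result<()>,
) -> io::Result<()> {
    let previous = state.maintenance;
    state.maintenance = maintenance;
    if let Err(e) = persist(state) {
        state.maintenance = previous; // roll back in-memory state
        return Err(e);
    }
    Ok(())
}
```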
* fix VolumeTierMoveDatToRemote to close local dat backend after upload
Go calls v.LoadRemoteFile() after saving volume info, which closes
the local DataBackend before transitioning to remote storage. Without
this, the volume holds a stale file handle to the deleted local .dat
file, causing reads to fail until server restart.
* fix VolumeTierMoveDatFromRemote to close remote dat backend after download
Go calls v.DataBackend.Close() and sets DataBackend=nil after removing
the remote file reference. Without this, the stale remote backend
state lingers and reads may not discover the newly downloaded local
.dat file until server restart.
* fix redirect to use internal url instead of public_url
Go's proxyReqToTargetServer builds the redirect Location header from
loc.Url (the internal URL), not publicUrl. Using public_url could
cause redirect failures when internal and external URLs differ.
* fix redirect test and add state_file_path to integration test
Update redirect unit test to expect internal url (matching the
previous fix). Add missing state_file_path field to the integration
test VolumeServerState constructor.
* fix FetchAndWriteNeedle to await all writes before checking errors
Go uses a WaitGroup to await all writes (local + replicas) before
checking errors. Rust was short-circuiting on local write failure,
which could leave replica writes in-flight without waiting for
completion.
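The wait-then-check structure can be sketched with std threads standing in for async tasks (the real code is async; this only shows the ordering guarantee):

```rust
use std::thread;

// Spawn every write (local + replicas), join them all, and only then
// collect errors — mirroring Go's WaitGroup usage.
fn write_all(
    writes: Vec<Box<dyn FnOnce() -> Result<(), String> + Send>>,
) -> Vec<String> {
    let handles: Vec<_> = writes.into_iter().map(|w| thread::spawn(w)).collect();
    let mut errors = Vec::new();
    for h in handles {
        if let Err(e) = h.join().expect("write task panicked") {
            errors.push(e); // inspected only after every write finished
        }
    }
    errors
}
```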
* fix shutdown to send deregister heartbeat before pre_stop delay
Go's StopHeartbeat() closes stopChan immediately on interrupt, causing
the heartbeat goroutine to send the deregister heartbeat right away,
before the preStopSeconds delay. Rust was only setting is_stopping=true
without waking the heartbeat loop, so the deregister was delayed until
after the pre_stop sleep. Now we call volume_state_notify.notify_one()
to wake the heartbeat immediately.
* fix heartbeat response ordering to check duplicate UUIDs first
Go processes heartbeat responses in this order: DuplicatedUuids first,
then volume options (prealloc/size limit), then leader redirect. Rust
was applying volume options before checking for duplicate UUIDs, which
meant volume option changes would take effect even when the response
contained a duplicate UUID error that should cause an immediate return.
* fix: blocked test thread
* fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories
Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which
fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode
bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check
logic error).
* fix: match Go Content-MD5 mismatch error message format
Go uses "Content-MD5 did not match md5 of file data expected [X]
received [Y] size Z" while Rust had a shorter format. Match the
exact Go error string so clients see identical messages.
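The exact string quoted above, as a format helper:

```rust
// Reproduces the Go error string verbatim.
fn md5_mismatch_error(expected: &str, received: &str, size: usize) -> String {
    format!(
        "Content-MD5 did not match md5 of file data expected [{}] received [{}] size {}",
        expected, received, size
    )
}
```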
* fix: match Go Bearer token length check (> 7, not >= 7)
Go requires len(bearer) > 7 ensuring at least one char after
"Bearer ". Rust used >= 7 which would accept an empty token.
* fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4
aws-sdk-s3's default "rustls" feature enables tls-rustls in
aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21
→ rustls-webpki 0.101.7, moderate CRL advisory). Replace with
explicit default-https-client which uses only rustls 0.23 /
rustls-webpki 0.103.9.
* fix: use uploaded filename for auto-compression extension detection
Go extracts the file extension from pu.FileName (the uploaded
filename) for auto-compression decisions. Rust was using the URL
path, which typically has no extension for SeaweedFS file IDs.
* fix: add CRC legacy Value() backward-compat check on needle read
Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) !=
crc.Value(). The Value() path is a deprecated transform for compat
with seaweed versions prior to commit 056c480eb. Rust had the
legacy_value() method but wasn't using it in validation.
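A sketch of the double check; the masked-CRC transform inside `legacy_value` below is an assumption made only so the example runs, not necessarily the real transform in the codebase:

```rust
// Hypothetical stand-in for the deprecated CRC Value() transform.
fn legacy_value(crc: u32) -> u32 {
    ((crc >> 15) | (crc << 17)).wrapping_add(0xa282_ead8)
}

// Mirrors Go's validation: accept either the direct CRC or the
// legacy transformed value for backward compatibility.
fn crc_ok(stored: u32, computed: u32) -> bool {
    stored == computed || stored == legacy_value(computed)
}
```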
* fix: remove /stats/* endpoints to match Go (commented out since L130)
Go's volume_server.go has the /stats/counter, /stats/memory, and
/stats/disk endpoints commented out (lines 130-134). Remove them
from the Rust router along with the now-unused whitelist_guard
middleware.
* fix: filter application/octet-stream MIME for chunk manifests
Go's tryHandleChunkedFile (L334) filters out application/octet-stream
from chunk manifest MIME types, falling back to extension-based
detection. Rust was returning the stored MIME as-is for manifests.
* fix: VolumeMarkWritable returns error before notifying master
Go returns early at L200 if MarkVolumeWritable fails, before
reaching the master notification at L206. Rust was notifying master
even on failure, creating inconsistent state where master thinks
the volume is writable but local marking failed.
* fix: check volume existence before maintenance in MarkReadonly/Writable
Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255)
look up the volume first, then call makeVolumeReadonly/Writable which
checks maintenance. Rust was checking maintenance first, returning
"maintenance mode" instead of "not found" for missing volumes.
* feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360)
Add the mark_broken_volumes_readonly flag from PR #8360:
- Sync proto field (tag 3) to local volume_server.proto
- After scrubbing, if flag is set, call makeVolumeReadonly on each
broken volume (notify master, mark local readonly, notify again)
- Collect errors via joined error semantics matching Go's errors.Join
- Factor out make_volume_readonly helper reused by both
VolumeMarkReadonly and ScrubVolume
Also refactors VolumeMarkReadonly to use the shared helper.
* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4)
CRL Distribution Point matching logic fix for moderate severity
advisory about CRLs not considered authoritative.
* test: update integration tests for removed /stats/* endpoints
Replace tests that expected /stats/* routes to return 200/401 with
tests confirming they now fall through to the store handler (400),
matching Go's commented-out stats endpoints.
* docs: fix misleading comment about default offset feature
The comment said "4-byte offsets unless explicitly built with 5-byte
support" but the default feature enables 5bytes. This is intentional
for production parity with Go -tags 5BytesOffset builds. Fix the
comment to match reality.
.github/workflows/rust-volume-server-tests.yml (new file, 242 lines, vendored)
@@ -0,0 +1,242 @@
name: "Rust Volume Server Tests"

on:
  pull_request:
    branches: [ master ]
    paths:
      - 'seaweed-volume/**'
      - 'test/volume_server/**'
      - 'weed/pb/volume_server.proto'
      - 'weed/pb/volume_server_pb/**'
      - '.github/workflows/rust-volume-server-tests.yml'
  push:
    branches: [ master, main ]
    paths:
      - 'seaweed-volume/**'
      - 'test/volume_server/**'
      - 'weed/pb/volume_server.proto'
      - 'weed/pb/volume_server_pb/**'
      - '.github/workflows/rust-volume-server-tests.yml'

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.ref }}
  cancel-in-progress: true

permissions:
  contents: read

jobs:
  rust-unit-tests:
    name: Rust Unit Tests
    runs-on: ubuntu-22.04
    timeout-minutes: 15

    steps:
      - name: Checkout code
        uses: actions/checkout@v6

      - name: Install protobuf compiler
        run: sudo apt-get update && sudo apt-get install -y protobuf-compiler

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-

      - name: Build Rust volume server
        run: cd seaweed-volume && cargo build --release

      - name: Run Rust unit tests
        run: cd seaweed-volume && cargo test

  rust-integration-tests:
    name: Rust Integration Tests
    runs-on: ubuntu-22.04
    timeout-minutes: 30

    steps:
      - name: Checkout code
        uses: actions/checkout@v6

      - name: Set up Go
        uses: actions/setup-go@v6
        with:
          go-version-file: 'go.mod'

      - name: Install protobuf compiler
        run: sudo apt-get update && sudo apt-get install -y protobuf-compiler

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-

      - name: Build Go weed binary
        run: |
          cd weed
          go build -o weed .
          chmod +x weed
          ./weed version

      - name: Build Rust volume binary
        run: cd seaweed-volume && cargo build --release

      - name: Run integration tests
        env:
          WEED_BINARY: ${{ github.workspace }}/weed/weed
          RUST_VOLUME_BINARY: ${{ github.workspace }}/seaweed-volume/target/release/weed-volume
        run: |
          echo "Running Rust volume server integration tests..."
          go test -v -count=1 -timeout=15m ./test/volume_server/rust/...

      - name: Collect logs on failure
        if: failure()
        run: |
          mkdir -p /tmp/rust-volume-server-it-logs
          find /tmp -maxdepth 1 -type d -name "seaweedfs_volume_server_it_*" -print -exec cp -r {} /tmp/rust-volume-server-it-logs/ \; || true

      - name: Archive logs on failure
        if: failure()
        uses: actions/upload-artifact@v7
        with:
          name: rust-volume-server-integration-test-logs
          path: /tmp/rust-volume-server-it-logs/
          if-no-files-found: warn
          retention-days: 7

      - name: Test summary
        if: always()
        run: |
          echo "## Rust Volume Server Integration Test Summary" >> "$GITHUB_STEP_SUMMARY"
          echo "- Suite: test/volume_server/rust" >> "$GITHUB_STEP_SUMMARY"
          echo "- Command: go test -v -count=1 -timeout=15m ./test/volume_server/rust/..." >> "$GITHUB_STEP_SUMMARY"

  rust-volume-go-tests:
    name: Go Tests with Rust Volume (${{ matrix.test-type }} - Shard ${{ matrix.shard }})
    runs-on: ubuntu-22.04
    timeout-minutes: 45
    strategy:
      fail-fast: false
      matrix:
        test-type: [grpc, http]
        shard: [1, 2, 3]

    steps:
      - name: Checkout code
        uses: actions/checkout@v6

      - name: Set up Go
        uses: actions/setup-go@v6
        with:
          go-version-file: 'go.mod'

      - name: Install protobuf compiler
        run: sudo apt-get update && sudo apt-get install -y protobuf-compiler

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-

      - name: Build Go weed binary
        run: |
          cd weed
          go build -o weed .
          chmod +x weed
          ./weed version

      - name: Build Rust volume binary
        run: cd seaweed-volume && cargo build --release

      - name: Run volume server integration tests with Rust volume
        env:
          WEED_BINARY: ${{ github.workspace }}/weed/weed
          RUST_VOLUME_BINARY: ${{ github.workspace }}/seaweed-volume/target/release/weed-volume
          VOLUME_SERVER_IMPL: rust
        run: |
          if [ "${{ matrix.test-type }}" == "grpc" ]; then
            if [ "${{ matrix.shard }}" == "1" ]; then
              TEST_PATTERN="^Test[A-H]"
            elif [ "${{ matrix.shard }}" == "2" ]; then
              TEST_PATTERN="^Test[I-S]"
            else
              TEST_PATTERN="^Test[T-Z]"
            fi
          else
            if [ "${{ matrix.shard }}" == "1" ]; then
              TEST_PATTERN="^Test[A-G]"
            elif [ "${{ matrix.shard }}" == "2" ]; then
              TEST_PATTERN="^Test[H-R]"
            else
              TEST_PATTERN="^Test[S-Z]"
            fi
          fi
          echo "Running Go volume server tests with Rust volume for ${{ matrix.test-type }} (Shard ${{ matrix.shard }}, pattern: ${TEST_PATTERN})..."
          go test -v -count=1 -tags 5BytesOffset -timeout=30m ./test/volume_server/${{ matrix.test-type }}/... -run "${TEST_PATTERN}"

      - name: Collect logs on failure
        if: failure()
        run: |
          mkdir -p /tmp/rust-volume-go-test-logs
          find /tmp -maxdepth 1 -type d -name "seaweedfs_volume_server_it_*" -print -exec cp -r {} /tmp/rust-volume-go-test-logs/ \; || true

      - name: Archive logs on failure
        if: failure()
        uses: actions/upload-artifact@v7
        with:
          name: rust-volume-go-test-logs-${{ matrix.test-type }}-shard${{ matrix.shard }}
          path: /tmp/rust-volume-go-test-logs/
          if-no-files-found: warn
          retention-days: 7

      - name: Test summary
        if: always()
        run: |
          if [ "${{ matrix.test-type }}" == "grpc" ]; then
            if [ "${{ matrix.shard }}" == "1" ]; then
              TEST_PATTERN="^Test[A-H]"
            elif [ "${{ matrix.shard }}" == "2" ]; then
              TEST_PATTERN="^Test[I-S]"
            else
              TEST_PATTERN="^Test[T-Z]"
            fi
          else
            if [ "${{ matrix.shard }}" == "1" ]; then
              TEST_PATTERN="^Test[A-G]"
            elif [ "${{ matrix.shard }}" == "2" ]; then
              TEST_PATTERN="^Test[H-R]"
            else
              TEST_PATTERN="^Test[S-Z]"
            fi
          fi
          echo "## Rust Volume - Go Test Summary (${{ matrix.test-type }} - Shard ${{ matrix.shard }})" >> "$GITHUB_STEP_SUMMARY"
          echo "- Suite: test/volume_server/${{ matrix.test-type }} (Pattern: ${TEST_PATTERN})" >> "$GITHUB_STEP_SUMMARY"
          echo "- Volume server: Rust (VOLUME_SERVER_IMPL=rust)" >> "$GITHUB_STEP_SUMMARY"
.github/workflows/rust_binaries_dev.yml (new file, 165 lines, vendored)
@@ -0,0 +1,165 @@
name: "rust: build dev volume server binaries"

on:
  push:
    branches: [ master ]
    paths:
      - 'seaweed-volume/**'
      - '.github/workflows/rust_binaries_dev.yml'

permissions:
  contents: read

jobs:

  cleanup:
    permissions:
      contents: write
    runs-on: ubuntu-latest
    steps:
      - name: Delete old Rust volume dev assets
        uses: mknejp/delete-release-assets@v1
        with:
          token: ${{ github.token }}
          tag: dev
          fail-if-no-assets: false
          assets: |
            weed-volume-*

  build-rust-volume-dev-linux:
    permissions:
      contents: write
    needs: cleanup
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        include:
          - target: x86_64-unknown-linux-gnu
            asset_suffix: linux-amd64

    steps:
      - uses: actions/checkout@v6

      - name: Install protobuf compiler
        run: sudo apt-get update && sudo apt-get install -y protobuf-compiler

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-dev-${{ matrix.target }}-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-dev-${{ matrix.target }}-

      - name: Set BUILD_TIME
        run: echo BUILD_TIME=$(date -u +%Y%m%d-%H%M) >> "$GITHUB_ENV"

      - name: Build Rust volume server (large disk)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: cd seaweed-volume && cargo build --release

      - name: Package large disk binary
        run: |
          cp seaweed-volume/target/release/weed-volume weed-volume-large-disk
          tar czf "weed-volume-large-disk-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz" weed-volume-large-disk
          rm weed-volume-large-disk

      - name: Build Rust volume server (normal)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: cd seaweed-volume && cargo build --release --no-default-features

      - name: Package normal binary
        run: |
          cp seaweed-volume/target/release/weed-volume weed-volume-normal
          tar czf "weed-volume-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz" weed-volume-normal
          rm weed-volume-normal

      - name: Upload dev release assets
        uses: softprops/action-gh-release@v2
        with:
          tag_name: dev
          prerelease: true
          files: |
            weed-volume-large-disk-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz
            weed-volume-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  build-rust-volume-dev-darwin:
    permissions:
      contents: write
    needs: build-rust-volume-dev-linux
    runs-on: macos-latest
    strategy:
      matrix:
        include:
          - target: aarch64-apple-darwin
            asset_suffix: darwin-arm64
          - target: x86_64-apple-darwin
            asset_suffix: darwin-amd64

    steps:
      - uses: actions/checkout@v6

      - name: Install protobuf compiler
        run: brew install protobuf

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          targets: ${{ matrix.target }}

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-dev-${{ matrix.target }}-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-dev-${{ matrix.target }}-

      - name: Set BUILD_TIME
        run: echo BUILD_TIME=$(date -u +%Y%m%d-%H%M) >> "$GITHUB_ENV"

      - name: Build Rust volume server (large disk)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: cd seaweed-volume && cargo build --release --target ${{ matrix.target }}

      - name: Package large disk binary
        run: |
          cp seaweed-volume/target/${{ matrix.target }}/release/weed-volume weed-volume-large-disk
          tar czf "weed-volume-large-disk-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz" weed-volume-large-disk
          rm weed-volume-large-disk

      - name: Build Rust volume server (normal)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: cd seaweed-volume && cargo build --release --target ${{ matrix.target }} --no-default-features

      - name: Package normal binary
        run: |
          cp seaweed-volume/target/${{ matrix.target }}/release/weed-volume weed-volume-normal
          tar czf "weed-volume-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz" weed-volume-normal
          rm weed-volume-normal

      - name: Upload dev release assets
        uses: softprops/action-gh-release@v2
        with:
          tag_name: dev
          prerelease: true
          files: |
            weed-volume-large-disk-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz
            weed-volume-${{ env.BUILD_TIME }}-${{ matrix.asset_suffix }}.tar.gz
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
.github/workflows/rust_binaries_release.yml (new file, 215 lines, vendored)
@@ -0,0 +1,215 @@
name: "rust: build versioned volume server binaries"

on:
  push:
    tags:
      - '*'

  workflow_dispatch:

permissions:
  contents: read

jobs:

  build-rust-volume-linux:
    permissions:
      contents: write
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        include:
          - target: x86_64-unknown-linux-gnu
            asset_suffix: linux_amd64
          - target: aarch64-unknown-linux-gnu
            asset_suffix: linux_arm64
            cross: true

    steps:
      - uses: actions/checkout@v6

      - name: Install protobuf compiler
        run: sudo apt-get update && sudo apt-get install -y protobuf-compiler

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          targets: ${{ matrix.target }}

      - name: Install cross-compilation tools
        if: matrix.cross
        run: |
          sudo apt-get install -y gcc-aarch64-linux-gnu
          echo "CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc" >> "$GITHUB_ENV"

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-release-${{ matrix.target }}-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-release-${{ matrix.target }}-

      - name: Build Rust volume server (large disk)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: |
          cd seaweed-volume
          cargo build --release --target ${{ matrix.target }}

      - name: Build Rust volume server (normal)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: |
          cd seaweed-volume
          cargo build --release --target ${{ matrix.target }} --no-default-features

      - name: Package binaries
        run: |
          # Large disk (default, 5bytes feature)
          cp seaweed-volume/target/${{ matrix.target }}/release/weed-volume weed-volume-large-disk
          tar czf weed-volume_large_disk_${{ matrix.asset_suffix }}.tar.gz weed-volume-large-disk
          rm weed-volume-large-disk

          # Normal volume size
          cp seaweed-volume/target/${{ matrix.target }}/release/weed-volume weed-volume-normal
          tar czf weed-volume_${{ matrix.asset_suffix }}.tar.gz weed-volume-normal
          rm weed-volume-normal

      - name: Upload release assets
        uses: softprops/action-gh-release@v2
        with:
          files: |
            weed-volume_large_disk_${{ matrix.asset_suffix }}.tar.gz
            weed-volume_${{ matrix.asset_suffix }}.tar.gz
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  build-rust-volume-darwin:
    permissions:
      contents: write
    runs-on: macos-latest
    strategy:
      matrix:
        include:
          - target: x86_64-apple-darwin
            asset_suffix: darwin_amd64
          - target: aarch64-apple-darwin
            asset_suffix: darwin_arm64

    steps:
      - uses: actions/checkout@v6

      - name: Install protobuf compiler
        run: brew install protobuf

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          targets: ${{ matrix.target }}

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-release-${{ matrix.target }}-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-release-${{ matrix.target }}-

      - name: Build Rust volume server (large disk)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: |
          cd seaweed-volume
          cargo build --release --target ${{ matrix.target }}

      - name: Build Rust volume server (normal)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: |
          cd seaweed-volume
          cargo build --release --target ${{ matrix.target }} --no-default-features

      - name: Package binaries
        run: |
          cp seaweed-volume/target/${{ matrix.target }}/release/weed-volume weed-volume-large-disk
          tar czf weed-volume_large_disk_${{ matrix.asset_suffix }}.tar.gz weed-volume-large-disk
          rm weed-volume-large-disk

          cp seaweed-volume/target/${{ matrix.target }}/release/weed-volume weed-volume-normal
          tar czf weed-volume_${{ matrix.asset_suffix }}.tar.gz weed-volume-normal
          rm weed-volume-normal

      - name: Upload release assets
        uses: softprops/action-gh-release@v2
        with:
          files: |
            weed-volume_large_disk_${{ matrix.asset_suffix }}.tar.gz
            weed-volume_${{ matrix.asset_suffix }}.tar.gz
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  build-rust-volume-windows:
    permissions:
      contents: write
    runs-on: windows-latest

    steps:
      - uses: actions/checkout@v6

      - name: Install protobuf compiler
        run: choco install protoc -y

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable

      - name: Cache cargo registry and target
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            seaweed-volume/target
          key: rust-release-windows-${{ hashFiles('seaweed-volume/Cargo.lock') }}
          restore-keys: |
            rust-release-windows-

      - name: Build Rust volume server (large disk)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: |
          cd seaweed-volume
          cargo build --release

      - name: Build Rust volume server (normal)
        env:
          SEAWEEDFS_COMMIT: ${{ github.sha }}
        run: |
          cd seaweed-volume
          cargo build --release --no-default-features

      - name: Package binaries
        shell: bash
        run: |
          cp seaweed-volume/target/release/weed-volume.exe weed-volume-large-disk.exe
          7z a weed-volume_large_disk_windows_amd64.zip weed-volume-large-disk.exe
          rm weed-volume-large-disk.exe

          cp seaweed-volume/target/release/weed-volume.exe weed-volume-normal.exe
          7z a weed-volume_windows_amd64.zip weed-volume-normal.exe
          rm weed-volume-normal.exe

      - name: Upload release assets
        uses: softprops/action-gh-release@v2
        with:
          files: |
            weed-volume_large_disk_windows_amd64.zip
            weed-volume_windows_amd64.zip
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
.gitignore (1 line added, vendored)
@@ -143,3 +143,4 @@ test/s3/iam/.test_env
 weed_bin
 telemetry/server/telemetry-server
 .aider*
+/seaweed-volume/docs
VOLUME_SERVER_RUST_PLAN.md (new file, 790 lines)
@@ -0,0 +1,790 @@
|
||||
# Execution Plan: SeaweedFS Volume Server — Go to Rust Port

## Scope Summary

| Component | Go Source | Lines (non-test) | Description |
|---|---|---|---|
| CLI & startup | `weed/command/volume.go` | 476 | ~40 CLI flags, server bootstrap |
| HTTP server + handlers | `weed/server/volume_server*.go` | 1,517 | Struct, routes, read/write/delete handlers |
| gRPC handlers | `weed/server/volume_grpc_*.go` | 3,073 | 40 RPC method implementations |
| Storage engine | `weed/storage/` | 15,271 | Volumes, needles, index, compaction, EC, backend |
| Protobuf definitions | `weed/pb/volume_server.proto` | 759 | Service + message definitions |
| Shared utilities | `weed/security/`, `weed/stats/`, `weed/util/` | ~2,000+ | JWT, TLS, metrics, helpers |
| **Total** | | **~23,000+** | |

## Rust Crate & Dependency Strategy

```
seaweed-volume/
├── Cargo.toml
├── build.rs                          # protobuf codegen
├── proto/
│   ├── volume_server.proto           # copied from Go, adapted
│   └── remote.proto
├── src/
│   ├── main.rs                       # CLI entry point
│   ├── config.rs                     # CLI flags + config
│   ├── server/
│   │   ├── mod.rs
│   │   ├── volume_server.rs          # VolumeServer struct + lifecycle
│   │   ├── http_handlers.rs          # HTTP route dispatch
│   │   ├── http_read.rs              # GET/HEAD handlers
│   │   ├── http_write.rs             # POST/PUT handlers
│   │   ├── http_delete.rs            # DELETE handler
│   │   ├── http_admin.rs             # /status, /healthz, /ui
│   │   ├── grpc_service.rs           # gRPC trait impl dispatch
│   │   ├── grpc_vacuum.rs
│   │   ├── grpc_copy.rs
│   │   ├── grpc_erasure_coding.rs
│   │   ├── grpc_tail.rs
│   │   ├── grpc_admin.rs
│   │   ├── grpc_read_write.rs
│   │   ├── grpc_batch_delete.rs
│   │   ├── grpc_scrub.rs
│   │   ├── grpc_tier.rs
│   │   ├── grpc_remote.rs
│   │   ├── grpc_query.rs
│   │   ├── grpc_state.rs
│   │   └── grpc_client_to_master.rs  # heartbeat
│   ├── storage/
│   │   ├── mod.rs
│   │   ├── store.rs                  # Store (multi-disk manager)
│   │   ├── volume.rs                 # Volume struct + lifecycle
│   │   ├── volume_read.rs
│   │   ├── volume_write.rs
│   │   ├── volume_compact.rs
│   │   ├── volume_info.rs
│   │   ├── needle/
│   │   │   ├── mod.rs
│   │   │   ├── needle.rs             # Needle struct + serialization
│   │   │   ├── needle_read.rs
│   │   │   ├── needle_write.rs
│   │   │   ├── needle_map.rs         # in-memory NeedleMap
│   │   │   ├── needle_value.rs
│   │   │   └── crc.rs
│   │   ├── super_block.rs
│   │   ├── idx/
│   │   │   ├── mod.rs
│   │   │   └── idx.rs                # .idx file format read/write
│   │   ├── needle_map_leveldb.rs
│   │   ├── types.rs                  # NeedleId, Offset, Size, DiskType
│   │   ├── disk_location.rs          # DiskLocation per-directory
│   │   ├── erasure_coding/
│   │   │   ├── mod.rs
│   │   │   ├── ec_volume.rs
│   │   │   ├── ec_shard.rs
│   │   │   ├── ec_encoder.rs         # Reed-Solomon encoding
│   │   │   └── ec_decoder.rs
│   │   └── backend/
│   │       ├── mod.rs
│   │       ├── disk.rs
│   │       └── s3_backend.rs         # tiered storage to S3
│   ├── topology/
│   │   └── volume_layout.rs          # replication placement
│   ├── security/
│   │   ├── mod.rs
│   │   ├── guard.rs                  # whitelist + JWT gate
│   │   ├── jwt.rs
│   │   └── tls.rs
│   ├── stats/
│   │   ├── mod.rs
│   │   └── metrics.rs                # Prometheus counters/gauges
│   └── util/
│       ├── mod.rs
│       ├── grpc.rs
│       ├── http.rs
│       └── file.rs
└── tests/
    ├── integration/
    │   ├── http_read_test.rs
    │   ├── http_write_test.rs
    │   ├── grpc_test.rs
    │   └── storage_test.rs
    └── unit/
        ├── needle_test.rs
        ├── idx_test.rs
        ├── super_block_test.rs
        └── ec_test.rs
```

### Key Rust dependencies

| Purpose | Crate |
|---|---|
| Async runtime | `tokio` |
| gRPC | `tonic` + `prost` |
| HTTP server | `hyper` + `axum` |
| CLI parsing | `clap` (derive) |
| Prometheus metrics | `prometheus` |
| JWT | `jsonwebtoken` |
| TLS | `rustls` + `tokio-rustls` |
| LevelDB | `rusty-leveldb` or `rocksdb` |
| Reed-Solomon EC | `reed-solomon-erasure` |
| Logging | `tracing` + `tracing-subscriber` |
| Config (security.toml) | `toml` + `serde` |
| CRC32 | `crc32fast` |
| Memory-mapped files | `memmap2` |

---

## Phased Execution Plan

### Phase 1: Project Skeleton & Protobuf Codegen

**Goal:** Cargo project compiles, proto codegen works, CLI parses all flags.

**Steps:**

1.1. Create `seaweed-volume/Cargo.toml` with all dependencies listed above.

1.2. Copy `volume_server.proto` and `remote.proto` into `proto/`. Adjust package paths for Rust codegen.

1.3. Create `build.rs` using `tonic-build` to compile the `.proto` files into Rust types.

1.4. Create `src/main.rs` with `clap` derive structs mirroring all ~40 CLI flags from `weed/command/volume.go`:

- `--port` (default 8080)
- `--port.grpc` (default 0 → 10000 + port)
- `--port.public` (default 0 → same as port)
- `--ip` (auto-detect)
- `--id` (default empty → ip:port)
- `--publicUrl`
- `--ip.bind`
- `--master` (default "localhost:9333")
- `--mserver` (deprecated compat)
- `--preStopSeconds` (default 10)
- `--idleTimeout` (default 30)
- `--dataCenter`
- `--rack`
- `--index` [memory|leveldb|leveldbMedium|leveldbLarge]
- `--disk` [hdd|ssd|<tag>]
- `--tags`
- `--dir` (default temp dir)
- `--dir.idx`
- `--max` (default "8")
- `--whiteList`
- `--minFreeSpacePercent` (default "1")
- `--minFreeSpace`
- `--images.fix.orientation` (default false)
- `--readMode` [local|proxy|redirect] (default "proxy")
- `--cpuprofile`
- `--memprofile`
- `--compactionMBps` (default 0)
- `--maintenanceMBps` (default 0)
- `--fileSizeLimitMB` (default 256)
- `--concurrentUploadLimitMB` (default 0)
- `--concurrentDownloadLimitMB` (default 0)
- `--pprof` (default false)
- `--metricsPort` (default 0)
- `--metricsIp`
- `--inflightUploadDataTimeout` (default 60s)
- `--inflightDownloadDataTimeout` (default 60s)
- `--hasSlowRead` (default true)
- `--readBufferSizeMB` (default 4)
- `--index.leveldbTimeout` (default 0)
- `--debug` (default false)
- `--debug.port` (default 6060)

1.5. Implement the same flag validation logic as `startVolumeServer()`:

- Parse comma-separated `--dir`, `--max`, `--minFreeSpace`, `--disk`, `--tags`
- Replicate the single-value-to-all-dirs expansion
- Validate that the value count matches the number of dirs
- `--mserver` backward compat
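The single-value-to-all-dirs expansion in 1.5 can be sketched as a small pure function; the helper name `expand_per_dir` is ours, not from the Go source:

```rust
/// Expand a comma-separated per-directory option: a single value applies to
/// every dir; otherwise the count must match the number of dirs.
/// Hypothetical helper name; the rule follows weed/command/volume.go.
fn expand_per_dir(raw: &str, dir_count: usize) -> Result<Vec<String>, String> {
    let values: Vec<String> = raw.split(',').map(|s| s.trim().to_string()).collect();
    if values.len() == 1 {
        // One value given: repeat it for all directories.
        return Ok(vec![values[0].clone(); dir_count]);
    }
    if values.len() != dir_count {
        return Err(format!(
            "expected 1 or {} values, got {}",
            dir_count,
            values.len()
        ));
    }
    Ok(values)
}
```

The same routine covers `--max`, `--minFreeSpace`, `--disk`, and `--tags`, since they all share the one-or-N contract.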

1.6. **Test:** `cargo build` succeeds. `cargo run -- --help` shows all flags. Proto types are generated.

**Verification:** Run with `--port 8080 --dir /tmp --master localhost:9333` — it should parse without error and print the config.

---

### Phase 2: Core Storage Types & On-Disk Format

**Goal:** Read and write the SeaweedFS needle/volume binary format bit-for-bit compatible with Go.

**Source files to port:**

- `weed/storage/types/needle_types.go` → `src/storage/types.rs`
- `weed/storage/needle/needle.go` → `src/storage/needle/needle.rs`
- `weed/storage/needle/needle_read.go` → `src/storage/needle/needle_read.rs`
- `weed/storage/needle/needle_write.go` (partial) → `src/storage/needle/needle_write.rs`
- `weed/storage/needle/crc.go` → `src/storage/needle/crc.rs`
- `weed/storage/needle/needle_value_map.go` → `src/storage/needle/needle_value.rs`
- `weed/storage/super_block/super_block.go` → `src/storage/super_block.rs`
- `weed/storage/idx/` → `src/storage/idx/`

**Steps:**

2.1. **Fundamental types** (`types.rs`):

- `NeedleId` (u64), `Offset` (u32 or u64 depending on version), `Size` (i32, negative = deleted)
- `Cookie` (u32)
- `DiskType` enum (HDD, SSD, Custom)
- Version constants (Version1=1, Version2=2, Version3=3, CurrentVersion=3)
- Byte serialization matching Go's `binary.BigEndian` encoding

2.2. **SuperBlock** (`super_block.rs`):

- 8-byte header: Version(1) + ReplicaPlacement(1) + TTL(2) + CompactRevision(2) + Reserved(2)
- `ReplicaPlacement` struct with same/diff rack/dc counts
- `TTL` struct with count + unit
- Read/write from the first 8 bytes of the `.dat` file
- Match the exact byte layout from `super_block.go`
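A minimal decoding sketch of the 8-byte header just described; struct and field names are ours, and treating TTL as a count byte plus a unit byte follows the `TTL` struct description above rather than a verified byte dump:

```rust
/// Sketch of decoding the 8-byte super block header
/// (Version, ReplicaPlacement, TTL, CompactRevision, Reserved).
#[derive(Debug, PartialEq)]
struct SuperBlock {
    version: u8,
    replica_placement: u8, // packed replica placement byte
    ttl_count: u8,         // assumed split of the 2 TTL bytes: count + unit
    ttl_unit: u8,
    compact_revision: u16, // big-endian u16 at bytes 4..6; bytes 6..8 reserved
}

fn parse_super_block(h: &[u8; 8]) -> SuperBlock {
    SuperBlock {
        version: h[0],
        replica_placement: h[1],
        ttl_count: h[2],
        ttl_unit: h[3],
        compact_revision: u16::from_be_bytes([h[4], h[5]]),
    }
}
```

The cross-compat fixtures in 2.6 are the authority on the exact layout; this sketch only illustrates the fixed-width, big-endian decoding style.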

2.3. **Needle binary format** (`needle.rs`, `needle_read.rs`):

- Version 2/3 header: Cookie(4) + NeedleId(8) + Size(4)
- Body: Data, Flags, Name, Mime, PairsSize, Pairs, LastModified, TTL, Checksum, AppendAtNs, Padding
- CRC32 checksum (matching Go's `crc32.ChecksumIEEE`)
- Padding to 8-byte alignment
- Read path: read header → compute body length → read body → verify CRC
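The 16-byte header decode and the IEEE CRC32 can be sketched with std only (a table-less bitwise CRC stands in for `crc32fast`, which the real code would use):

```rust
use std::convert::TryInto;

/// Decode the v2/v3 needle header: Cookie(4) + NeedleId(8) + Size(4),
/// all big-endian, per the layout listed above.
fn parse_needle_header(h: &[u8; 16]) -> (u32, u64, i32) {
    let cookie = u32::from_be_bytes(h[0..4].try_into().unwrap());
    let id = u64::from_be_bytes(h[4..12].try_into().unwrap());
    let size = i32::from_be_bytes(h[12..16].try_into().unwrap()); // < 0 = deleted
    (cookie, id, size)
}

/// Bitwise CRC-32 with the reflected IEEE polynomial, matching Go's
/// crc32.ChecksumIEEE (crc32fast computes the same value, just faster).
fn crc32_ieee(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}
```

`crc32_ieee(b"123456789")` must equal the standard check value `0xCBF43926`; that single assertion is enough to confirm the polynomial matches Go's.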

2.4. **Idx file format** (`idx/`):

- Fixed 16-byte records: NeedleId(8) + Offset(4) + Size(4)
- Sequential, append-only file
- Walk/iterate all entries
- No binary search needed — the entire index is loaded into the in-memory needle map
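Walking the fixed-width records above is a straightforward chunked scan; the helper name is ours:

```rust
use std::convert::TryInto;

/// Walk fixed 16-byte .idx records: NeedleId(8) + Offset(4) + Size(4),
/// big-endian. Returns how many complete records were visited.
fn walk_idx(bytes: &[u8], mut visit: impl FnMut(u64, u32, i32)) -> usize {
    let mut count = 0;
    // chunks_exact silently drops a trailing partial record, which mirrors
    // tolerating a truncated tail write.
    for rec in bytes.chunks_exact(16) {
        let id = u64::from_be_bytes(rec[0..8].try_into().unwrap());
        let offset = u32::from_be_bytes(rec[8..12].try_into().unwrap());
        let size = i32::from_be_bytes(rec[12..16].try_into().unwrap());
        visit(id, offset, size);
        count += 1;
    }
    count
}
```

Volume mount (2.5) is just this walk feeding `Set`/`Delete` calls into the in-memory map.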

2.5. **NeedleMap (in-memory)** (`needle_map.rs`):

- `HashMap<NeedleId, NeedleValue>` where `NeedleValue = {Offset, Size}`
- Load from the `.idx` file on volume mount
- Support Get, Set, Delete operations
- Track file count, deleted count, and deleted byte count
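A minimal sketch of the map plus its counters; the exact counter semantics (e.g. how overwrites and repeated deletes are counted) must be matched against the Go needle map, so treat the accounting here as illustrative:

```rust
use std::collections::HashMap;

/// In-memory needle map sketch tracking the counters listed above.
#[derive(Default)]
struct NeedleMap {
    map: HashMap<u64, (u32, i32)>, // NeedleId -> (Offset, Size)
    file_count: u64,
    deleted_count: u64,
    deleted_bytes: u64,
}

impl NeedleMap {
    fn set(&mut self, id: u64, offset: u32, size: i32) {
        if let Some((_, old)) = self.map.insert(id, (offset, size)) {
            // Overwrite: the previous copy in the .dat becomes garbage.
            self.deleted_count += 1;
            self.deleted_bytes += old.max(0) as u64;
        }
        self.file_count += 1;
    }
    fn delete(&mut self, id: u64) {
        if let Some((_, old)) = self.map.remove(&id) {
            self.deleted_count += 1;
            self.deleted_bytes += old.max(0) as u64;
        }
    }
    fn get(&self, id: u64) -> Option<(u32, i32)> {
        self.map.get(&id).copied()
    }
}
```

The `deleted_bytes` counter is what `CheckCompact()` in 3.5 divides by the live size to get the garbage ratio.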

2.6. **Tests:**

- Unit test: write a needle to bytes → read it back → verify fields match
- Unit test: write/read SuperBlock round-trip
- Unit test: write/read idx entries round-trip
- **Cross-compat test:** use the Go volume server to create a small volume with known data, then read it from Rust and verify all needles decode correctly. (Keep test fixture `.dat`/`.idx` files in `tests/fixtures/`.)

---

### Phase 3: Volume Struct & Lifecycle

**Goal:** Mount, read from, write to, and unmount a volume.

**Source files to port:**

- `weed/storage/volume.go` → `src/storage/volume.rs`
- `weed/storage/volume_read.go` → `src/storage/volume_read.rs`
- `weed/storage/volume_write.go` → `src/storage/volume_write.rs`
- `weed/storage/volume_loading.go`
- `weed/storage/volume_vacuum.go` → `src/storage/volume_compact.rs`
- `weed/storage/volume_info/volume_info.go` → `src/storage/volume_info.rs`
- `weed/storage/volume_super_block.go`

**Steps:**

3.1. **Volume struct** (`volume.rs`):

- Fields: Id, dir, dataFile, nm (NeedleMap), SuperBlock, readOnly, lastModifiedTs, lastCompactIndexOffset, lastCompactRevision
- `noWriteOrDelete` / `noWriteCanDelete` / `readOnly` state flags
- File handles for the `.dat` file (read + append)
- Lock strategy: `RwLock` for concurrent reads, exclusive writes

3.2. **Volume loading** — exact logic from `volume_loading.go`:

- Open the `.dat` file, read the SuperBlock from the first 8 bytes
- Load the `.idx` file into the NeedleMap
- Handle the `.vif` (VolumeInfo) JSON sidecar file
- Set the volume state based on SuperBlock + VolumeInfo

3.3. **Volume read** (`volume_read.rs`) — from `volume_read.go`:

- `ReadNeedle(needleId, cookie)`: look up in NeedleMap → seek in .dat → read needle bytes → verify cookie + CRC → return data
- Handle deleted needles (Size < 0)
- `ReadNeedleBlob(offset, size)`: raw blob read
- `ReadNeedleMeta(needleId, offset, size)`: read metadata only

3.4. **Volume write** (`volume_write.rs`) — from `volume_write.go`:

- `WriteNeedle(needle)`: serialize the needle → append to .dat → update .idx → update NeedleMap
- `DeleteNeedle(needleId)`: mark as deleted in NeedleMap + append a tombstone to .idx
- File size limit check
- Concurrent write serialization (mutex on the write path)

3.5. **Volume compaction** (`volume_compact.rs`) — from `volume_vacuum.go`:

- `CheckCompact()`: compute the garbage ratio
- `Compact()`: create a new .dat/.idx pair, copy only live needles, update the compact revision
- `CommitCompact()`: rename the compacted files over the originals
- `CleanupCompact()`: remove temp files
- Throttle by `compactionBytePerSecond`

3.6. **Volume info** (`volume_info.rs`):

- Read/write the `.vif` JSON sidecar
- VolumeInfo protobuf struct mapping
- Remote file references for tiered storage

3.7. **Tests:**

- Mount a volume, write 100 needles, read them all back, verify content
- Delete 50 needles, verify they return "deleted"
- Compact, verify only 50 remain, verify content
- Read Go-created volume fixtures

---

### Phase 4: Store (Multi-Volume, Multi-Disk Manager)

**Goal:** Manage multiple volumes across multiple disk directories.

**Source files to port:**

- `weed/storage/store.go` → `src/storage/store.rs`
- `weed/storage/disk_location.go` → `src/storage/disk_location.rs`
- `weed/storage/store_ec.go`
- `weed/storage/store_state.go`

**Steps:**

4.1. **DiskLocation** (`disk_location.rs`):

- Directory path, max volume count, min free space, disk type, tags
- Load all volumes from the directory on startup
- Track free space, check writability

4.2. **Store** (`store.rs`):

- Vector of `DiskLocation`s
- `GetVolume(volumeId)` → lookup across all locations
- `HasVolume(volumeId)` check
- `AllocateVolume(...)` — create a new volume in an appropriate location
- `DeleteVolume(...)`, `MountVolume(...)`, `UnmountVolume(...)`
- `DeleteCollection(collection)` — delete all volumes of a collection
- Collect volume status for the heartbeat
- `SetStopping()`, `Close()`
- Persistent state (maintenance mode) via `store_state.go`

4.3. **Store state** — `VolumeServerState` protobuf with a maintenance flag, persisted to disk.

4.4. **Tests:**

- Create a store with 2 dirs, allocate volumes in each, verify load balancing
- Mount/unmount/delete lifecycle
- State persistence across restart

---

### Phase 5: Erasure Coding

**Goal:** Full EC shard encode/decode/read/write/rebuild.

**Source files to port:**

- `weed/storage/erasure_coding/` (3,599 lines)

**Steps:**

5.1. **EC volume + shard structs** — `EcVolume`, `EcShard` with file handles for the `.ec00`–`.ec13` shard files plus the `.ecx` index and `.ecj` journal.

5.2. **EC encoder** — Reed-Solomon 10+4 (configurable) encoding using the `reed-solomon-erasure` crate:

- `VolumeEcShardsGenerate`: read .dat → split into data shards → compute parity → write .ec00–.ec13 + .ecx

5.3. **EC decoder/reader** — reconstruct data from any 10 of the 14 shards:

- `EcShardRead`: read a range from a specific shard
- Locate a needle in an EC volume via the .ecx index
- Handle cross-shard needle reads

5.4. **EC shard operations:**

- Copy, delete, mount, unmount shards
- `VolumeEcShardsRebuild`: rebuild missing shards from the remaining ones
- `VolumeEcShardsToVolume`: reconstruct a .dat from EC shards
- `VolumeEcBlobDelete`: mark deleted in the EC journal
- `VolumeEcShardsInfo`: report shard metadata

5.5. **Tests:**

- Encode a volume → verify 14 shards created
- Delete 4 shards → rebuild → verify data intact
- Read individual needles from an EC volume
- Cross-compat with Go-generated EC shards

---

### Phase 6: Backend / Tiered Storage

**Goal:** Support tiered storage to remote backends (S3, etc.).

**Source files to port:**

- `weed/storage/backend/` (1,850 lines)

**Steps:**

6.1. **Backend trait** — abstract `BackendStorage` trait with `ReadAt`, `WriteAt`, `Truncate`, `Close`, `Name`.

6.2. **Disk backend** — default local disk implementation.

6.3. **S3 backend** — upload the .dat to S3, read ranges via S3 range requests.

6.4. **Tier move operations:**

- `VolumeTierMoveDatToRemote`: upload the .dat to the remote, optionally delete the local copy
- `VolumeTierMoveDatFromRemote`: download the .dat from the remote

6.5. **Tests:**

- Disk backend read/write round-trip
- S3 backend with mock/localstack

---

### Phase 7: Security Layer

**Goal:** JWT authentication, whitelist guard, TLS configuration.

**Source files to port:**

- `weed/security/guard.go` → `src/security/guard.rs`
- `weed/security/jwt.go` → `src/security/jwt.rs`
- `weed/security/tls.go` → `src/security/tls.rs`

**Steps:**

7.1. **Guard** (`guard.rs`):

- Whitelist IP check (exact match on `r.RemoteAddr`)
- Wrap handlers with whitelist enforcement
- `UpdateWhiteList()` for live reload

7.2. **JWT** (`jwt.rs`):

- `SeaweedFileIdClaims` with a `fid` field
- Sign with HMAC-SHA256
- Verify + decode with an expiry check
- Separate signing keys for read vs. write
- `GetJwt(request)` — extract from the `Authorization: Bearer` header or the `jwt` query param

7.3. **TLS** (`tls.rs`):

- Load the server TLS cert/key for gRPC and HTTPS
- Load the client TLS for mutual TLS
- Read from the `security.toml` config (same format as Go's viper config)

7.4. **Tests:**

- JWT sign → verify round-trip
- JWT with wrong key → reject
- JWT with expired token → reject
- JWT fid mismatch → reject
- Whitelist allow/deny

---

### Phase 8: Prometheus Metrics

**Goal:** Export the same metric names as Go for dashboard compatibility.

**Source files to port:**

- `weed/stats/metrics.go` (volume server counters/gauges/histograms)

**Steps:**

8.1. Define all Prometheus metrics matching the Go names:

- `VolumeServerRequestCounter` (labels: method, status)
- `VolumeServerRequestHistogram` (labels: method)
- `VolumeServerInFlightRequestsGauge` (labels: method)
- `VolumeServerInFlightUploadSize`
- `VolumeServerInFlightDownloadSize`
- `VolumeServerConcurrentUploadLimit`
- `VolumeServerConcurrentDownloadLimit`
- `VolumeServerHandlerCounter` (labels: type — UploadLimitCond, DownloadLimitCond)
- Read/Write/Delete request counters

8.2. Metrics HTTP endpoint on `--metricsPort`.

8.3. Optional push-based metrics loop (`LoopPushingMetric`).

8.4. **Test:** Verify metric names and labels match the Go output.

---

### Phase 9: HTTP Server & Handlers

**Goal:** All HTTP endpoints with exactly the same behavior as Go.

**Source files to port:**

- `weed/server/volume_server.go` → `src/server/volume_server.rs`
- `weed/server/volume_server_handlers.go` → `src/server/http_handlers.rs`
- `weed/server/volume_server_handlers_read.go` → `src/server/http_read.rs`
- `weed/server/volume_server_handlers_write.go` → `src/server/http_write.rs`
- `weed/server/volume_server_handlers_admin.go` → `src/server/http_admin.rs`
- `weed/server/volume_server_handlers_helper.go` (URL parsing, proxy, JSON responses)
- `weed/server/volume_server_handlers_ui.go` → `src/server/http_admin.rs`

**Steps:**

9.1. **URL path parsing** — from `handlers_helper.go`:

- Parse the `/<vid>,<fid>` and `/<vid>/<fid>` patterns
- Extract volume ID, file ID, filename, and extension
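The comma form of the path parse can be sketched with std string splitting; the `/<vid>/<fid>/<filename>` variant and cookie decoding are omitted, and the function name is ours:

```rust
/// Parse "/<vid>,<fid>[.<ext>]" into (volume id, file id, extension).
/// Sketch only: the slash-separated form from handlers_helper.go is not shown.
fn parse_path(path: &str) -> Option<(u32, &str, &str)> {
    let trimmed = path.trim_start_matches('/');
    let (vid_str, rest) = trimmed.split_once(',')?;
    let vid: u32 = vid_str.parse().ok()?;
    // Split an optional extension off the file id, e.g. "01637037d6.jpg".
    let (fid, ext) = match rest.rsplit_once('.') {
        Some((f, e)) => (f, e),
        None => (rest, ""),
    };
    Some((vid, fid, ext))
}
```

Returning borrowed `&str` slices keeps the hot read path allocation-free.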

9.2. **Route dispatch** — from `privateStoreHandler` and `publicReadOnlyHandler`:

- `GET /` → `GetOrHeadHandler`
- `HEAD /` → `GetOrHeadHandler`
- `POST /` → `PostHandler` (whitelist gated)
- `PUT /` → `PostHandler` (whitelist gated)
- `DELETE /` → `DeleteHandler` (whitelist gated)
- `OPTIONS /` → CORS preflight
- `GET /status` → JSON status
- `GET /healthz` → health check
- `GET /ui/index.html` → HTML UI page
- Static resources (CSS/JS for the UI)

9.3. **GET/HEAD handler** (`http_read.rs`) — from `handlers_read.go` (468 lines):

- JWT read authorization check
- Look up the needle by volume ID + needle ID + cookie
- ETag / If-None-Match / If-Modified-Since conditional responses
- Content-Type from the stored MIME type or the filename extension
- Content-Disposition header
- Content-Encoding (gzip/zstd stored data)
- Range request support (HTTP 206 Partial Content)
- JPEG orientation fix (if configured)
- Proxy to a replica on local miss (readMode=proxy)
- Redirect to a replica (readMode=redirect)
- Download tracking (in-flight size accounting)
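The 206 path above needs a `Range` header parser; a single-range sketch (the suffix form `bytes=-N` and multi-range requests from RFC 7233 are left out, and the helper name is ours):

```rust
/// Parse a single-range "bytes=M-N" or "bytes=M-" header against the file
/// size, returning an inclusive (start, end) pair, clamped to the file.
fn parse_range(header: &str, total: u64) -> Option<(u64, u64)> {
    let spec = header.strip_prefix("bytes=")?;
    let (start_s, end_s) = spec.split_once('-')?;
    let start: u64 = start_s.parse().ok()?;
    let last = total.checked_sub(1)?; // empty file: no satisfiable range
    let end = if end_s.is_empty() {
        last // open-ended "bytes=M-" reads to EOF
    } else {
        end_s.parse::<u64>().ok()?.min(last)
    };
    if start > end {
        return None; // unsatisfiable → HTTP 416 in the handler
    }
    Some((start, end))
}
```

The handler would answer `206` with `Content-Range: bytes start-end/total` on `Some`, and `416 Range Not Satisfiable` on `None`.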
|
||||
|
||||
9.4. **POST/PUT handler** (`http_write.rs`) — from `handlers_write.go` (170 lines):
|
||||
- JWT write authorization check
|
||||
- Multipart form parsing
|
||||
- Extract file data, filename, content type, TTL, last-modified
|
||||
- Optional gzip/zstd compression
|
||||
- Write needle to volume
|
||||
- Replicate to peers (same logic as Go's `DistributedOperation`)
|
||||
- Return JSON: {name, size, eTag, error}
|
||||
|
||||
9.5. **DELETE handler** — already in handlers.go:
|
||||
- JWT authorization
|
||||
- Delete from local volume
|
||||
- Replicate delete to peers
|
||||
- Return JSON result
|
||||
|
||||
9.6. **Admin handlers** (`http_admin.rs`):
|
||||
- `/status` → JSON with volumes, version, disk status
|
||||
- `/healthz` → 200 OK if serving
|
||||
- `/ui/index.html` → HTML dashboard
|
||||
|
||||
9.7. **Concurrency limiting** — from `handlers.go`:
|
||||
- Upload concurrency limit with `sync::Condvar` + timeout
|
||||
- Download concurrency limit with proxy fallback to replicas
|
||||
- HTTP 429 on timeout, 499 on client cancel
|
||||
- Replication traffic bypasses upload limits
|
||||
|
||||
9.8. **Public port** — if configured, separate listener with read-only routes (GET/HEAD/OPTIONS only).
|
||||
|
||||
9.9. **Request ID middleware** — generate unique request ID per request.
|
||||
|
||||
9.10. **Tests:**
|
||||
- Integration: start server → upload file via POST → GET it back → verify content
|
||||
- Integration: upload → DELETE → GET returns 404
|
||||
- Integration: conditional GET with ETag → 304
|
||||
- Integration: range request → 206 with correct bytes
|
||||
- Integration: exceed upload limit → 429
|
||||
- Integration: whitelist enforcement
|
||||
- Integration: JWT enforcement
|
||||
|
||||
---
|
||||
|
||||
### Phase 10: gRPC Service Implementation
|
||||
**Goal:** All 40 gRPC methods with exact logic.
|
||||
|
||||
**Source files to port:**
|
||||
- `weed/server/volume_grpc_admin.go` (380 lines)
|
||||
- `weed/server/volume_grpc_vacuum.go` (124 lines)
|
||||
- `weed/server/volume_grpc_copy.go` (636 lines)
|
||||
- `weed/server/volume_grpc_copy_incremental.go` (66 lines)
|
||||
- `weed/server/volume_grpc_read_write.go` (74 lines)
|
||||
- `weed/server/volume_grpc_batch_delete.go` (124 lines)
|
||||
- `weed/server/volume_grpc_tail.go` (140 lines)
|
||||
- `weed/server/volume_grpc_erasure_coding.go` (619 lines)
|
||||
- `weed/server/volume_grpc_scrub.go` (121 lines)
|
||||
- `weed/server/volume_grpc_tier_upload.go` (98 lines)
|
||||
- `weed/server/volume_grpc_tier_download.go` (85 lines)
|
||||
- `weed/server/volume_grpc_remote.go` (95 lines)
|
||||
- `weed/server/volume_grpc_query.go` (69 lines)
|
||||
- `weed/server/volume_grpc_state.go` (26 lines)
|
||||
- `weed/server/volume_grpc_read_all.go` (35 lines)
|
||||
- `weed/server/volume_grpc_client_to_master.go` (325 lines)
|
||||
|
||||
**Steps (grouped by functional area):**
|
||||
|
||||
10.1. **Implement `tonic::Service` for `VolumeServer`** — the generated trait from proto.
|
||||
|
||||
10.2. **Admin RPCs** (`grpc_admin.rs`):
|
||||
- `AllocateVolume` — create volume on appropriate disk location
|
||||
- `VolumeMount` / `VolumeUnmount` / `VolumeDelete`
|
||||
- `VolumeMarkReadonly` / `VolumeMarkWritable`
|
||||
- `VolumeConfigure` — change replication
|
||||
- `VolumeStatus` — return read-only, size, file counts
|
||||
- `VolumeServerStatus` — disk statuses, memory, version, DC, rack
|
||||
- `VolumeServerLeave` — deregister from master
|
||||
- `DeleteCollection`
|
||||
- `VolumeNeedleStatus` — get needle metadata by ID
|
||||
- `Ping` — latency measurement
|
||||
- `GetState` / `SetState` — maintenance mode
|
||||
|
||||
10.3. **Vacuum RPCs** (`grpc_vacuum.rs`):
|
||||
- `VacuumVolumeCheck` — return garbage ratio
|
||||
- `VacuumVolumeCompact` — stream progress (streaming response)
|
||||
- `VacuumVolumeCommit` — finalize compaction
|
||||
- `VacuumVolumeCleanup` — remove temp files
|
||||
|
||||
10.4. **Copy RPCs** (`grpc_copy.rs`):
|
||||
- `VolumeCopy` — stream .dat/.idx from source to create local copy
|
||||
- `VolumeSyncStatus` — return sync metadata
|
||||
- `VolumeIncrementalCopy` — stream .dat delta since timestamp (streaming)
|
||||
- `CopyFile` — generic file copy by extension (streaming)
|
||||
- `ReceiveFile` — receive streamed file (client streaming)
|
||||
- `ReadVolumeFileStatus` — return file timestamps and sizes
|
||||
|
||||
10.5. **Read/Write RPCs** (`grpc_read_write.rs`):
|
||||
- `ReadNeedleBlob` — raw needle blob read
|
||||
- `ReadNeedleMeta` — needle metadata
|
||||
- `WriteNeedleBlob` — raw needle blob write
|
||||
- `ReadAllNeedles` — stream all needles from volume(s) (streaming)
|
||||
|
||||
10.6. **Batch delete** (`grpc_batch_delete.rs`):
|
||||
- `BatchDelete` — delete multiple file IDs, return per-ID results
|
||||
|
||||
10.7. **Tail RPCs** (`grpc_tail.rs`):
|
||||
- `VolumeTailSender` — stream new needles since timestamp (streaming)
|
||||
- `VolumeTailReceiver` — connect to another volume server and tail its changes
|
||||
|
||||
10.8. **Erasure coding RPCs** (`grpc_erasure_coding.rs`):
|
||||
- `VolumeEcShardsGenerate` — generate EC shards from volume
|
||||
- `VolumeEcShardsRebuild` — rebuild missing shards
|
||||
- `VolumeEcShardsCopy` — copy shards from another server
|
||||
- `VolumeEcShardsDelete` — delete EC shards
|
||||
- `VolumeEcShardsMount` / `VolumeEcShardsUnmount`
|
||||
- `VolumeEcShardRead` — read from EC shard (streaming)
|
||||
- `VolumeEcBlobDelete` — mark blob deleted in EC volume
|
||||
- `VolumeEcShardsToVolume` — reconstruct volume from EC shards
|
||||
- `VolumeEcShardsInfo` — return shard metadata
|
||||
|
||||
10.9. **Scrub RPCs** (`grpc_scrub.rs`):
|
||||
- `ScrubVolume` — integrity check volumes (INDEX / FULL / LOCAL modes)
|
||||
- `ScrubEcVolume` — integrity check EC volumes
|
||||
|
||||
10.10. **Tier RPCs** (`grpc_tier.rs`):
|
||||
- `VolumeTierMoveDatToRemote` — upload to remote backend (streaming progress)
|
||||
- `VolumeTierMoveDatFromRemote` — download from remote (streaming progress)
|
||||
|
||||
10.11. **Remote storage** (`grpc_remote.rs`):
|
||||
- `FetchAndWriteNeedle` — fetch from remote storage, write locally, replicate
|
||||
|
||||
10.12. **Query** (`grpc_query.rs`):
|
||||
- `Query` — experimental CSV/JSON/Parquet select on stored data (streaming)
|
||||
|
||||
10.13. **Master heartbeat** (`grpc_client_to_master.rs`):
|
||||
- `heartbeat()` background task — periodic gRPC stream to master
|
||||
- Send: volume info, EC shard info, disk stats, has-no-space flags, deleted volumes
|
||||
- Receive: volume size limit, leader address, metrics config
|
||||
- Reconnect on failure with backoff
|
||||
- `StopHeartbeat()` for graceful shutdown
|
||||
|
||||
10.14. **Tests:**
|
||||
- Integration test per RPC: call via tonic client → verify response
|
||||
- Streaming RPCs: verify all chunks received
|
||||
- Error cases: invalid volume ID, non-existent volume, etc.
|
||||
- Heartbeat: mock master gRPC server, verify registration
|
||||
|
||||
---
|
||||
|
||||
### Phase 11: Startup, Lifecycle & Graceful Shutdown
|
||||
**Goal:** Full server startup matching Go's `runVolume()` and `startVolumeServer()`.
|
||||
|
||||
**Steps:**
|
||||
|
||||
11.1. **Startup sequence** (match `volume.go` exactly):
|
||||
1. Load security configuration from `security.toml`
|
||||
2. Start metrics server on metrics port
|
||||
3. Parse folder/max/minFreeSpace/diskType/tags
|
||||
4. Validate all directory writable
|
||||
5. Resolve IP, bind IP, public URL, gRPC port
|
||||
6. Create `VolumeServer` struct
|
||||
7. Check with master (initial handshake)
|
||||
8. Create `Store` (loads all existing volumes from disk)
|
||||
9. Create security `Guard`
|
||||
10. Register HTTP routes on admin mux
|
||||
11. Optionally register public mux
|
||||
12. Start gRPC server on gRPC port
|
||||
13. Start public HTTP server (if separated)
|
||||
14. Start cluster HTTP server (with optional TLS)
|
||||
15. Start heartbeat background task
|
||||
16. Start metrics push loop
|
||||
17. Register SIGHUP handler for config reload + new volume loading
|
||||
|
||||
11.2. **Graceful shutdown** (match Go exactly):
|
||||
1. On SIGINT/SIGTERM:
|
||||
2. Stop heartbeat (notify master we're leaving)
|
||||
3. Wait `preStopSeconds`
|
||||
4. Stop public HTTP server
|
||||
5. Stop cluster HTTP server
|
||||
6. Graceful stop gRPC server
|
||||
7. `volumeServer.Shutdown()` → `store.Close()` (flush all volumes)
|
||||
|
||||
11.3. **Reload** (SIGHUP):
|
||||
- Reload security config
|
||||
- Update whitelist
|
||||
- Load newly appeared volumes from disk
|
||||
|
||||
11.4. **Tests:**
|
||||
- Start server → send SIGTERM → verify clean shutdown
|
||||
- Start server → SIGHUP → verify config reloaded
|
||||
|
||||
---
|
||||
|
||||
### Phase 12: Integration & Cross-Compatibility Testing
|
||||
**Goal:** Rust volume server is a drop-in replacement for Go volume server.
|
||||
|
||||
**Steps:**
|
||||
|
||||
12.1. **Binary compatibility tests:**
|
||||
- Create volumes with Go volume server
|
||||
- Start Rust volume server on same data directory
|
||||
- Read all data → verify identical
|
||||
- Write new data with Rust → read with Go → verify
|
||||
|
||||
12.2. **API compatibility tests:**
|
||||
- Run same HTTP requests against both Go and Rust servers
|
||||
- Compare response bodies, headers, status codes
|
||||
- Test all gRPC RPCs against both
|
||||
|
||||
12.3. **Master interop test:**
|
||||
- Start Go master server
|
||||
- Register Rust volume server
|
||||
- Verify heartbeat works
|
||||
- Verify volume assignment works
|
||||
- Upload via filer → stored on Rust volume server → read back
|
||||
|
||||
12.4. **Performance benchmarks:**
|
||||
- Throughput: sequential writes, sequential reads
|
||||
- Latency: p50/p99 for read/write
|
||||
- Concurrency: parallel reads/writes
|
||||
- Compare Rust vs Go numbers
|
||||
|
||||
12.5. **Edge cases:**
|
||||
- Volume at max size
|
||||
- Disk full handling
|
||||
- Corrupt .dat file recovery
|
||||
- Network partition during replication
|
||||
- EC shard loss + rebuild
|
||||
|
||||
---
|
||||
|
||||
## Execution Order & Dependencies
|
||||
|
||||
```
|
||||
Phase 1 (Skeleton + CLI) ← no deps, start here
|
||||
↓
|
||||
Phase 2 (Storage types) ← needs Phase 1 (types used everywhere)
|
||||
↓
|
||||
Phase 3 (Volume struct) ← needs Phase 2
|
||||
↓
|
||||
Phase 4 (Store manager) ← needs Phase 3
|
||||
↓
|
||||
Phase 7 (Security) ← independent, can parallel with 3-4
|
||||
Phase 8 (Metrics) ← independent, can parallel with 3-4
|
||||
↓
|
||||
Phase 9 (HTTP server) ← needs Phase 4 + 7 + 8
|
||||
Phase 10 (gRPC server) ← needs Phase 4 + 7 + 8
|
||||
↓
|
||||
Phase 5 (Erasure coding) ← needs Phase 4, wire into Phase 10
|
||||
Phase 6 (Tiered storage) ← needs Phase 4, wire into Phase 10
|
||||
↓
|
||||
Phase 11 (Startup + shutdown) ← needs Phase 9 + 10
|
||||
↓
|
||||
Phase 12 (Integration tests) ← needs all above
|
||||
```
|
||||
|

## Estimated Scope

| Phase | Estimated Rust Lines | Complexity |
|---|---|---|
| 1. Skeleton + CLI | ~400 | Low |
| 2. Storage types | ~2,000 | High (binary compat critical) |
| 3. Volume struct | ~2,500 | High |
| 4. Store manager | ~1,000 | Medium |
| 5. Erasure coding | ~3,000 | High |
| 6. Tiered storage | ~1,500 | Medium |
| 7. Security | ~500 | Medium |
| 8. Metrics | ~300 | Low |
| 9. HTTP server | ~2,000 | High |
| 10. gRPC server | ~3,500 | High |
| 11. Startup/shutdown | ~500 | Medium |
| 12. Integration tests | ~2,000 | Medium |
| **Total** | **~19,000** | |

## Critical Invariants to Preserve

1. **Binary format compatibility** — Rust must read/write `.dat`, `.idx`, `.vif`, `.ecX` files identically to Go. A single byte of drift means data loss.
2. **gRPC wire compatibility** — Same proto, same field semantics. The Go master must talk to the Rust volume server seamlessly.
3. **HTTP API compatibility** — Same URL patterns, same JSON response shapes, same headers, same status codes.
4. **Replication protocol** — Write replication between Go and Rust volume servers must work bidirectionally.
5. **Heartbeat protocol** — The Rust volume server must register with the Go master and maintain its heartbeat.
6. **CRC32 algorithm** — Must match Go's needle checksum exactly; per the Cargo.toml in this PR, that is the Castagnoli polynomial (CRC32-C, Go's `crc32.Castagnoli`), not the IEEE polynomial used by `crc32.ChecksumIEEE`.
7. **JWT compatibility** — Tokens signed by the Go filer/master must be verifiable by the Rust volume server, and vice versa.

@@ -16,9 +16,31 @@ RUN cd /go/src/github.com/seaweedfs/seaweedfs/weed \
  && export LDFLAGS="-X github.com/seaweedfs/seaweedfs/weed/util/version.COMMIT=$(git rev-parse --short HEAD)" \
  && CGO_ENABLED=0 go install -tags "$TAGS" -ldflags "-extldflags -static ${LDFLAGS}"

# Rust volume server builder (amd64/arm64 only)
FROM rust:1-alpine AS rust_builder
ARG TARGETARCH
RUN apk add musl-dev protobuf-dev git
COPY --from=builder /go/src/github.com/seaweedfs/seaweedfs/seaweed-volume /build/seaweed-volume
COPY --from=builder /go/src/github.com/seaweedfs/seaweedfs/proto /build/proto
WORKDIR /build/seaweed-volume
ARG TAGS
RUN if [ "$TARGETARCH" = "amd64" ] || [ "$TARGETARCH" = "arm64" ]; then \
      if [ "$TAGS" = "5BytesOffset" ]; then \
        cargo build --release; \
      else \
        cargo build --release --no-default-features; \
      fi && \
      cp target/release/weed-volume /weed-volume; \
    else \
      echo "Skipping Rust build for $TARGETARCH (unsupported)" && \
      touch /weed-volume; \
    fi

FROM alpine AS final
LABEL author="Chris Lu"
COPY --from=builder /go/bin/weed /usr/bin/
# Copy Rust volume server binary (real binary on amd64/arm64, empty placeholder on other platforms)
COPY --from=rust_builder /weed-volume /usr/bin/weed-volume
RUN mkdir -p /etc/seaweedfs
COPY --from=builder /go/src/github.com/seaweedfs/seaweedfs/docker/filer.toml /etc/seaweedfs/filer.toml
COPY --from=builder /go/src/github.com/seaweedfs/seaweedfs/docker/entrypoint.sh /entrypoint.sh

@@ -72,6 +72,20 @@ case "$1" in
    exec /usr/bin/weed -logtostderr=true volume $ARGS $@
    ;;

  'volume-rust')
    ARGS="-dir /data -max 0"
    if isArgPassed "-max" "$@"; then
      ARGS="-dir /data"
    fi
    shift
    if [ ! -s /usr/bin/weed-volume ]; then
      echo "Error: Rust volume server is not available on this platform ($(uname -m))." >&2
      echo "Use 'volume' for the Go volume server instead." >&2
      exit 1
    fi
    exec /usr/bin/weed-volume $ARGS $@
    ;;

  'server')
    ARGS="-dir=/data -volume.max=0 -master.volumeSizeLimitMB=1024"
    if isArgPassed "-volume.max" "$@"; then

install.sh — new executable file, 275 lines
@@ -0,0 +1,275 @@
#!/bin/bash
#
# SeaweedFS Installer
# Downloads Go and/or Rust binaries from GitHub releases.
#
# Usage:
#   curl -fsSL https://raw.githubusercontent.com/seaweedfs/seaweedfs/master/install.sh | bash
#   curl -fsSL ... | bash -s -- --component volume-rust --large-disk
#   curl -fsSL ... | bash -s -- --version v3.93 --dir /usr/local/bin
#
# Options:
#   --component COMP   Which binary to install: weed, volume-rust, all (default: weed)
#   --version VER      Release version tag (default: latest)
#   --large-disk       Use large disk variant (5-byte offset, 8TB max volume)
#   --dir DIR          Installation directory (default: /usr/local/bin)
#   --help             Show this help message

set -euo pipefail

REPO="seaweedfs/seaweedfs"
COMPONENT="weed"
VERSION=""
LARGE_DISK=false
INSTALL_DIR="/usr/local/bin"

# Colors (if terminal supports them)
if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[0;33m'
  BLUE='\033[0;34m'
  NC='\033[0m'
else
  RED='' GREEN='' YELLOW='' BLUE='' NC=''
fi

info()  { echo -e "${BLUE}[info]${NC} $*"; }
ok()    { echo -e "${GREEN}[ok]${NC} $*"; }
warn()  { echo -e "${YELLOW}[warn]${NC} $*"; }
error() { echo -e "${RED}[error]${NC} $*" >&2; exit 1; }

usage() {
  sed -n '/^# Usage:/,/^$/p' "$0" | sed 's/^# \?//'
  exit 0
}

# Parse arguments
while [ $# -gt 0 ]; do
  case "$1" in
    --component)  COMPONENT="$2"; shift 2 ;;
    --version)    VERSION="$2"; shift 2 ;;
    --large-disk) LARGE_DISK=true; shift ;;
    --dir)        INSTALL_DIR="$2"; shift 2 ;;
    --help|-h)    usage ;;
    *) error "Unknown option: $1. Use --help for usage." ;;
  esac
done

# Detect OS and architecture
detect_platform() {
  local os arch

  case "$(uname -s)" in
    Linux*)               os="linux" ;;
    Darwin*)              os="darwin" ;;
    MINGW*|MSYS*|CYGWIN*) os="windows" ;;
    FreeBSD*)             os="freebsd" ;;
    *) error "Unsupported OS: $(uname -s)" ;;
  esac

  case "$(uname -m)" in
    x86_64|amd64)  arch="amd64" ;;
    aarch64|arm64) arch="arm64" ;;
    armv7l|armv6l) arch="arm" ;;
    *) error "Unsupported architecture: $(uname -m)" ;;
  esac

  echo "${os}" "${arch}"
}

# Get latest release tag from GitHub API
get_latest_version() {
  local url="https://api.github.com/repos/${REPO}/releases/latest"
  if command -v curl &>/dev/null; then
    curl -fsSL "$url" | grep '"tag_name"' | head -1 | sed 's/.*"tag_name": *"\([^"]*\)".*/\1/'
  elif command -v wget &>/dev/null; then
    wget -qO- "$url" | grep '"tag_name"' | head -1 | sed 's/.*"tag_name": *"\([^"]*\)".*/\1/'
  else
    error "Neither curl nor wget found. Please install one."
  fi
}

# Download a file
download() {
  local url="$1" dest="$2"
  info "Downloading ${url}"
  if command -v curl &>/dev/null; then
    curl -fsSL -o "$dest" "$url"
  elif command -v wget &>/dev/null; then
    wget -qO "$dest" "$url"
  else
    error "Neither curl nor wget found. Please install one."
  fi
}

# Build Go weed binary asset name
go_asset_name() {
  local os="$1" arch="$2"
  local suffix="${os}_${arch}"
  if [ "$LARGE_DISK" = true ]; then
    suffix="${suffix}_large_disk"
  fi
  echo "${suffix}.tar.gz"
}

# Build Rust volume server asset name
rust_asset_name() {
  local os="$1" arch="$2"
  local prefix="weed-volume"
  if [ "$LARGE_DISK" = true ]; then
    prefix="weed-volume_large_disk"
  fi
  local suffix="${os}_${arch}"
  if [ "$os" = "windows" ]; then
    echo "${prefix}_${suffix}.zip"
  else
    echo "${prefix}_${suffix}.tar.gz"
  fi
}

# Install a single component
install_component() {
  local component="$1" os="$2" arch="$3"
  local asset_name download_url tmpdir

  tmpdir="$(mktemp -d)"
  trap "rm -rf '$tmpdir'" EXIT

  case "$component" in
    weed)
      asset_name="$(go_asset_name "$os" "$arch")"
      download_url="https://github.com/${REPO}/releases/download/${VERSION}/${asset_name}"
      download "$download_url" "${tmpdir}/${asset_name}"

      info "Extracting ${asset_name}..."
      tar xzf "${tmpdir}/${asset_name}" -C "$tmpdir"

      # The Go release action puts the binary inside a directory
      local weed_bin
      weed_bin="$(find "$tmpdir" -name 'weed' -type f | head -1)"
      if [ -z "$weed_bin" ]; then
        weed_bin="$(find "$tmpdir" -name 'weed.exe' -type f | head -1)"
      fi
      if [ -z "$weed_bin" ]; then
        error "Could not find weed binary in archive"
      fi

      chmod +x "$weed_bin"
      install_binary "$weed_bin" "weed"
      ok "Installed weed to ${INSTALL_DIR}/weed"
      ;;

    volume-rust)
      # Check platform support for Rust volume server
      case "$os" in
        linux|darwin|windows) ;;
        *) error "Rust volume server is not available for ${os}. Supported: linux, darwin, windows" ;;
      esac
      case "$arch" in
        amd64|arm64) ;;
        *) error "Rust volume server is not available for ${arch}. Supported: amd64, arm64" ;;
      esac

      asset_name="$(rust_asset_name "$os" "$arch")"
      download_url="https://github.com/${REPO}/releases/download/${VERSION}/${asset_name}"
      download "$download_url" "${tmpdir}/${asset_name}"

      info "Extracting ${asset_name}..."
      if [ "$os" = "windows" ]; then
        unzip -q "${tmpdir}/${asset_name}" -d "$tmpdir"
      else
        tar xzf "${tmpdir}/${asset_name}" -C "$tmpdir"
      fi

      local rust_bin
      if [ "$LARGE_DISK" = true ]; then
        rust_bin="$(find "$tmpdir" -name 'weed-volume-large-disk*' -type f | head -1)"
      else
        rust_bin="$(find "$tmpdir" -name 'weed-volume-normal*' -type f | head -1)"
      fi
      if [ -z "$rust_bin" ]; then
        rust_bin="$(find "$tmpdir" -name 'weed-volume*' -type f | head -1)"
      fi
      if [ -z "$rust_bin" ]; then
        error "Could not find weed-volume binary in archive"
      fi

      chmod +x "$rust_bin"
      local dest_name="weed-volume"
      if [ "$os" = "windows" ]; then
        dest_name="weed-volume.exe"
      fi
      install_binary "$rust_bin" "$dest_name"
      ok "Installed weed-volume to ${INSTALL_DIR}/${dest_name}"
      ;;

    *)
      error "Unknown component: ${component}. Use: weed, volume-rust, all"
      ;;
  esac
}

# Copy binary to install dir, using sudo if needed
install_binary() {
  local src="$1" name="$2"
  local dest="${INSTALL_DIR}/${name}"

  mkdir -p "$INSTALL_DIR" 2>/dev/null || true

  if [ -w "$INSTALL_DIR" ]; then
    cp "$src" "$dest"
  else
    info "Need elevated permissions to write to ${INSTALL_DIR}"
    sudo cp "$src" "$dest"
  fi
  chmod +x "$dest" 2>/dev/null || sudo chmod +x "$dest"
}

main() {
  info "SeaweedFS Installer"

  read -r os arch <<< "$(detect_platform)"
  info "Detected platform: ${os}/${arch}"

  if [ -z "$VERSION" ]; then
    info "Resolving latest release..."
    VERSION="$(get_latest_version)"
    if [ -z "$VERSION" ]; then
      error "Could not determine latest version. Specify with --version"
    fi
  fi
  info "Version: ${VERSION}"

  if [ "$LARGE_DISK" = true ]; then
    info "Variant: large disk (8TB max volume)"
  else
    info "Variant: normal (32GB max volume)"
  fi

  case "$COMPONENT" in
    all)
      install_component "weed" "$os" "$arch"
      install_component "volume-rust" "$os" "$arch"
      ;;
    *)
      install_component "$COMPONENT" "$os" "$arch"
      ;;
  esac

  echo ""
  ok "Installation complete!"
  if [ "$COMPONENT" = "weed" ] || [ "$COMPONENT" = "all" ]; then
    info "  weed:        ${INSTALL_DIR}/weed"
  fi
  if [ "$COMPONENT" = "volume-rust" ] || [ "$COMPONENT" = "all" ]; then
    info "  weed-volume: ${INSTALL_DIR}/weed-volume"
  fi
  echo ""
  info "Quick start:"
  info "  weed master                           # Start master server"
  info "  weed volume -mserver=localhost:9333   # Start Go volume server"
  info "  weed-volume -mserver localhost:9333   # Start Rust volume server"
}

main
seaweed-volume/Cargo.lock — generated file, 5255 lines (diff suppressed because it is too large)
seaweed-volume/Cargo.toml — new file, 137 lines
@@ -0,0 +1,137 @@
[package]
name = "weed-volume"
version = "0.1.0"
edition = "2021"
description = "SeaweedFS Volume Server — Rust implementation"

[lib]
name = "seaweed_volume"

[[bin]]
name = "weed-volume"
path = "src/main.rs"

[features]
# Default: 5-byte offsets (8TB max volume size), matching production Go builds (-tags 5BytesOffset).
# Disable with --no-default-features for 4-byte offsets (32GB max volume size).
default = ["5bytes"]
5bytes = []

[dependencies]
# Async runtime
tokio = { version = "1", features = ["full"] }
tokio-stream = "0.1"
tokio-io-timeout = "1"

# gRPC + protobuf
tonic = { version = "0.12", features = ["tls"] }
tonic-reflection = "0.12"
prost = "0.13"
prost-types = "0.13"

# HTTP server
axum = { version = "0.7", features = ["multipart"] }
http-body = "1"
hyper = { version = "1", features = ["full"] }
hyper-util = { version = "0.1", features = ["tokio", "service", "server-auto", "http1", "http2"] }
tower = "0.4"
tower-http = { version = "0.5", features = ["cors", "trace"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Metrics
prometheus = { version = "0.13", default-features = false, features = ["process"] }
lazy_static = "1"

# JWT
jsonwebtoken = { version = "10", features = ["rust_crypto"] }

# TLS
rustls = "0.23"
tokio-rustls = "0.26"
rustls-pemfile = "2"

# LevelDB-compatible needle index via rusty-leveldb (pure Rust, no C deps)
rusty-leveldb = "3"

# Disk-backed needle map (alternative to in-memory HashMap)
redb = "3"

# Reed-Solomon erasure coding
reed-solomon-erasure = "6"

# Logging
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
pprof = { version = "0.15", features = ["prost-codec"] }

# Config
toml = "0.8"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
serde_urlencoded = "0.7"

# CRC32 — using Castagnoli polynomial (CRC32-C), matching Go's crc32.Castagnoli
crc32c = "0.6"
crc32fast = "1"

# Memory-mapped files
memmap2 = "0.9"

# UUID
uuid = { version = "1", features = ["v4"] }

# HTTP client (for proxying, remote fetch)
reqwest = { version = "0.12", features = ["rustls-tls", "stream", "multipart", "json"] }

# Content hashing
md-5 = "0.10"
base64 = "0.22"

# Compression
flate2 = "1"

# Image processing
image = { version = "0.25", default-features = false, features = ["png", "jpeg", "gif", "webp"] }
kamadak-exif = "0.5"

# Multipart form-data parsing
multer = "3"

# MIME type guessing from file extensions
mime_guess = "2"

# Misc
bytes = "1"
rand = "0.8"
chrono = "0.4"
hex = "0.4"
parking_lot = "0.12"
dashmap = "6"
thiserror = "1"
anyhow = "1"
async-trait = "0.1"
futures = "0.3"
async-stream = "0.3"
x509-parser = "0.16"

# Disk space checking
sysinfo = "0.31"
libc = "0.2"

# AWS S3 SDK (for remote storage backends)
aws-config = { version = "1", features = ["behavior-version-latest"] }
aws-sdk-s3 = { version = "1.125.0", default-features = false, features = ["sigv4a", "http-1x", "default-https-client", "rt-tokio"] }
aws-credential-types = "1"
aws-types = "1"

[dev-dependencies]
tempfile = "3"

[build-dependencies]
tonic-build = "0.12"

[patch.crates-io]
reed-solomon-erasure = { path = "vendor/reed-solomon-erasure" }
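The `5bytes` feature's 8TB/32GB limits fall out of the offset width. Assuming Go's needle padding size of 8 (offsets are stored in units of 8 bytes, which is what makes the 32GB and 8TB figures above work out), a sketch of the arithmetic:

```rust
// Why 4-byte vs 5-byte offsets give 32GB vs 8TB volumes: offsets are stored
// in units of NEEDLE_PADDING_SIZE bytes, so an N-byte offset field can
// address 2^(8N) * 8 bytes of .dat file. The constant mirrors Go's
// NeedlePaddingSize (an assumption here, consistent with the stated limits).
const NEEDLE_PADDING_SIZE: u64 = 8;

fn max_volume_bytes(offset_field_bytes: u32) -> u64 {
    (1u64 << (8 * offset_field_bytes)) * NEEDLE_PADDING_SIZE
}

fn main() {
    assert_eq!(max_volume_bytes(4), 32u64 << 30); // 4-byte offsets → 32 GiB
    assert_eq!(max_volume_bytes(5), 8u64 << 40);  // 5-byte offsets → 8 TiB
    println!("offset width limits ok");
}
```

This is also why a mismatched build (4-byte Rust against a 5BytesOffset Go volume) must be rejected at load time rather than silently misreading offsets.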
seaweed-volume/DEV_PLAN.md — new file, 105 lines
@@ -0,0 +1,105 @@
# Rust Volume Server — Dev Plan

## Current Status (2026-03-07)

**HTTP tests**: 53/53 pass (100%)
**gRPC tests**: 56/56 pass (100%) — includes TestVolumeMoveHandlesInFlightWrites with Rust multi-volume cluster
**Rust integration tests**: 8/8 pass
**S3 remote storage tests**: 3/3 pass
**Total**: 117/117 (100%) + 8 Rust + 3 S3 tests
**Rust unit tests**: 137 lib + 7 integration = 144

## Completed Features

All phases from the original plan are complete:

- **Phase 1** — HTTP Core: CORS, OPTIONS, unsupported methods, static assets, path routing,
  cookie validation, conditional headers, range requests, dedup 204, content-encoding,
  readDeleted, chunk manifests, multipart validation, MD5 check, file size limit,
  upload/download throttling, image resize/crop, download disposition
- **Phase 2** — JWT/Security: signing keys from security.toml, token source precedence
  (query > header > cookie), file_id claims, leeway=0
- **Phase 3** — gRPC: maintenance mode, error message parity, ping routing, batch delete,
  VolumeServerStatus (with real disk stats, data_center, rack), ReadVolumeFileStatus
  (with timestamps)
- **Phase 4** — Streaming gRPC: VolumeIncrementalCopy, CopyFile, ReceiveFile, ReadAllNeedles,
  VolumeTailSender, VolumeCopy, VolumeTailReceiver, VacuumVolumeCheck
- **Phase 5** — EC Shards: mount/unmount, delete, read, blob delete, rebuild, shards-to-volume,
  copy, info
- **Phase 6** — Advanced gRPC: ScrubVolume, ScrubEcVolume, Query, FetchAndWriteNeedle,
  VolumeTierMoveDat (error paths)
- **Phase 7** — Remote Storage: S3-compatible backend via aws-sdk-s3,
  FetchAndWriteNeedle reads from S3/MinIO/SeaweedFS S3 and writes locally.
  Supports all S3-compatible providers (AWS, Wasabi, Backblaze, Aliyun, etc.)
- **Master Heartbeat** — Bidirectional streaming SendHeartbeat RPC, volume/EC registration,
  leader changes, shutdown deregistration. Tested end-to-end with Go master.
- **Production Sprint 1** — Quick wins:
  - VolumeMarkReadonly master notification (triggers immediate heartbeat)
  - Compaction throttling (`maybe_throttle_compaction()`)
  - File size limit enforcement on upload
  - `ts` query param for custom timestamps (upload + delete)
  - TTL expiration check (was already implemented)
  - Health check heartbeat status (returns 503 if disconnected from master)
  - preStopSeconds graceful drain before shutdown
  - S3 response passthrough headers (content-encoding, expires, content-language, content-disposition)
  - .vif persistence for readonly state across restarts
  - WebP image support for resize
- **Production Sprint 2** — Compatibility:
  - MIME type extraction from Content-Type header
  - Stats endpoints (/stats/counter, /stats/memory, /stats/disk)
  - JSON pretty print (?pretty=y) and JSONP (?callback=fn)
  - Request ID generation (UUID if x-amz-request-id missing)
  - Advanced Prometheus metrics (INFLIGHT_REQUESTS, VOLUME_FILE_COUNT gauges)
- **Production Sprint 3** — Streaming & Multi-node:
  - Streaming reads for large files (>1MB) via http_body::Body trait with spawn_blocking
  - Meta-only needle reads (NeedleStreamInfo) to avoid loading full body for streaming
  - Multi-volume Rust cluster support (RustMultiVolumeCluster test framework)
  - TestVolumeMoveHandlesInFlightWrites now uses Rust volume servers
  - CI skip list cleaned up (all tests pass with Rust)
- **Production Sprint 4** — Advanced Features:
  - BatchDelete EC shard support (ecx index lookup + ecj journal deletion)
  - JPEG EXIF orientation auto-fix on upload (kamadak-exif + image crate)
  - Async batched write processing (mpsc queue, up to 128 entries per batch)
  - VolumeTierMoveDatToRemote/FromRemote (S3 multipart upload/download)
  - S3TierRegistry for managing remote storage backends
  - VolumeInfo (.vif) persistence for remote file references
- **Production Sprint 5** — Upload Compatibility:
  - TTL query parameter extraction during upload (`ttl=3m`)
  - Auto-compression for compressible file types (text/*, .js, .css, .json, .svg, etc.)
  - Seaweed-* custom metadata headers stored as needle pairs (JSON, max 64KB)
  - Filename extraction from URL path stored in needle name field
  - Upload response includes filename
- **Production Sprint 6** — Storage & Networking:
  - Redb disk-backed needle maps (pure Rust, no C deps) via `NeedleMap` enum
  - Binary search for `VolumeIncrementalCopy` with `since_ns > 0`
  - Proxy/redirect read modes for non-local volumes (master lookup, HTTP proxy, 301 redirect)

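The Sprint 6 binary search for `since_ns` reduces to finding the first index entry strictly newer than a timestamp, given entries sorted by append time. A minimal stdlib sketch (the `(offset, append_ns)` tuple type is illustrative, not the real index entry struct):

```rust
// Locate the first entry with append_ns > since_ns via binary search, so an
// incremental copy can skip everything the receiver already has. Entries are
// assumed sorted by append timestamp, as an append-only .idx naturally is.
fn first_newer(entries: &[(u64, u64)], since_ns: u64) -> usize {
    // partition_point returns the count of leading entries where the
    // predicate holds, i.e. the index of the first entry newer than since_ns.
    entries.partition_point(|&(_offset, append_ns)| append_ns <= since_ns)
}

fn main() {
    let entries = [(0u64, 100u64), (8, 200), (16, 300)];
    assert_eq!(first_newer(&entries, 150), 1); // resume from the second entry
    assert_eq!(first_newer(&entries, 300), 3); // nothing newer: no-op copy
    assert_eq!(first_newer(&entries, 0), 0);   // full copy from the start
    println!("binary search ok");
}
```

Compared with a linear scan of the index, this makes repeated incremental copies O(log n) to locate the resume point.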
## Remaining Work (Production Readiness)

No major remaining items. All phases and production sprints are complete.

## Test Commands

```bash
# Build
cd seaweed-volume && cargo build --release

# Run all Go integration tests with Rust volume server
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 1200s ./test/volume_server/grpc/... ./test/volume_server/http/...

# Run S3 remote storage tests
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 180s -run "TestFetchAndWriteNeedle(FromS3|S3NotFound)" ./test/volume_server/grpc/...

# Run specific test
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 60s -run "TestName" ./test/volume_server/http/...

# Run Rust unit tests
cd seaweed-volume && cargo test

# Test heartbeat with Go master
weed master -port=9333 &
seaweed-volume --port 8080 --master localhost:9333 --dir /tmp/vol1 --max 7
curl http://localhost:9333/dir/status   # should show Rust volume server registered
```
seaweed-volume/MISSING_FEATURES.md — new file, 288 lines
@@ -0,0 +1,288 @@
# Rust Volume Server — Missing Features Audit

Comprehensive line-by-line comparison of Go vs Rust volume server.
Generated 2026-03-07 from 4 parallel audits covering HTTP, gRPC, storage, and infrastructure.

## Executive Summary

| Area | Total Features | Implemented | Partial | Missing |
|------|---------------|-------------|---------|---------|
| gRPC RPCs | 48 | 43 (90%) | 2 (4%) | 3 (6%) |
| HTTP Handlers | 31 | 12 (39%) | 10 (32%) | 9 (29%) |
| Storage Layer | 22 | 6 (27%) | 7 (32%) | 9 (41%) |
| Infrastructure | 14 | 5 (36%) | 4 (29%) | 5 (36%) |

---

## Priority 1 — Critical for Production

### P1.1 Streaming / Meta-Only Reads
- **Go**: `ReadNeedleMeta()`, `ReadNeedleData()`, `ReadPagedData()` — reads only metadata or pages of large files
- **Go**: `streamWriteResponseContent()` streams needle data in chunks
- **Go**: `AttemptMetaOnly` / `MustMetaOnly` flags in `ReadOption`
- **Rust**: Always reads the entire needle into memory
- **Impact**: OOM risk on large files; an 8MB file costs 8MB of heap per request
- **Files**: `weed/storage/needle/needle_read.go`, `weed/server/volume_server_handlers_read.go`
- **Effort**: Medium
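The P1.1 fix amounts to a bounded-buffer copy loop instead of a whole-body read. A minimal stdlib sketch (the 64 KiB chunk size and function name are illustrative, not the actual implementation):

```rust
use std::io::{Read, Write};

// Stream a needle body in fixed-size chunks rather than loading it fully
// into memory: heap usage stays at one buffer regardless of file size.
fn stream_copy<R: Read, W: Write>(mut src: R, mut dst: W) -> std::io::Result<u64> {
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total); // EOF
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}

fn main() {
    let data = vec![7u8; 200_000]; // larger than one chunk
    let mut out = Vec::new();
    let copied = stream_copy(&data[..], &mut out).unwrap();
    assert_eq!(copied, 200_000);
    assert_eq!(out, data);
    println!("streamed {copied} bytes");
}
```

In the async server the same shape appears as a `Body` implementation fed from `spawn_blocking` reads, but the memory argument is identical: one buffer per in-flight request instead of one full needle.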
### P1.2 Download Proxy/Redirect Fallback (ReadMode)
- **Go**: `ReadMode` config: "local" | "proxy" | "redirect"
- **Go**: `tryProxyToReplica()` probes replicas, `proxyReqToTargetServer()` streams response
- **Rust**: Always returns 404 for non-local volumes
- **Impact**: Clients must handle volume placement themselves; breaks transparent replication
- **Files**: `weed/server/volume_server_handlers_read.go:138-250`
- **Effort**: Medium

### P1.3 TLS/HTTPS Support
- **Go**: `LoadServerTLS()`, `LoadClientTLS()`, cert/key loading from security.toml
- **Go**: Applied to both HTTP and gRPC servers
- **Rust**: No TLS at all — plain TCP only
- **Impact**: Cannot deploy in secure clusters
- **Files**: `weed/security/tls.go`, `weed/command/volume.go`
- **Effort**: Medium (rustls + tokio-rustls already in Cargo.toml)

### P1.4 VolumeMarkReadonly/Writable Master Notification
- **Go**: `notifyMasterVolumeReadonly()` updates master with readonly state
- **Rust**: Only sets local in-memory flag
- **Impact**: Master keeps directing writes to readonly volume
- **Files**: `weed/server/volume_grpc_admin.go`
- **Effort**: Low

### P1.5 Compaction/Maintenance Throttling
- **Go**: `WriteThrottler` with `MaybeSlowdown()` for MB/s rate limiting
- **Rust**: Flags parsed but no throttle implementation
- **Impact**: Compaction/copy operations can saturate disk IO
- **Files**: `weed/util/throttler.go`
- **Effort**: Low
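The P1.5 gap is small to close: rate limiting here is just "sleep when ahead of the configured MB/s". A minimal sketch in the spirit of Go's `WriteThrottler.MaybeSlowdown` (struct and method names are illustrative, not the real API):

```rust
use std::time::{Duration, Instant};

// Minimal MB/s write throttler sketch: track bytes written since start and
// sleep whenever actual throughput runs ahead of the configured rate.
struct WriteThrottler {
    limit_bytes_per_sec: u64,
    window_start: Instant,
    written: u64,
}

impl WriteThrottler {
    fn new(mbps: u64) -> Self {
        Self {
            limit_bytes_per_sec: mbps * 1024 * 1024,
            window_start: Instant::now(),
            written: 0,
        }
    }

    /// Record `n` written bytes; sleep if we are ahead of the configured rate.
    fn maybe_slowdown(&mut self, n: u64) {
        if self.limit_bytes_per_sec == 0 {
            return; // 0 means unlimited
        }
        self.written += n;
        let expected = Duration::from_secs_f64(self.written as f64 / self.limit_bytes_per_sec as f64);
        let elapsed = self.window_start.elapsed();
        if expected > elapsed {
            std::thread::sleep(expected - elapsed);
        }
    }
}

fn main() {
    let mut t = WriteThrottler::new(1); // 1 MiB/s
    let start = Instant::now();
    t.maybe_slowdown(102_400); // 100 KiB at 1 MiB/s should pace to ~0.1s
    assert!(start.elapsed() >= Duration::from_millis(90));
    println!("throttled write paced over {:?}", start.elapsed());
}
```

Calling `maybe_slowdown` once per chunk inside compaction/copy loops bounds sustained disk bandwidth without touching the hot path when the limit is 0.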

### P1.6 File Size Limit Enforcement
- **Go**: `fileSizeLimitBytes` checked on upload, returns 400
- **Rust**: No enforcement — accepts any size
- **Impact**: Can write files larger than volume size limit
- **Files**: `weed/server/volume_server_handlers_write.go`
- **Effort**: Low

---

## Priority 2 — Important for Compatibility

### P2.1 `ts` Query Param (Custom Timestamps)
- **Go**: Upload and delete accept `ts` query param for custom Last-Modified time
- **Rust**: Always uses current time
- **Impact**: Replication timestamp fidelity; sync from external sources
- **Files**: `weed/server/volume_server_handlers_write.go`, `volume_server_handlers_admin.go`
- **Effort**: Low

### P2.2 Multipart Form Upload Parsing
- **Go**: `needle.CreateNeedleFromRequest()` parses multipart forms, extracts MIME type, custom headers/pairs
- **Rust**: Reads raw body bytes only — no multipart form parsing for metadata
- **Impact**: MIME type not stored; custom needle pairs not supported
- **Files**: `weed/storage/needle/needle.go:CreateNeedleFromRequest`
- **Effort**: Medium

### P2.3 JPEG Orientation Auto-Fix
- **Go**: `images.FixJpgOrientation()` on upload when enabled
- **Rust**: Not implemented (flag exists but unused)
- **Impact**: Mobile uploads may display rotated
- **Files**: `weed/images/orientation.go`
- **Effort**: Low (exif crate)

### P2.4 TTL Expiration Enforcement
- **Go**: Checks `HasTtl()` + `AppendAtNs` against current time on read path
- **Rust**: TTL struct exists but no expiration checking
- **Impact**: Expired needles still served
- **Files**: `weed/storage/needle/volume_ttl.go`, `weed/storage/volume_read.go`
- **Effort**: Low
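The P2.4 read-path check is a single comparison of append time plus TTL against the current clock. A sketch with illustrative parameter names (the real code compares against Go's `AppendAtNs` in the needle map entry):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Read-path TTL check sketch: a needle whose append time plus TTL is in the
// past should be treated as not found (404), matching the Go behavior of
// refusing to serve expired needles. A TTL of 0 means "never expires".
fn is_expired(append_at_ns: u64, ttl_secs: u64, now: SystemTime) -> bool {
    if ttl_secs == 0 {
        return false; // no TTL set
    }
    let now_ns = now.duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64;
    append_at_ns.saturating_add(ttl_secs * 1_000_000_000) < now_ns
}

fn main() {
    let now = SystemTime::now();
    let now_ns = now.duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64;
    assert!(!is_expired(now_ns, 0, now));    // no TTL → never expires
    assert!(!is_expired(now_ns, 3600, now)); // written just now, 1h TTL
    assert!(is_expired(0, 1, now));          // written at epoch, 1s TTL
    println!("ttl checks ok");
}
```

The check belongs on the read path rather than relying on vacuum alone, so an expired-but-not-yet-vacuumed needle is never served.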

### P2.5 Health Check — Master Heartbeat Status
- **Go**: Returns 503 if not heartbeating (can't reach master)
- **Rust**: Only checks `is_stopping` flag
- **Impact**: Load balancers won't detect disconnected volume servers
- **Files**: `weed/server/volume_server.go`
- **Effort**: Low

### P2.6 Stats Endpoints
- **Go**: `/stats/counter`, `/stats/memory`, `/stats/disk` (whitelist-guarded)
- **Rust**: Not implemented
- **Impact**: No operational visibility
- **Files**: `weed/server/volume_server.go`
- **Effort**: Low

### P2.7 WebP Image Support
- **Go**: `.webp` included in resize-eligible extensions
- **Rust**: Only `.png`, `.jpg`, `.jpeg`, `.gif`
- **Impact**: WebP images can't be resized on read
- **Files**: `weed/server/volume_server_handlers_read.go`
- **Effort**: Low (add webp feature to image crate)

### P2.8 preStopSeconds Graceful Drain
- **Go**: Stops heartbeat, waits N seconds, then shuts down servers
- **Rust**: Immediate shutdown on signal
- **Impact**: In-flight requests dropped; Kubernetes readiness race
- **Files**: `weed/command/volume.go`
- **Effort**: Low

### P2.9 S3 Response Passthrough Headers
- **Go**: `response-content-encoding`, `response-expires`, `response-content-language` query params
- **Rust**: Only handles `response-content-type`, `response-cache-control`, `dl`
- **Impact**: S3-compatible GET requests missing some override headers
- **Files**: `weed/server/volume_server_handlers_read.go`
- **Effort**: Low

---

## Priority 3 — Storage Layer Gaps
|
||||
|
||||
### P3.1 LevelDB Needle Maps
|
||||
- **Go**: 5 needle map variants: memory, LevelDB, LevelDB-medium, LevelDB-large, sorted-file
|
||||
- **Rust**: Memory-only needle map
|
||||
- **Impact**: Large volumes (millions of needles) require too much RAM
|
||||
- **Files**: `weed/storage/needle_map_leveldb.go`
|
||||
- **Effort**: High (need LevelDB binding or alternative)
|
||||
|
||||
### P3.2 Async Request Processing
|
||||
- **Go**: `asyncRequestsChan` with 128-entry queue, worker goroutine for batched writes
|
||||
- **Rust**: All writes synchronous
|
||||
- **Impact**: Write throughput limited by fsync latency
|
||||
- **Files**: `weed/storage/needle/async_request.go`
|
||||
- **Effort**: Medium
|
||||
### P3.3 Volume Scrubbing (Data Integrity)
- **Go**: `ScrubIndex()`, `scrubVolumeData()` — full data + index verification
- **Rust**: Stub only in gRPC (returns OK without actual scrubbing)
- **Impact**: No way to verify data integrity
- **Files**: `weed/storage/volume_checking.go`, `weed/storage/idx/check.go`
- **Effort**: Medium

### P3.4 Volume Backup / Sync
- **Go**: Streaming backup, binary search for last modification, index generation scanner
- **Rust**: Not implemented
- **Impact**: No backup/restore capability
- **Files**: `weed/storage/volume_backup.go`
- **Effort**: Medium

### P3.5 Volume Info (.vif) Persistence
- **Go**: `.vif` files store tier/remote metadata; readonly state persists across restarts
- **Rust**: No `.vif` support; readonly is in-memory only
- **Impact**: Readonly state lost on restart; no tier metadata
- **Files**: `weed/storage/volume_info/volume_info.go`
- **Effort**: Low

### P3.6 Disk Location Features
- **Go**: Directory UUID tracking, disk space monitoring, min-free-space enforcement, tag-based grouping
- **Rust**: Basic directory only
- **Impact**: No disk-full protection
- **Files**: `weed/storage/disk_location.go`
- **Effort**: Medium

### P3.7 Compact Map (Memory-Efficient Needle Map)
- **Go**: `CompactMap` with overflow handling for memory optimization
- **Rust**: Uses standard HashMap
- **Impact**: Higher memory usage for index
- **Files**: `weed/storage/needle_map/compact_map.go`
- **Effort**: Medium
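The compact-map idea can be sketched in a few lines: keep the common, mostly-ascending keys in a dense sorted vector and spill the rare out-of-order updates into an overflow tree. This is a simplified illustration, not Go's exact `CompactMap` layout.

```rust
use std::collections::BTreeMap;

/// Per-needle index value: offset and size in the .dat file.
#[derive(Clone, Copy, Debug, PartialEq)]
struct NeedleValue {
    offset: u32,
    size: u32,
}

/// Dense, append-ordered index for keys that arrive in increasing order,
/// plus an overflow tree for out-of-order inserts and re-sets.
struct CompactMap {
    sorted: Vec<(u64, NeedleValue)>,
    overflow: BTreeMap<u64, NeedleValue>,
}

impl CompactMap {
    fn new() -> Self {
        CompactMap { sorted: Vec::new(), overflow: BTreeMap::new() }
    }

    fn set(&mut self, key: u64, value: NeedleValue) {
        match self.sorted.last() {
            // Out-of-order or duplicate key: spill to the overflow tree.
            Some(&(last, _)) if key <= last => {
                self.overflow.insert(key, value);
            }
            _ => self.sorted.push((key, value)),
        }
    }

    fn get(&self, key: u64) -> Option<NeedleValue> {
        // Overflow wins: it holds the most recent value for re-set keys.
        if let Some(v) = self.overflow.get(&key) {
            return Some(*v);
        }
        self.sorted
            .binary_search_by_key(&key, |&(k, _)| k)
            .ok()
            .map(|i| self.sorted[i].1)
    }
}
```

Because needle IDs in a volume are usually assigned in increasing order, almost all entries land in the dense vector (16 bytes each) rather than paying HashMap per-entry overhead.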
---

## Priority 4 — Nice to Have

### P4.1 gRPC: VolumeTierMoveDatToRemote / FromRemote
- **Go**: Full streaming implementation for tiering volumes to/from S3
- **Rust**: Stub returning error
- **Files**: `weed/server/volume_grpc_tier_upload.go`, `volume_grpc_tier_download.go`
- **Effort**: High

### P4.2 gRPC: Query (S3 Select)
- **Go**: JSON/CSV query over needle data (S3 Select compatible)
- **Rust**: Stub returning error
- **Files**: `weed/server/volume_grpc_query.go`
- **Effort**: High

### P4.3 FetchAndWriteNeedle — Already Implemented
- **Note**: The gRPC audit incorrectly flagged this as missing. It was implemented in a prior session with full S3 remote storage support.

### P4.4 JSON Pretty Print + JSONP
- **Go**: `?pretty` query param for indented JSON; `?callback=fn` for JSONP
- **Rust**: Neither supported
- **Effort**: Low

### P4.5 Request ID Generation
- **Go**: Generates a UUID if the `x-amz-request-id` header is missing, propagates it to the gRPC context
- **Rust**: Only echoes an existing header
- **Effort**: Low

### P4.6 UI Status Page
- **Go**: Full HTML template with volumes, disks, stats, uptime
- **Rust**: Stub HTML
- **Effort**: Medium

### P4.7 Advanced Prometheus Metrics
- **Go**: InFlightRequestsGauge, ConcurrentUploadLimit/DownloadLimit gauges, metrics push gateway
- **Rust**: Basic request counter and histogram only
- **Effort**: Low

### P4.8 Profiling (pprof)
- **Go**: CPU/memory profiling, `/debug/pprof` endpoints
- **Rust**: Flags parsed but not wired
- **Effort**: Medium (tokio-console or pprof-rs)

### P4.9 EC Distribution / Rebalancing
- **Go**: 17 files for EC operations including placement strategies, recovery, scrubbing
- **Rust**: 6 files with basic encoder/decoder
- **Effort**: High

### P4.10 Cookie Mismatch Status Code
- **Go**: Returns 406 Not Acceptable
- **Rust**: Returns 400 Bad Request
- **Effort**: Trivial

---

## Implementation Order Recommendation

### Sprint 1 — Quick Wins (Low effort, high impact) ✅ DONE

1. ✅ P1.4 VolumeMarkReadonly master notification — triggers an immediate heartbeat
2. ✅ P1.5 Compaction throttling — `maybe_throttle_compaction()` method added
3. ✅ P1.6 File size limit enforcement — checks `file_size_limit_bytes` on upload
4. ✅ P2.1 `ts` query param — custom timestamps for upload and delete
5. ✅ P2.4 TTL expiration check — was already implemented
6. ✅ P2.5 Health check heartbeat status — returns 503 if not heartbeating
7. ✅ P2.8 preStopSeconds — graceful drain delay before shutdown
8. ✅ P2.9 S3 passthrough headers — content-encoding, expires, content-language, content-disposition
9. ✅ P3.5 `.vif` persistence — readonly state persists across restarts
10. ✅ P2.7 WebP support — added to image resize-eligible extensions
11. ~~P4.10 Cookie 406~~ — Go actually uses 404 for HTTP cookie mismatch (406 applies only to gRPC batch delete)

### Sprint 2 — Core Read Path (Medium effort) — Partially Done

1. P1.1 Streaming / meta-only reads — TODO (medium effort, no test coverage yet)
2. ✅ P1.2 ReadMode proxy/redirect — was already implemented and tested
3. ✅ P2.2 Multipart form parsing — MIME type extraction from the Content-Type header
4. P2.3 JPEG orientation fix — TODO (low effort, needs an EXIF crate)
5. ✅ P2.6 Stats endpoints — /stats/counter, /stats/memory, /stats/disk
6. ✅ P2.7 WebP support — done in Sprint 1
7. ✅ P4.4 JSON pretty print + JSONP — `?pretty=y` and `?callback=fn`
8. ✅ P4.5 Request ID generation — generates a UUID if `x-amz-request-id` is missing
9. ✅ P4.7 Advanced Prometheus metrics — INFLIGHT_REQUESTS gauge, VOLUME_FILE_COUNT gauge

### Sprint 3 — Infrastructure (Medium effort) — Partially Done

1. ✅ P1.3 TLS/HTTPS — rustls + tokio-rustls for HTTP, tonic ServerTlsConfig for gRPC
2. P3.2 Async request processing — TODO (medium effort)
3. ✅ P3.3 Volume scrubbing — CRC checksum verification of all needles
4. ✅ P3.6 Disk location features — MinFreeSpace enforcement, background disk monitor

### Sprint 4 — Storage Advanced (High effort) — Deferred

No integration test coverage for these items. All existing tests pass.

1. P3.1 LevelDB needle maps — needed only for volumes with millions of needles
2. P3.4 Volume backup/sync — streaming backup, binary search
3. P3.7 Compact map — memory optimization for the needle index
4. P4.1 VolumeTierMoveDat — full S3 tiering (currently an error stub)
5. P4.9 EC distribution — advanced EC placement/rebalancing

### Sprint 5 — Polish — Deferred

No integration test coverage for these items.

1. P4.2 Query (S3 Select) — JSON/CSV query over needle data
2. ✅ P4.4 JSON pretty/JSONP — done in Sprint 2
3. ✅ P4.5 Request ID generation — done in Sprint 2
4. P4.6 UI status page — HTML template with volume/disk/stats info
5. ✅ P4.7 Advanced metrics — done in Sprint 2
6. P4.8 Profiling — pprof-rs or tokio-console
---

*New file: `seaweed-volume/PARITY_PLAN.md` (230 lines)*
# Rust Volume Server Parity Plan

Generated: 2026-03-16

## Goal

Make `seaweed-volume` a drop-in replacement for the Go volume server by:

- comparing every Go volume-server code path against the Rust implementation,
- recording file-level ownership and verification status,
- closing verified behavior gaps one logic change per commit,
- extending tests so regressions are caught by Go parity suites and Rust unit/integration tests.

## Ground Truth

Primary Go sources:

- `weed/server/volume_server.go`
- `weed/server/volume_server_handlers*.go`
- `weed/server/volume_grpc_*.go`
- `weed/server/constants/volume.go`
- `weed/storage/store*.go`
- `weed/storage/disk_location*.go`
- `weed/storage/volume*.go`
- `weed/storage/needle/*.go`
- `weed/storage/idx/*.go`
- `weed/storage/needle_map*.go`
- `weed/storage/needle_map/*.go`
- `weed/storage/super_block/*.go`
- `weed/storage/erasure_coding/*.go`

Supporting Go dependencies that affect drop-in behavior:

- `weed/command/volume.go`
- `weed/security/*.go`
- `weed/images/*.go`
- `weed/stats/*.go`

Primary Rust sources:

- `seaweed-volume/src/main.rs`
- `seaweed-volume/src/config.rs`
- `seaweed-volume/src/security.rs`
- `seaweed-volume/src/images.rs`
- `seaweed-volume/src/server/*.rs`
- `seaweed-volume/src/storage/*.rs`
- `seaweed-volume/src/storage/needle/*.rs`
- `seaweed-volume/src/storage/idx/*.rs`
- `seaweed-volume/src/storage/erasure_coding/*.rs`
- `seaweed-volume/src/remote_storage/*.rs`

## Audit Method

For each Go file:

1. Map it to the Rust file or files that should own the same behavior.
2. Compare exported entry points, helper functions, state transitions, wire fields, and persistence side effects.
3. Mark each file `implemented`, `partial`, `missing`, or `needs verification`.
4. Link each behavior to an existing test or add a missing test.
5. Only treat a gap as closed after code review plus local verification.

## Acceptance Criteria

The Rust server is a drop-in replacement only when all of these hold:

- HTTP routes, status codes, headers, and body semantics match Go.
- gRPC RPCs match Go request validation, response fields, streaming behavior, and maintenance/read-only semantics.
- Master heartbeat and topology metadata match Go closely enough that the Go master treats Rust and Go volume servers the same.
- On-disk volume behavior matches Go for normal volumes, EC shards, tiering metadata, and readonly persistence.
- Startup flags and operational endpoints that affect production deployment behave equivalently or are explicitly documented as unsupported.
- Existing Go integration suites pass with `VOLUME_SERVER_IMPL=rust`.
## File Matrix

### HTTP server surface

| Go file | Rust counterpart | Status | Comparison focus |
| --- | --- | --- | --- |
| `weed/server/volume_server.go` | `seaweed-volume/src/main.rs`, `seaweed-volume/src/server/volume_server.rs`, `seaweed-volume/src/server/heartbeat.rs` | partial | startup wiring, routers, heartbeat, shutdown, metrics/debug listeners |
| `weed/server/volume_server_handlers.go` | `seaweed-volume/src/server/volume_server.rs`, `seaweed-volume/src/server/handlers.rs` | needs verification | method dispatch, OPTIONS behavior, public/admin split |
| `weed/server/volume_server_handlers_admin.go` | `seaweed-volume/src/server/handlers.rs` | implemented | `/status`, `/healthz`, stats, server headers |
| `weed/server/volume_server_handlers_helper.go` | `seaweed-volume/src/server/handlers.rs` | needs verification | JSON encoding, request parsing, helper parity |
| `weed/server/volume_server_handlers_read.go` | `seaweed-volume/src/server/handlers.rs` | needs verification | JWT, conditional reads, range reads, proxy/redirect, chunk manifests, image transforms |
| `weed/server/volume_server_handlers_ui.go` | `seaweed-volume/src/server/handlers.rs`, embedded assets | partial | UI payload and HTML parity |
| `weed/server/volume_server_handlers_write.go` | `seaweed-volume/src/server/handlers.rs`, `seaweed-volume/src/images.rs` | needs verification | multipart parsing, metadata, compression, ts, delete semantics |
| `weed/server/constants/volume.go` | `seaweed-volume/src/server/heartbeat.rs`, config defaults | needs verification | heartbeat timing, constants parity |

### gRPC server surface

| Go file | Rust counterpart | Status | Comparison focus |
| --- | --- | --- | --- |
| `weed/server/volume_grpc_admin.go` | `seaweed-volume/src/server/grpc_server.rs` | needs verification | readonly/writable, allocate/delete/configure/mount/unmount |
| `weed/server/volume_grpc_batch_delete.go` | `seaweed-volume/src/server/grpc_server.rs` | implemented | batch delete, EC delete path |
| `weed/server/volume_grpc_client_to_master.go` | `seaweed-volume/src/server/heartbeat.rs` | partial | heartbeat fields, leader changes, metrics settings from master |
| `weed/server/volume_grpc_copy.go` | `seaweed-volume/src/server/grpc_server.rs` | needs verification | full copy streams |
| `weed/server/volume_grpc_copy_incremental.go` | `seaweed-volume/src/server/grpc_server.rs` | needs verification | incremental copy binary search, timestamps |
| `weed/server/volume_grpc_erasure_coding.go` | `seaweed-volume/src/server/grpc_server.rs`, `seaweed-volume/src/storage/erasure_coding/*.rs` | needs verification | shard read/write/delete/mount/unmount/rebuild |
| `weed/server/volume_grpc_query.go` | `seaweed-volume/src/server/grpc_server.rs` | needs verification | query validation and error parity |
| `weed/server/volume_grpc_read_all.go` | `seaweed-volume/src/server/grpc_server.rs` | needs verification | read-all ordering and tail semantics |
| `weed/server/volume_grpc_read_write.go` | `seaweed-volume/src/server/grpc_server.rs`, `seaweed-volume/src/storage/*.rs` | needs verification | blob/meta/page reads, write blob semantics |
| `weed/server/volume_grpc_remote.go` | `seaweed-volume/src/server/grpc_server.rs`, `seaweed-volume/src/remote_storage/*.rs` | needs verification | remote fetch/write and tier metadata |
| `weed/server/volume_grpc_scrub.go` | `seaweed-volume/src/server/grpc_server.rs`, `seaweed-volume/src/storage/*.rs` | needs verification | scrub result semantics |
| `weed/server/volume_grpc_state.go` | `seaweed-volume/src/server/grpc_server.rs` | implemented | GetState/SetState/Status |
| `weed/server/volume_grpc_tail.go` | `seaweed-volume/src/server/grpc_server.rs` | needs verification | tail streaming and idle timeout |
| `weed/server/volume_grpc_tier_download.go` | `seaweed-volume/src/server/grpc_server.rs`, `seaweed-volume/src/remote_storage/*.rs` | needs verification | tier download stream/error paths |
| `weed/server/volume_grpc_tier_upload.go` | `seaweed-volume/src/server/grpc_server.rs`, `seaweed-volume/src/remote_storage/*.rs` | needs verification | tier upload stream/error paths |
| `weed/server/volume_grpc_vacuum.go` | `seaweed-volume/src/server/grpc_server.rs`, `seaweed-volume/src/storage/*.rs` | needs verification | compact/commit/cleanup progress and readonly transitions |

### Storage and persistence surface

| Go file group | Rust counterpart | Status | Comparison focus |
| --- | --- | --- | --- |
| `weed/storage/store.go`, `store_state.go` | `seaweed-volume/src/storage/store.rs`, `seaweed-volume/src/server/heartbeat.rs` | partial | topology metadata, disk tags, server id, state persistence |
| `weed/storage/store_vacuum.go` | `seaweed-volume/src/storage/store.rs`, `seaweed-volume/src/storage/volume.rs` | needs verification | vacuum sequencing |
| `weed/storage/store_ec.go`, `store_ec_delete.go`, `store_ec_scrub.go` | `seaweed-volume/src/storage/store.rs`, `seaweed-volume/src/storage/erasure_coding/*.rs` | needs verification | EC lifecycle and scrub behavior |
| `weed/storage/disk_location.go`, `disk_location_ec.go` | `seaweed-volume/src/storage/disk_location.rs`, `seaweed-volume/src/storage/store.rs` | partial | directory UUIDs, tags, load rules, disk space checks |
| `weed/storage/volume.go`, `volume_loading.go` | `seaweed-volume/src/storage/volume.rs` | needs verification | load/reload/readonly/remote metadata |
| `weed/storage/volume_super_block.go` | `seaweed-volume/src/storage/super_block.rs`, `seaweed-volume/src/storage/volume.rs` | implemented | super block parity |
| `weed/storage/volume_read.go`, `volume_read_all.go` | `seaweed-volume/src/storage/volume.rs`, `seaweed-volume/src/server/handlers.rs` | needs verification | full/meta/page reads, TTL, streaming |
| `weed/storage/volume_write.go` | `seaweed-volume/src/storage/volume.rs`, `seaweed-volume/src/server/write_queue.rs` | needs verification | dedup, sync/async writes, metadata flags |
| `weed/storage/volume_vacuum.go` | `seaweed-volume/src/storage/volume.rs` | needs verification | compact and commit parity |
| `weed/storage/volume_backup.go` | `seaweed-volume/src/storage/volume.rs`, `seaweed-volume/src/server/grpc_server.rs` | needs verification | backup/search logic |
| `weed/storage/volume_checking.go` | `seaweed-volume/src/storage/volume.rs`, `seaweed-volume/src/storage/idx/mod.rs`, `seaweed-volume/src/server/grpc_server.rs` | needs verification | scrub and integrity checks |
| `weed/storage/volume_info.go`, `volume_info/volume_info.go`, `volume_tier.go` | `seaweed-volume/src/storage/volume.rs`, `seaweed-volume/src/remote_storage/*.rs` | needs verification | `.vif` format and tiered file metadata |
| `weed/storage/needle/*.go` | `seaweed-volume/src/storage/needle/*.rs` | needs verification | needle parsing, CRC, TTL, multipart metadata |
| `weed/storage/idx/*.go` | `seaweed-volume/src/storage/idx/*.rs` | needs verification | index walking and binary search |
| `weed/storage/needle_map*.go`, `needle_map/*.go` | `seaweed-volume/src/storage/needle_map.rs` | needs verification | map kind parity, persistence, memory behavior |
| `weed/storage/super_block/*.go` | `seaweed-volume/src/storage/super_block.rs` | implemented | replica placement and TTL metadata |
| `weed/storage/erasure_coding/*.go` | `seaweed-volume/src/storage/erasure_coding/*.rs` | needs verification | EC shard placement, encode/decode, journal deletes |

### Supporting runtime surface

| Go file | Rust counterpart | Status | Comparison focus |
| --- | --- | --- | --- |
| `weed/command/volume.go` | `seaweed-volume/src/config.rs`, `seaweed-volume/src/main.rs` | partial | flags, metrics/debug listeners, startup behavior |
| `weed/security/*.go` | `seaweed-volume/src/security.rs`, `seaweed-volume/src/main.rs` | implemented | JWT and TLS loading |
| `weed/images/*.go` | `seaweed-volume/src/images.rs`, `seaweed-volume/src/server/handlers.rs` | implemented | JPEG orientation and transforms |
| `weed/stats/*.go` | `seaweed-volume/src/metrics.rs`, `seaweed-volume/src/server/handlers.rs` | partial | metrics endpoints, push-gateway integration |
## Verified Gaps As Of 2026-03-08

The startup/runtime gaps that were verified in the initial audit are now closed:

1. Heartbeat metadata parity
   Closed by `8ade1c51d` and retained in current HEAD.
2. Dedicated metrics/debug listener parity
   Closed by `fbe0e5829`.
3. Master-provided metrics push settings
   Closed by `fbe0e5829`.
4. Slow-read tuning parity
   Closed by `66e3900dc`.

There are no remaining verified gaps from the initial startup/runtime audit. The broader line-by-line comparison batches below are still required to either confirm parity or surface new gaps.

## Execution Status As Of 2026-03-16

The file-by-file comparison and verification work executed in this round:

1. Startup and harness alignment
   Compared `weed/command/volume.go`, `test/volume_server/framework/cluster*.go`, `seaweed-volume/src/config.rs`, and `seaweed-volume/src/main.rs` to ensure the Rust server is invoked with Go-compatible flags and is rebuilt from the current source during parity runs.
2. HTTP admin surface
   Compared `weed/server/volume_server_handlers_admin.go` against `seaweed-volume/src/server/handlers.rs`, with emphasis on `/status` payload shape, disk-status fields, and volume ordering.
3. gRPC admin surface
   Compared `weed/server/volume_grpc_admin.go` against `seaweed-volume/src/server/grpc_server.rs`, with emphasis on `Ping`, `VolumeConfigure`, readonly/writable flows, and error wrapping.
4. Storage/index layout
   Compared Go index-entry defaults in `weed/storage/types` and `weed/storage/idx/*.go` against the Rust default feature set in `seaweed-volume/Cargo.toml` and the Rust index reader/writer paths, to confirm that default binaries use the same offset width.
5. End-to-end parity verification
   Re-ran the Go HTTP and gRPC integration suites with `VOLUME_SERVER_IMPL=rust` after each fix to confirm wire-level compatibility.
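The index-entry comparison in step 4 hinges on the on-disk `.idx` entry layout. A sketch of decoding one default-width entry, assuming the standard 16-byte layout (8-byte needle key, 4-byte offset stored in 8-byte units, 4-byte size, all big-endian); the type and constant names here are illustrative:

```rust
/// One parsed .idx entry from a default (4-byte-offset) build.
struct IdxEntry {
    key: u64,
    file_offset: u64, // actual byte offset into the .dat file
    size: u32,
}

// Stored offsets count 8-byte alignment units, not raw bytes.
const NEEDLE_PADDING_SIZE: u64 = 8;

fn parse_idx_entry(buf: &[u8; 16]) -> IdxEntry {
    let key = u64::from_be_bytes(buf[0..8].try_into().unwrap());
    let offset = u32::from_be_bytes(buf[8..12].try_into().unwrap());
    let size = u32::from_be_bytes(buf[12..16].try_into().unwrap());
    IdxEntry {
        key,
        file_offset: offset as u64 * NEEDLE_PADDING_SIZE,
        size,
    }
}
```

A 5-byte-offset build widens only the offset field (17-byte entries), which is why a Rust binary defaulting to 5-byte offsets cannot read indexes written by a default Go build.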
### Verified mismatches closed in this round

- Rust parity runs could reuse a stale `weed-volume` binary across test invocations, hiding source and feature changes from the Go harness.
- Rust defaulted to 5-byte index offsets, while the default Go `go build` path uses 4-byte offsets unless built with `-tags 5BytesOffset`.
- Rust `/status` omitted Go fields in both `Volumes` and `DiskStatuses`, and did not sort volumes by `Id`.
- Rust `Ping` treated an empty target as a self-ping and only performed a raw gRPC connect for filer targets; Go returns `remote_time_ns=0` for the empty request and performs a real filer `Ping` RPC.
- Rust `VolumeNeedleStatus` dropped stored TTL metadata and reported `data_size` instead of Go's `Size` field.
- Rust multipart uploads ignored form fields such as `ts`, `ttl`, and `cm`, and also ignored part-level `Content-Encoding` and `Content-MD5`.
- Rust only treated `dl=true` and `dl=1` as truthy, while Go accepts the full `strconv.ParseBool` set, such as `dl=t` and `dl=True`.
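The `dl` fix amounts to matching Go's `strconv.ParseBool` literal set exactly. A sketch of an equivalent Rust helper:

```rust
/// Accept the same literals as Go's strconv.ParseBool, which backs the
/// `dl` query parameter: "1", "t", "T", "TRUE", "true", "True" are true;
/// "0", "f", "F", "FALSE", "false", "False" are false; anything else
/// is an error.
fn parse_bool_go(s: &str) -> Result<bool, String> {
    match s {
        "1" | "t" | "T" | "TRUE" | "true" | "True" => Ok(true),
        "0" | "f" | "F" | "FALSE" | "false" | "False" => Ok(false),
        _ => Err(format!("invalid bool syntax: {s:?}")),
    }
}
```

Note that the set is case-sensitive in a specific way ("TRUE" and "True" parse, "tRue" does not), so a simple lowercase-and-compare would not be wire-compatible.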
### Verification commands

- `VOLUME_SERVER_IMPL=rust go test -count=1 -timeout 1200s ./test/volume_server/http/...`
- `VOLUME_SERVER_IMPL=rust go test -count=1 -timeout 1200s ./test/volume_server/grpc/...`

## Execution Plan

### Batch 1: startup and heartbeat

- Compare `weed/command/volume.go`, `weed/server/volume_server.go`, `weed/server/volume_grpc_client_to_master.go`, `weed/storage/store.go`, and `weed/storage/disk_location.go`.
- Close metadata and startup parity gaps that affect master registration and deployment compatibility.
- Add Rust unit tests for heartbeat payloads and config wiring.

### Batch 2: HTTP read path

- Compare `volume_server_handlers_read.go`, `volume_server_handlers_helper.go`, and related storage read functions line by line.
- Verify JWT, path parsing, proxy/redirect, ranges, streaming, chunk manifests, image transforms, and response-header overrides.
- Extend `test/volume_server/http/...` and Rust handler tests where parity is not covered.

### Batch 3: HTTP write/delete path

- Compare `volume_server_handlers_write.go` and write-related storage functions.
- Verify multipart behavior, metadata, md5, compression, unchanged writes, delete edge cases, and timestamp handling.

### Batch 4: gRPC admin and lifecycle

- Compare `volume_grpc_admin.go`, `volume_grpc_state.go`, and `volume_grpc_vacuum.go`.
- Verify readonly/writable flows, maintenance mode, status payloads, mount/unmount/delete/configure, and vacuum transitions.

### Batch 5: gRPC data movement

- Compare `volume_grpc_read_write.go`, `copy*.go`, `read_all.go`, `tail.go`, `remote.go`, and `query.go`.
- Verify stream framing, binary search, idle timeout, and remote-storage semantics.

### Batch 6: storage internals

- Compare all `weed/storage` volume, needle, idx, needle map, and EC files line by line.
- Focus on persistence rules, readonly semantics, TTL, recovery/scrub, backup, and memory/disk map behavior.

## Commit Strategy

- One commit for the audit/plan document if the document itself changes.
- One commit per logic fix.
- Every logic commit must include the smallest test addition that proves the new parity claim.
---

*New file: `seaweed-volume/README.md` (140 lines)*
# SeaweedFS Volume Server (Rust)

A drop-in replacement for the [SeaweedFS](https://github.com/seaweedfs/seaweedfs) Go volume server, rewritten in Rust. It uses binary-compatible storage formats (`.dat`, `.idx`, `.vif`) and speaks the same HTTP and gRPC protocols, so it works with an unmodified Go master server.

## Building

Requires Rust 1.75+ (2021 edition).

```bash
cd seaweed-volume
cargo build --release
```

The binary is produced at `target/release/seaweed-volume`.

## Running

Start a Go master server first, then point the Rust volume server at it:

```bash
# Minimal
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7

# Multiple data directories
seaweed-volume --port 8080 --master localhost:9333 \
  --dir /mnt/ssd1,/mnt/ssd2 --max 100,100 --disk ssd

# With datacenter/rack topology
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --dataCenter dc1 --rack rack1

# With JWT authentication
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --securityFile /etc/seaweedfs/security.toml

# With TLS (configured in security.toml via [https.volume] and [grpc.volume] sections)
seaweed-volume --port 8080 --master localhost:9333 --dir /data/vol1 --max 7 \
  --securityFile /etc/seaweedfs/security.toml
```
### Common flags

| Flag | Default | Description |
|------|---------|-------------|
| `--port` | `8080` | HTTP listen port |
| `--port.grpc` | `port+10000` | gRPC listen port |
| `--master` | `localhost:9333` | Comma-separated master server addresses |
| `--dir` | `/tmp` | Comma-separated data directories |
| `--max` | `8` | Max volumes per directory (comma-separated) |
| `--ip` | auto-detect | Server IP / identifier |
| `--ip.bind` | same as `--ip` | Bind address |
| `--dataCenter` | | Datacenter name |
| `--rack` | | Rack name |
| `--disk` | | Disk type tag: `hdd`, `ssd`, or custom |
| `--index` | `memory` | Needle map type: `memory`, `leveldb`, `leveldbMedium`, `leveldbLarge` |
| `--readMode` | `proxy` | Non-local read mode: `local`, `proxy`, `redirect` |
| `--fileSizeLimitMB` | `256` | Max upload file size |
| `--minFreeSpace` | `1` (percent) | Min free disk space before marking volumes read-only |
| `--securityFile` | | Path to `security.toml` for JWT keys and TLS certs |
| `--metricsPort` | `0` (disabled) | Prometheus metrics endpoint port |
| `--whiteList` | | Comma-separated IPs with write permission |
| `--preStopSeconds` | `10` | Graceful drain period before shutdown |
| `--compactionMBps` | `0` (unlimited) | Compaction I/O rate limit |
| `--pprof` | `false` | Enable pprof HTTP handlers |

Set `RUST_LOG=debug` (or `trace`, `info`, `warn`) for log level control.
Set `SEAWEED_WRITE_QUEUE=1` to enable batched async write processing.
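A sketch of how the comma-separated `--dir`/`--max` pair might be resolved into per-directory limits. The same-length validation rule here is an assumption for illustration, not necessarily the exact Go behavior:

```rust
/// Resolve comma-separated `--dir` and `--max` flag values into
/// (directory, max_volumes) pairs. Errors if the lists differ in length
/// or a max entry is not a number (an assumed validation rule).
fn resolve_dirs(dirs: &str, maxes: &str) -> Result<Vec<(String, u32)>, String> {
    let dirs: Vec<&str> = dirs.split(',').collect();
    let maxes: Vec<u32> = maxes
        .split(',')
        .map(|m| m.trim().parse().map_err(|e| format!("bad --max {m:?}: {e}")))
        .collect::<Result<_, _>>()?;
    if dirs.len() != maxes.len() {
        return Err(format!(
            "--dir has {} entries but --max has {}",
            dirs.len(),
            maxes.len()
        ));
    }
    Ok(dirs
        .into_iter()
        .map(|d| d.trim().to_string())
        .zip(maxes)
        .collect())
}
```

So `--dir /mnt/ssd1,/mnt/ssd2 --max 100,100` (from the Running examples above) yields two directories capped at 100 volumes each.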
## Features

- **Binary compatible** -- reads and writes the same `.dat`/`.idx`/`.vif` files as the Go server; seamless migration with no data conversion.
- **HTTP + gRPC** -- full implementation of the volume server HTTP API and all gRPC RPCs including streaming operations (copy, tail, incremental copy, vacuum).
- **Master heartbeat** -- bidirectional streaming heartbeat with the Go master server; volume and EC shard registration, leader failover, graceful shutdown deregistration.
- **JWT authentication** -- signing key configuration via `security.toml` with token source precedence (query > header > cookie), file_id claims validation, and separate read/write keys.
- **TLS** -- HTTPS for the HTTP API and mTLS for gRPC, configured through `security.toml`.
- **Erasure coding** -- Reed-Solomon EC shard management: mount/unmount, read, rebuild, copy, delete, and shard-to-volume reconstruction.
- **S3 remote storage** -- `FetchAndWriteNeedle` reads from any S3-compatible backend (AWS, MinIO, Wasabi, Backblaze, etc.) and writes locally. Supports `VolumeTierMoveDatToRemote`/`FromRemote` for tiered storage.
- **Needle map backends** -- in-memory HashMap, LevelDB (via `rusty-leveldb`), or redb (pure Rust disk-backed) needle maps.
- **Image processing** -- on-the-fly resize/crop, JPEG EXIF orientation auto-fix, WebP support.
- **Streaming reads** -- large files (>1MB) are streamed via `spawn_blocking` to avoid blocking the async runtime.
- **Auto-compression** -- compressible file types (text, JSON, CSS, JS, SVG, etc.) are gzip-compressed on upload.
- **Prometheus metrics** -- counters, histograms, and gauges exported at a dedicated metrics port; optional push gateway support.
- **Graceful shutdown** -- SIGINT/SIGTERM handling with configurable `preStopSeconds` drain period.
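The JWT token-source precedence listed above (query > header > cookie) can be sketched as a simple fallback chain; the three `Option` parameters stand in for accessors on the real request object:

```rust
/// Pick the JWT from the first available source, in precedence order:
/// query-string token, then Authorization header, then cookie.
fn find_jwt(
    query_token: Option<&str>,
    auth_header: Option<&str>,
    cookie_token: Option<&str>,
) -> Option<String> {
    if let Some(t) = query_token {
        return Some(t.to_string());
    }
    if let Some(h) = auth_header {
        // Expect the conventional "Bearer <token>" form in the header.
        if let Some(t) = h.strip_prefix("Bearer ") {
            return Some(t.to_string());
        }
    }
    cookie_token.map(|t| t.to_string())
}
```

A query-string token always wins even when a header or cookie is also present, which matters for parity tests that send conflicting sources.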
## Testing

### Rust unit tests

```bash
cd seaweed-volume
cargo test
```

### Go integration tests

The Go test suite can target either the Go or Rust volume server via the `VOLUME_SERVER_IMPL` environment variable:

```bash
# Run all HTTP + gRPC integration tests against the Rust server
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 1200s \
  ./test/volume_server/grpc/... ./test/volume_server/http/...

# Run a single test
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 60s \
  -run "TestName" ./test/volume_server/http/...

# Run S3 remote storage tests
VOLUME_SERVER_IMPL=rust go test -v -count=1 -timeout 180s \
  -run "TestFetchAndWriteNeedle" ./test/volume_server/grpc/...
```

## Load testing

A load test harness is available at `test/volume_server/loadtest/`. See that directory for usage instructions and scenarios.

## Architecture

The server runs three listeners concurrently:

- **HTTP** (Axum 0.7) -- admin and public routers for file upload/download, status, and stats endpoints.
- **gRPC** (Tonic 0.12) -- all `VolumeServer` RPCs from the SeaweedFS protobuf definition.
- **Metrics** (optional) -- Prometheus scrape endpoint on a separate port.
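The three-listener layout can be sketched with plain threads standing in for the real async tasks; each closure is a placeholder for the actual Axum, Tonic, or metrics server loop:

```rust
use std::thread;

/// Run the three listeners side by side and return once all have stopped.
/// The closure bodies are placeholders for the real server loops.
fn run_listeners() -> Vec<&'static str> {
    let handles = vec![
        thread::spawn(|| { /* serve HTTP (Axum) here */ "http" }),
        thread::spawn(|| { /* serve gRPC (Tonic) here */ "grpc" }),
        thread::spawn(|| { /* serve /metrics scrape endpoint here */ "metrics" }),
    ];
    handles
        .into_iter()
        .map(|h| h.join().expect("listener panicked"))
        .collect()
}
```

In the actual server these are tokio tasks joined on a shutdown signal rather than OS threads; the point is only that the process exits when every listener has drained.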
Key source modules:

| Path | Description |
|------|-------------|
| `src/main.rs` | Entry point, server startup, signal handling |
| `src/config.rs` | CLI parsing and configuration resolution |
| `src/server/volume_server.rs` | HTTP router setup and middleware |
| `src/server/handlers.rs` | HTTP request handlers (read, write, delete, status) |
| `src/server/grpc_server.rs` | gRPC service implementation |
| `src/server/heartbeat.rs` | Master heartbeat loop |
| `src/storage/volume.rs` | Volume read/write/delete logic |
| `src/storage/needle.rs` | Needle (file entry) serialization |
| `src/storage/store.rs` | Multi-volume store management |
| `src/security.rs` | JWT validation and IP whitelist guard |
| `src/remote_storage/` | S3 remote storage backend |

See [DEV_PLAN.md](DEV_PLAN.md) for the full development history and feature checklist.
---

*New file: `seaweed-volume/build.rs` (17 lines)*
```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let out_dir = std::path::PathBuf::from(std::env::var("OUT_DIR")?);
    tonic_build::configure()
        .build_server(true)
        .build_client(true)
        .file_descriptor_set_path(out_dir.join("seaweed_descriptor.bin"))
        .compile_protos(
            &[
                "proto/volume_server.proto",
                "proto/master.proto",
                "proto/remote.proto",
                "../weed/pb/filer.proto",
            ],
            &["proto/", "../weed/pb/"],
        )?;
    Ok(())
}
```
---

*New file: `seaweed-volume/proto/master.proto` (474 lines)*
syntax = "proto3";

package master_pb;

option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/master_pb";

import "volume_server.proto";

//////////////////////////////////////////////////

service Seaweed {
    rpc SendHeartbeat (stream Heartbeat) returns (stream HeartbeatResponse) {
    }
    rpc KeepConnected (stream KeepConnectedRequest) returns (stream KeepConnectedResponse) {
    }
    rpc LookupVolume (LookupVolumeRequest) returns (LookupVolumeResponse) {
    }
    rpc Assign (AssignRequest) returns (AssignResponse) {
    }
    rpc StreamAssign (stream AssignRequest) returns (stream AssignResponse) {
    }
    rpc Statistics (StatisticsRequest) returns (StatisticsResponse) {
    }
    rpc CollectionList (CollectionListRequest) returns (CollectionListResponse) {
    }
    rpc CollectionDelete (CollectionDeleteRequest) returns (CollectionDeleteResponse) {
    }
    rpc VolumeList (VolumeListRequest) returns (VolumeListResponse) {
    }
    rpc LookupEcVolume (LookupEcVolumeRequest) returns (LookupEcVolumeResponse) {
    }
    rpc VacuumVolume (VacuumVolumeRequest) returns (VacuumVolumeResponse) {
    }
    rpc DisableVacuum (DisableVacuumRequest) returns (DisableVacuumResponse) {
    }
    rpc EnableVacuum (EnableVacuumRequest) returns (EnableVacuumResponse) {
    }
    rpc VolumeMarkReadonly (VolumeMarkReadonlyRequest) returns (VolumeMarkReadonlyResponse) {
    }
    rpc GetMasterConfiguration (GetMasterConfigurationRequest) returns (GetMasterConfigurationResponse) {
    }
    rpc ListClusterNodes (ListClusterNodesRequest) returns (ListClusterNodesResponse) {
    }
    rpc LeaseAdminToken (LeaseAdminTokenRequest) returns (LeaseAdminTokenResponse) {
    }
    rpc ReleaseAdminToken (ReleaseAdminTokenRequest) returns (ReleaseAdminTokenResponse) {
    }
    rpc Ping (PingRequest) returns (PingResponse) {
    }
    rpc RaftListClusterServers (RaftListClusterServersRequest) returns (RaftListClusterServersResponse) {
    }
    rpc RaftAddServer (RaftAddServerRequest) returns (RaftAddServerResponse) {
    }
    rpc RaftRemoveServer (RaftRemoveServerRequest) returns (RaftRemoveServerResponse) {
|
||||
}
|
||||
rpc RaftLeadershipTransfer (RaftLeadershipTransferRequest) returns (RaftLeadershipTransferResponse) {
|
||||
}
|
||||
rpc VolumeGrow (VolumeGrowRequest) returns (VolumeGrowResponse) {
|
||||
}
|
||||
}
|
||||
|
||||
//////////////////////////////////////////////////
|
||||
|
||||
message DiskTag {
|
||||
uint32 disk_id = 1;
|
||||
repeated string tags = 2;
|
||||
}
|
||||
|
||||
message Heartbeat {
|
||||
string ip = 1;
|
||||
uint32 port = 2;
|
||||
string public_url = 3;
|
||||
uint64 max_file_key = 5;
|
||||
string data_center = 6;
|
||||
string rack = 7;
|
||||
uint32 admin_port = 8;
|
||||
repeated VolumeInformationMessage volumes = 9;
|
||||
// delta volumes
|
||||
repeated VolumeShortInformationMessage new_volumes = 10;
|
||||
repeated VolumeShortInformationMessage deleted_volumes = 11;
|
||||
bool has_no_volumes = 12;
|
||||
|
||||
// erasure coding
|
||||
repeated VolumeEcShardInformationMessage ec_shards = 16;
|
||||
// delta erasure coding shards
|
||||
repeated VolumeEcShardInformationMessage new_ec_shards = 17;
|
||||
repeated VolumeEcShardInformationMessage deleted_ec_shards = 18;
|
||||
bool has_no_ec_shards = 19;
|
||||
|
||||
map<string, uint32> max_volume_counts = 4;
|
||||
uint32 grpc_port = 20;
|
||||
repeated string location_uuids = 21;
|
||||
string id = 22; // volume server id, independent of ip:port for stable identification
|
||||
|
||||
// state flags
|
||||
volume_server_pb.VolumeServerState state = 23;
|
||||
|
||||
repeated DiskTag disk_tags = 24;
|
||||
}
|
||||
|
||||
message HeartbeatResponse {
|
||||
uint64 volume_size_limit = 1;
|
||||
string leader = 2;
|
||||
string metrics_address = 3;
|
||||
uint32 metrics_interval_seconds = 4;
|
||||
repeated StorageBackend storage_backends = 5;
|
||||
repeated string duplicated_uuids = 6;
|
||||
bool preallocate = 7;
|
||||
}
|
||||
|
||||
message VolumeInformationMessage {
|
||||
uint32 id = 1;
|
||||
uint64 size = 2;
|
||||
string collection = 3;
|
||||
uint64 file_count = 4;
|
||||
uint64 delete_count = 5;
|
||||
uint64 deleted_byte_count = 6;
|
||||
bool read_only = 7;
|
||||
uint32 replica_placement = 8;
|
||||
uint32 version = 9;
|
||||
uint32 ttl = 10;
|
||||
uint32 compact_revision = 11;
|
||||
int64 modified_at_second = 12;
|
||||
string remote_storage_name = 13;
|
||||
string remote_storage_key = 14;
|
||||
string disk_type = 15;
|
||||
uint32 disk_id = 16;
|
||||
}
|
||||
|
||||
message VolumeShortInformationMessage {
|
||||
uint32 id = 1;
|
||||
string collection = 3;
|
||||
uint32 replica_placement = 8;
|
||||
uint32 version = 9;
|
||||
uint32 ttl = 10;
|
||||
string disk_type = 15;
|
||||
uint32 disk_id = 16;
|
||||
}
|
||||
|
||||
message VolumeEcShardInformationMessage {
|
||||
uint32 id = 1;
|
||||
string collection = 2;
|
||||
uint32 ec_index_bits = 3;
|
||||
string disk_type = 4;
|
||||
uint64 expire_at_sec = 5; // used to record the destruction time of ec volume
|
||||
uint32 disk_id = 6;
|
||||
repeated int64 shard_sizes = 7; // optimized: sizes for shards in order of set bits in ec_index_bits
|
||||
}
|
||||
|
||||
message StorageBackend {
|
||||
string type = 1;
|
||||
string id = 2;
|
||||
map<string, string> properties = 3;
|
||||
}
|
||||
|
||||
message Empty {
|
||||
}
|
||||
|
||||
message SuperBlockExtra {
|
||||
message ErasureCoding {
|
||||
uint32 data = 1;
|
||||
uint32 parity = 2;
|
||||
repeated uint32 volume_ids = 3;
|
||||
}
|
||||
ErasureCoding erasure_coding = 1;
|
||||
}
|
||||
|
||||
message KeepConnectedRequest {
|
||||
string client_type = 1;
|
||||
string client_address = 3;
|
||||
string version = 4;
|
||||
string filer_group = 5;
|
||||
string data_center = 6;
|
||||
string rack = 7;
|
||||
}
|
||||
|
||||
message VolumeLocation {
|
||||
string url = 1;
|
||||
string public_url = 2;
|
||||
repeated uint32 new_vids = 3;
|
||||
repeated uint32 deleted_vids = 4;
|
||||
string leader = 5; // optional when leader is not itself
|
||||
string data_center = 6; // optional when DataCenter is in use
|
||||
uint32 grpc_port = 7;
|
||||
repeated uint32 new_ec_vids = 8;
|
||||
repeated uint32 deleted_ec_vids = 9;
|
||||
}
|
||||
|
||||
message ClusterNodeUpdate {
|
||||
string node_type = 1;
|
||||
string address = 2;
|
||||
bool is_add = 4;
|
||||
string filer_group = 5;
|
||||
int64 created_at_ns = 6;
|
||||
}
|
||||
|
||||
message KeepConnectedResponse {
|
||||
VolumeLocation volume_location = 1;
|
||||
ClusterNodeUpdate cluster_node_update = 2;
|
||||
}
|
||||
|
||||
message LookupVolumeRequest {
|
||||
repeated string volume_or_file_ids = 1;
|
||||
string collection = 2; // optional, a bit faster if provided.
|
||||
}
|
||||
message LookupVolumeResponse {
|
||||
message VolumeIdLocation {
|
||||
string volume_or_file_id = 1;
|
||||
repeated Location locations = 2;
|
||||
string error = 3;
|
||||
string auth = 4;
|
||||
}
|
||||
repeated VolumeIdLocation volume_id_locations = 1;
|
||||
}
|
||||
|
||||
message Location {
|
||||
string url = 1;
|
||||
string public_url = 2;
|
||||
uint32 grpc_port = 3;
|
||||
string data_center = 4;
|
||||
}
|
||||
|
||||
message AssignRequest {
|
||||
uint64 count = 1;
|
||||
string replication = 2;
|
||||
string collection = 3;
|
||||
string ttl = 4;
|
||||
string data_center = 5;
|
||||
string rack = 6;
|
||||
string data_node = 7;
|
||||
uint32 memory_map_max_size_mb = 8;
|
||||
uint32 writable_volume_count = 9;
|
||||
string disk_type = 10;
|
||||
}
|
||||
|
||||
message VolumeGrowRequest {
|
||||
uint32 writable_volume_count = 1;
|
||||
string replication = 2;
|
||||
string collection = 3;
|
||||
string ttl = 4;
|
||||
string data_center = 5;
|
||||
string rack = 6;
|
||||
string data_node = 7;
|
||||
uint32 memory_map_max_size_mb = 8;
|
||||
string disk_type = 9;
|
||||
}
|
||||
|
||||
message AssignResponse {
|
||||
string fid = 1;
|
||||
uint64 count = 4;
|
||||
string error = 5;
|
||||
string auth = 6;
|
||||
repeated Location replicas = 7;
|
||||
Location location = 8;
|
||||
}
|
||||
|
||||
message StatisticsRequest {
|
||||
string replication = 1;
|
||||
string collection = 2;
|
||||
string ttl = 3;
|
||||
string disk_type = 4;
|
||||
}
|
||||
message StatisticsResponse {
|
||||
uint64 total_size = 4;
|
||||
uint64 used_size = 5;
|
||||
uint64 file_count = 6;
|
||||
}
|
||||
|
||||
//
|
||||
// collection related
|
||||
//
|
||||
message Collection {
|
||||
string name = 1;
|
||||
}
|
||||
message CollectionListRequest {
|
||||
bool include_normal_volumes = 1;
|
||||
bool include_ec_volumes = 2;
|
||||
}
|
||||
message CollectionListResponse {
|
||||
repeated Collection collections = 1;
|
||||
}
|
||||
|
||||
message CollectionDeleteRequest {
|
||||
string name = 1;
|
||||
}
|
||||
message CollectionDeleteResponse {
|
||||
}
|
||||
|
||||
//
|
||||
// volume related
|
||||
//
|
||||
message DiskInfo {
|
||||
string type = 1;
|
||||
int64 volume_count = 2;
|
||||
int64 max_volume_count = 3;
|
||||
int64 free_volume_count = 4;
|
||||
int64 active_volume_count = 5;
|
||||
repeated VolumeInformationMessage volume_infos = 6;
|
||||
repeated VolumeEcShardInformationMessage ec_shard_infos = 7;
|
||||
int64 remote_volume_count = 8;
|
||||
uint32 disk_id = 9;
|
||||
repeated string tags = 10;
|
||||
}
|
||||
message DataNodeInfo {
|
||||
string id = 1;
|
||||
map<string, DiskInfo> diskInfos = 2;
|
||||
uint32 grpc_port = 3;
|
||||
string address = 4; // ip:port for connecting to the volume server
|
||||
}
|
||||
message RackInfo {
|
||||
string id = 1;
|
||||
repeated DataNodeInfo data_node_infos = 2;
|
||||
map<string, DiskInfo> diskInfos = 3;
|
||||
}
|
||||
message DataCenterInfo {
|
||||
string id = 1;
|
||||
repeated RackInfo rack_infos = 2;
|
||||
map<string, DiskInfo> diskInfos = 3;
|
||||
}
|
||||
message TopologyInfo {
|
||||
string id = 1;
|
||||
repeated DataCenterInfo data_center_infos = 2;
|
||||
map<string, DiskInfo> diskInfos = 3;
|
||||
}
|
||||
message VolumeListRequest {
|
||||
}
|
||||
message VolumeListResponse {
|
||||
TopologyInfo topology_info = 1;
|
||||
uint64 volume_size_limit_mb = 2;
|
||||
}
|
||||
|
||||
message LookupEcVolumeRequest {
|
||||
uint32 volume_id = 1;
|
||||
}
|
||||
message LookupEcVolumeResponse {
|
||||
uint32 volume_id = 1;
|
||||
message EcShardIdLocation {
|
||||
uint32 shard_id = 1;
|
||||
repeated Location locations = 2;
|
||||
}
|
||||
repeated EcShardIdLocation shard_id_locations = 2;
|
||||
}
|
||||
|
||||
message VacuumVolumeRequest {
|
||||
float garbage_threshold = 1;
|
||||
uint32 volume_id = 2;
|
||||
string collection = 3;
|
||||
}
|
||||
message VacuumVolumeResponse {
|
||||
}
|
||||
|
||||
message DisableVacuumRequest {
|
||||
}
|
||||
message DisableVacuumResponse {
|
||||
}
|
||||
|
||||
message EnableVacuumRequest {
|
||||
}
|
||||
message EnableVacuumResponse {
|
||||
}
|
||||
|
||||
message VolumeMarkReadonlyRequest {
|
||||
string ip = 1;
|
||||
uint32 port = 2;
|
||||
uint32 volume_id = 4;
|
||||
string collection = 5;
|
||||
uint32 replica_placement = 6;
|
||||
uint32 version = 7;
|
||||
uint32 ttl = 8;
|
||||
string disk_type = 9;
|
||||
bool is_readonly = 10;
|
||||
}
|
||||
message VolumeMarkReadonlyResponse {
|
||||
}
|
||||
|
||||
message GetMasterConfigurationRequest {
|
||||
}
|
||||
message GetMasterConfigurationResponse {
|
||||
string metrics_address = 1;
|
||||
uint32 metrics_interval_seconds = 2;
|
||||
repeated StorageBackend storage_backends = 3;
|
||||
string default_replication = 4;
|
||||
string leader = 5;
|
||||
uint32 volume_size_limit_m_b = 6;
|
||||
bool volume_preallocate = 7;
|
||||
// MIGRATION: fields 8-9 help migrate master.toml [master.maintenance] to admin script plugin. Remove after March 2027.
|
||||
string maintenance_scripts = 8;
|
||||
uint32 maintenance_sleep_minutes = 9;
|
||||
}
|
||||
|
||||
message ListClusterNodesRequest {
|
||||
string client_type = 1;
|
||||
string filer_group = 2;
|
||||
int32 limit = 4;
|
||||
}
|
||||
message ListClusterNodesResponse {
|
||||
message ClusterNode {
|
||||
string address = 1;
|
||||
string version = 2;
|
||||
int64 created_at_ns = 4;
|
||||
string data_center = 5;
|
||||
string rack = 6;
|
||||
}
|
||||
repeated ClusterNode cluster_nodes = 1;
|
||||
}
|
||||
|
||||
message LeaseAdminTokenRequest {
|
||||
int64 previous_token = 1;
|
||||
int64 previous_lock_time = 2;
|
||||
string lock_name = 3;
|
||||
string client_name = 4;
|
||||
string message = 5;
|
||||
}
|
||||
message LeaseAdminTokenResponse {
|
||||
int64 token = 1;
|
||||
int64 lock_ts_ns = 2;
|
||||
}
|
||||
|
||||
message ReleaseAdminTokenRequest {
|
||||
int64 previous_token = 1;
|
||||
int64 previous_lock_time = 2;
|
||||
string lock_name = 3;
|
||||
}
|
||||
message ReleaseAdminTokenResponse {
|
||||
}
|
||||
|
||||
message PingRequest {
|
||||
string target = 1; // default to ping itself
|
||||
string target_type = 2;
|
||||
}
|
||||
message PingResponse {
|
||||
int64 start_time_ns = 1;
|
||||
int64 remote_time_ns = 2;
|
||||
int64 stop_time_ns = 3;
|
||||
}
|
||||
|
||||
message RaftAddServerRequest {
|
||||
string id = 1;
|
||||
string address = 2;
|
||||
bool voter = 3;
|
||||
}
|
||||
message RaftAddServerResponse {
|
||||
}
|
||||
|
||||
message RaftRemoveServerRequest {
|
||||
string id = 1;
|
||||
bool force = 2;
|
||||
}
|
||||
message RaftRemoveServerResponse {
|
||||
}
|
||||
|
||||
message RaftListClusterServersRequest {
|
||||
}
|
||||
message RaftListClusterServersResponse {
|
||||
message ClusterServers {
|
||||
string id = 1;
|
||||
string address = 2;
|
||||
string suffrage = 3;
|
||||
bool isLeader = 4;
|
||||
}
|
||||
repeated ClusterServers cluster_servers = 1;
|
||||
}
|
||||
|
||||
message RaftLeadershipTransferRequest {
|
||||
string target_id = 1; // Optional: target server ID. If empty, transfers to any eligible follower
|
||||
string target_address = 2; // Optional: target server address. Required if target_id is specified
|
||||
}
|
||||
message RaftLeadershipTransferResponse {
|
||||
string previous_leader = 1;
|
||||
string new_leader = 2;
|
||||
}
|
||||
|
||||
message VolumeGrowResponse {
|
||||
}
|
||||
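The `shard_sizes` field of `VolumeEcShardInformationMessage` is ordered by the set bits of `ec_index_bits`, as its comment notes. A minimal sketch of decoding that pairing, using plain Rust in place of the prost-generated type (`shard_sizes_by_id` is a hypothetical helper, not from the repo):

```rust
// Pair each shard_sizes entry with the shard id whose bit is set in
// ec_index_bits, walking shard ids in ascending order.
fn shard_sizes_by_id(ec_index_bits: u32, shard_sizes: &[i64]) -> Vec<(u32, i64)> {
    let mut out = Vec::new();
    let mut size_idx = 0;
    for shard_id in 0..32 {
        if ec_index_bits & (1 << shard_id) != 0 {
            if let Some(&size) = shard_sizes.get(size_idx) {
                out.push((shard_id, size));
            }
            size_idx += 1;
        }
    }
    out
}

fn main() {
    // bits 0, 2, and 5 set: the three sizes map to shards 0, 2, 5 in order
    let pairs = shard_sizes_by_id(0b100101, &[100, 200, 300]);
    assert_eq!(pairs, vec![(0, 100), (2, 200), (5, 300)]);
}
```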
seaweed-volume/proto/remote.proto (Normal file, 76 lines)
@@ -0,0 +1,76 @@
syntax = "proto3";

package remote_pb;

option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/remote_pb";
option java_package = "seaweedfs.client";
option java_outer_classname = "FilerProto";

/////////////////////////
// Remote Storage related
/////////////////////////
message RemoteConf {
  string type = 1;
  string name = 2;
  string s3_access_key = 4;
  string s3_secret_key = 5;
  string s3_region = 6;
  string s3_endpoint = 7;
  string s3_storage_class = 8;
  bool s3_force_path_style = 9;
  bool s3_support_tagging = 13;
  bool s3_v4_signature = 11;

  string gcs_google_application_credentials = 10;
  string gcs_project_id = 12;

  string azure_account_name = 15;
  string azure_account_key = 16;

  string backblaze_key_id = 20;
  string backblaze_application_key = 21;
  string backblaze_endpoint = 22;
  string backblaze_region = 23;

  string aliyun_access_key = 25;
  string aliyun_secret_key = 26;
  string aliyun_endpoint = 27;
  string aliyun_region = 28;

  string tencent_secret_id = 30;
  string tencent_secret_key = 31;
  string tencent_endpoint = 32;

  string baidu_access_key = 35;
  string baidu_secret_key = 36;
  string baidu_endpoint = 37;
  string baidu_region = 38;

  string wasabi_access_key = 40;
  string wasabi_secret_key = 41;
  string wasabi_endpoint = 42;
  string wasabi_region = 43;

  string filebase_access_key = 60;
  string filebase_secret_key = 61;
  string filebase_endpoint = 62;

  string storj_access_key = 65;
  string storj_secret_key = 66;
  string storj_endpoint = 67;

  string contabo_access_key = 68;
  string contabo_secret_key = 69;
  string contabo_endpoint = 70;
  string contabo_region = 71;
}

message RemoteStorageMapping {
  map<string,RemoteStorageLocation> mappings = 1;
  string primary_bucket_storage_name = 2;
}
message RemoteStorageLocation {
  string name = 1;
  string bucket = 2;
  string path = 3;
}
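One way `RemoteStorageMapping.mappings` can be consumed is to resolve a directory to its `RemoteStorageLocation`. A hedged sketch, assuming longest-prefix matching over the map keys and using plain structs in place of the prost-generated `remote_pb` types (`resolve` is a hypothetical helper, not from the repo):

```rust
use std::collections::HashMap;

// Stand-in for the generated remote_pb::RemoteStorageLocation.
#[derive(Clone, Debug, PartialEq)]
struct RemoteStorageLocation {
    name: String,
    bucket: String,
    path: String,
}

// Resolve a directory to the location whose mapped prefix is the longest
// match (assumption: longest-prefix wins when prefixes nest).
fn resolve(
    mappings: &HashMap<String, RemoteStorageLocation>,
    dir: &str,
) -> Option<RemoteStorageLocation> {
    mappings
        .iter()
        .filter(|(prefix, _)| dir.starts_with(prefix.as_str()))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(_, loc)| loc.clone())
}

fn main() {
    let mut mappings = HashMap::new();
    mappings.insert(
        "/buckets/photos".to_string(),
        RemoteStorageLocation {
            name: "s3_1".to_string(),
            bucket: "photos".to_string(),
            path: "/".to_string(),
        },
    );
    let loc = resolve(&mappings, "/buckets/photos/2024").unwrap();
    assert_eq!(loc.bucket, "photos");
    assert_eq!(resolve(&mappings, "/other"), None);
}
```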
seaweed-volume/proto/volume_server.proto (Normal file, 759 lines)
@@ -0,0 +1,759 @@
syntax = "proto3";

package volume_server_pb;
option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb";

import "remote.proto";

//////////////////////////////////////////////////

// Persistent state for volume servers.
message VolumeServerState {
  // whether the server is in maintenance (i.e. read-only) mode.
  bool maintenance = 1;
  // incremental version counter
  uint32 version = 2;
}

//////////////////////////////////////////////////

service VolumeServer {
  // Experts only: takes multiple fid parameters. This function does not propagate deletes to replicas.
  rpc BatchDelete (BatchDeleteRequest) returns (BatchDeleteResponse) {
  }

  rpc VacuumVolumeCheck (VacuumVolumeCheckRequest) returns (VacuumVolumeCheckResponse) {
  }
  rpc VacuumVolumeCompact (VacuumVolumeCompactRequest) returns (stream VacuumVolumeCompactResponse) {
  }
  rpc VacuumVolumeCommit (VacuumVolumeCommitRequest) returns (VacuumVolumeCommitResponse) {
  }
  rpc VacuumVolumeCleanup (VacuumVolumeCleanupRequest) returns (VacuumVolumeCleanupResponse) {
  }

  rpc DeleteCollection (DeleteCollectionRequest) returns (DeleteCollectionResponse) {
  }
  rpc AllocateVolume (AllocateVolumeRequest) returns (AllocateVolumeResponse) {
  }

  rpc VolumeSyncStatus (VolumeSyncStatusRequest) returns (VolumeSyncStatusResponse) {
  }
  rpc VolumeIncrementalCopy (VolumeIncrementalCopyRequest) returns (stream VolumeIncrementalCopyResponse) {
  }

  rpc VolumeMount (VolumeMountRequest) returns (VolumeMountResponse) {
  }
  rpc VolumeUnmount (VolumeUnmountRequest) returns (VolumeUnmountResponse) {
  }
  rpc VolumeDelete (VolumeDeleteRequest) returns (VolumeDeleteResponse) {
  }
  rpc VolumeMarkReadonly (VolumeMarkReadonlyRequest) returns (VolumeMarkReadonlyResponse) {
  }
  rpc VolumeMarkWritable (VolumeMarkWritableRequest) returns (VolumeMarkWritableResponse) {
  }
  rpc VolumeConfigure (VolumeConfigureRequest) returns (VolumeConfigureResponse) {
  }
  rpc VolumeStatus (VolumeStatusRequest) returns (VolumeStatusResponse) {
  }

  rpc GetState (GetStateRequest) returns (GetStateResponse) {
  }
  rpc SetState (SetStateRequest) returns (SetStateResponse) {
  }

  // copy the .idx .dat files, and mount this volume
  rpc VolumeCopy (VolumeCopyRequest) returns (stream VolumeCopyResponse) {
  }
  rpc ReadVolumeFileStatus (ReadVolumeFileStatusRequest) returns (ReadVolumeFileStatusResponse) {
  }
  rpc CopyFile (CopyFileRequest) returns (stream CopyFileResponse) {
  }
  rpc ReceiveFile (stream ReceiveFileRequest) returns (ReceiveFileResponse) {
  }

  rpc ReadNeedleBlob (ReadNeedleBlobRequest) returns (ReadNeedleBlobResponse) {
  }
  rpc ReadNeedleMeta (ReadNeedleMetaRequest) returns (ReadNeedleMetaResponse) {
  }
  rpc WriteNeedleBlob (WriteNeedleBlobRequest) returns (WriteNeedleBlobResponse) {
  }
  rpc ReadAllNeedles (ReadAllNeedlesRequest) returns (stream ReadAllNeedlesResponse) {
  }

  rpc VolumeTailSender (VolumeTailSenderRequest) returns (stream VolumeTailSenderResponse) {
  }
  rpc VolumeTailReceiver (VolumeTailReceiverRequest) returns (VolumeTailReceiverResponse) {
  }

  // erasure coding
  rpc VolumeEcShardsGenerate (VolumeEcShardsGenerateRequest) returns (VolumeEcShardsGenerateResponse) {
  }
  rpc VolumeEcShardsRebuild (VolumeEcShardsRebuildRequest) returns (VolumeEcShardsRebuildResponse) {
  }
  rpc VolumeEcShardsCopy (VolumeEcShardsCopyRequest) returns (VolumeEcShardsCopyResponse) {
  }
  rpc VolumeEcShardsDelete (VolumeEcShardsDeleteRequest) returns (VolumeEcShardsDeleteResponse) {
  }
  rpc VolumeEcShardsMount (VolumeEcShardsMountRequest) returns (VolumeEcShardsMountResponse) {
  }
  rpc VolumeEcShardsUnmount (VolumeEcShardsUnmountRequest) returns (VolumeEcShardsUnmountResponse) {
  }
  rpc VolumeEcShardRead (VolumeEcShardReadRequest) returns (stream VolumeEcShardReadResponse) {
  }
  rpc VolumeEcBlobDelete (VolumeEcBlobDeleteRequest) returns (VolumeEcBlobDeleteResponse) {
  }
  rpc VolumeEcShardsToVolume (VolumeEcShardsToVolumeRequest) returns (VolumeEcShardsToVolumeResponse) {
  }
  rpc VolumeEcShardsInfo (VolumeEcShardsInfoRequest) returns (VolumeEcShardsInfoResponse) {
  }

  // tiered storage
  rpc VolumeTierMoveDatToRemote (VolumeTierMoveDatToRemoteRequest) returns (stream VolumeTierMoveDatToRemoteResponse) {
  }
  rpc VolumeTierMoveDatFromRemote (VolumeTierMoveDatFromRemoteRequest) returns (stream VolumeTierMoveDatFromRemoteResponse) {
  }

  rpc VolumeServerStatus (VolumeServerStatusRequest) returns (VolumeServerStatusResponse) {
  }
  rpc VolumeServerLeave (VolumeServerLeaveRequest) returns (VolumeServerLeaveResponse) {
  }

  // remote storage
  rpc FetchAndWriteNeedle (FetchAndWriteNeedleRequest) returns (FetchAndWriteNeedleResponse) {
  }

  // scrubbing
  rpc ScrubVolume (ScrubVolumeRequest) returns (ScrubVolumeResponse) {
  }
  rpc ScrubEcVolume (ScrubEcVolumeRequest) returns (ScrubEcVolumeResponse) {
  }

  // <experimental> query
  rpc Query (QueryRequest) returns (stream QueriedStripe) {
  }

  rpc VolumeNeedleStatus (VolumeNeedleStatusRequest) returns (VolumeNeedleStatusResponse) {
  }

  rpc Ping (PingRequest) returns (PingResponse) {
  }

}

//////////////////////////////////////////////////

message BatchDeleteRequest {
  repeated string file_ids = 1;
  bool skip_cookie_check = 2;
}

message BatchDeleteResponse {
  repeated DeleteResult results = 1;
}
message DeleteResult {
  string file_id = 1;
  int32 status = 2;
  string error = 3;
  uint32 size = 4;
  uint32 version = 5;
}

message Empty {
}

message VacuumVolumeCheckRequest {
  uint32 volume_id = 1;
}
message VacuumVolumeCheckResponse {
  double garbage_ratio = 1;
}

message VacuumVolumeCompactRequest {
  uint32 volume_id = 1;
  int64 preallocate = 2;
}
message VacuumVolumeCompactResponse {
  int64 processed_bytes = 1;
  float load_avg_1m = 2;
}

message VacuumVolumeCommitRequest {
  uint32 volume_id = 1;
}
message VacuumVolumeCommitResponse {
  bool is_read_only = 1;
  uint64 volume_size = 2;
}

message VacuumVolumeCleanupRequest {
  uint32 volume_id = 1;
}
message VacuumVolumeCleanupResponse {
}

message DeleteCollectionRequest {
  string collection = 1;
}
message DeleteCollectionResponse {
}

message AllocateVolumeRequest {
  uint32 volume_id = 1;
  string collection = 2;
  int64 preallocate = 3;
  string replication = 4;
  string ttl = 5;
  uint32 memory_map_max_size_mb = 6;
  string disk_type = 7;
  uint32 version = 8;
}
message AllocateVolumeResponse {
}

message VolumeSyncStatusRequest {
  uint32 volume_id = 1;
}
message VolumeSyncStatusResponse {
  uint32 volume_id = 1;
  string collection = 2;
  string replication = 4;
  string ttl = 5;
  uint64 tail_offset = 6;
  uint32 compact_revision = 7;
  uint64 idx_file_size = 8;
  uint32 version = 9;
}

message VolumeIncrementalCopyRequest {
  uint32 volume_id = 1;
  uint64 since_ns = 2;
}
message VolumeIncrementalCopyResponse {
  bytes file_content = 1;
}

message VolumeMountRequest {
  uint32 volume_id = 1;
}
message VolumeMountResponse {
}

message VolumeUnmountRequest {
  uint32 volume_id = 1;
}
message VolumeUnmountResponse {
}

message VolumeDeleteRequest {
  uint32 volume_id = 1;
  bool only_empty = 2;
}
message VolumeDeleteResponse {
}

message VolumeMarkReadonlyRequest {
  uint32 volume_id = 1;
  bool persist = 2;
}
message VolumeMarkReadonlyResponse {
}

message VolumeMarkWritableRequest {
  uint32 volume_id = 1;
}
message VolumeMarkWritableResponse {
}

message VolumeConfigureRequest {
  uint32 volume_id = 1;
  string replication = 2;
}
message VolumeConfigureResponse {
  string error = 1;
}

message VolumeStatusRequest {
  uint32 volume_id = 1;
}
message VolumeStatusResponse {
  bool is_read_only = 1;
  uint64 volume_size = 2;
  uint64 file_count = 3;
  uint64 file_deleted_count = 4;
}

message GetStateRequest {
}
message GetStateResponse {
  VolumeServerState state = 1;
}

message SetStateRequest {
  // SetState updates *all* volume server flags at once. Retrieve state with GetState(),
  // modify individual flags as required, then call this RPC to update.
  VolumeServerState state = 1;
}
message SetStateResponse {
  VolumeServerState state = 1;
}

message VolumeCopyRequest {
  uint32 volume_id = 1;
  string collection = 2;
  string replication = 3;
  string ttl = 4;
  string source_data_node = 5;
  string disk_type = 6;
  int64 io_byte_per_second = 7;
}
message VolumeCopyResponse {
  uint64 last_append_at_ns = 1;
  int64 processed_bytes = 2;
}

message CopyFileRequest {
  uint32 volume_id = 1;
  string ext = 2;
  uint32 compaction_revision = 3;
  uint64 stop_offset = 4;
  string collection = 5;
  bool is_ec_volume = 6;
  bool ignore_source_file_not_found = 7;
}
message CopyFileResponse {
  bytes file_content = 1;
  int64 modified_ts_ns = 2;
}

message ReceiveFileRequest {
  oneof data {
    ReceiveFileInfo info = 1;
    bytes file_content = 2;
  }
}

message ReceiveFileInfo {
  uint32 volume_id = 1;
  string ext = 2;
  string collection = 3;
  bool is_ec_volume = 4;
  uint32 shard_id = 5;
  uint64 file_size = 6;
}

message ReceiveFileResponse {
  uint64 bytes_written = 1;
  string error = 2;
}

message ReadNeedleBlobRequest {
  uint32 volume_id = 1;
  int64 offset = 3; // actual offset
  int32 size = 4;
}
message ReadNeedleBlobResponse {
  bytes needle_blob = 1;
}

message ReadNeedleMetaRequest {
  uint32 volume_id = 1;
  uint64 needle_id = 2;
  int64 offset = 3; // actual offset
  int32 size = 4;
}
message ReadNeedleMetaResponse {
  uint32 cookie = 1;
  uint64 last_modified = 2;
  uint32 crc = 3;
  string ttl = 4;
  uint64 append_at_ns = 5;
}

message WriteNeedleBlobRequest {
  uint32 volume_id = 1;
  uint64 needle_id = 2;
  int32 size = 3;
  bytes needle_blob = 4;
}
message WriteNeedleBlobResponse {
}

message ReadAllNeedlesRequest {
  repeated uint32 volume_ids = 1;
}
message ReadAllNeedlesResponse {
  uint32 volume_id = 1;
  uint64 needle_id = 2;
  uint32 cookie = 3;
  bytes needle_blob = 5;
  bool needle_blob_compressed = 6;
  uint64 last_modified = 7;
  uint32 crc = 8;
  bytes name = 9;
  bytes mime = 10;
}

message VolumeTailSenderRequest {
  uint32 volume_id = 1;
  uint64 since_ns = 2;
  uint32 idle_timeout_seconds = 3;
}
message VolumeTailSenderResponse {
  bytes needle_header = 1;
  bytes needle_body = 2;
  bool is_last_chunk = 3;
  uint32 version = 4;
}

message VolumeTailReceiverRequest {
  uint32 volume_id = 1;
  uint64 since_ns = 2;
  uint32 idle_timeout_seconds = 3;
  string source_volume_server = 4;
}
message VolumeTailReceiverResponse {
}

message VolumeEcShardsGenerateRequest {
  uint32 volume_id = 1;
  string collection = 2;
}
message VolumeEcShardsGenerateResponse {
}

message VolumeEcShardsRebuildRequest {
  uint32 volume_id = 1;
  string collection = 2;
}
message VolumeEcShardsRebuildResponse {
  repeated uint32 rebuilt_shard_ids = 1;
}

message VolumeEcShardsCopyRequest {
  uint32 volume_id = 1;
  string collection = 2;
  repeated uint32 shard_ids = 3;
  bool copy_ecx_file = 4;
  string source_data_node = 5;
  bool copy_ecj_file = 6;
  bool copy_vif_file = 7;
  uint32 disk_id = 8; // Target disk ID for storing EC shards
}
message VolumeEcShardsCopyResponse {
}

message VolumeEcShardsDeleteRequest {
  uint32 volume_id = 1;
  string collection = 2;
  repeated uint32 shard_ids = 3;
}
message VolumeEcShardsDeleteResponse {
}

message VolumeEcShardsMountRequest {
  uint32 volume_id = 1;
  string collection = 2;
  repeated uint32 shard_ids = 3;
}
message VolumeEcShardsMountResponse {
}

message VolumeEcShardsUnmountRequest {
  uint32 volume_id = 1;
  repeated uint32 shard_ids = 3;
}
message VolumeEcShardsUnmountResponse {
}

message VolumeEcShardReadRequest {
  uint32 volume_id = 1;
  uint32 shard_id = 2;
  int64 offset = 3;
  int64 size = 4;
  uint64 file_key = 5;
}
message VolumeEcShardReadResponse {
  bytes data = 1;
  bool is_deleted = 2;
}

message VolumeEcBlobDeleteRequest {
  uint32 volume_id = 1;
  string collection = 2;
  uint64 file_key = 3;
  uint32 version = 4;
}
message VolumeEcBlobDeleteResponse {
}

message VolumeEcShardsToVolumeRequest {
  uint32 volume_id = 1;
  string collection = 2;
}
message VolumeEcShardsToVolumeResponse {
}

message VolumeEcShardsInfoRequest {
  uint32 volume_id = 1;
}
message VolumeEcShardsInfoResponse {
  repeated EcShardInfo ec_shard_infos = 1;
  uint64 volume_size = 2;
  uint64 file_count = 3;
  uint64 file_deleted_count = 4;
}

message EcShardInfo {
  uint32 shard_id = 1;
  int64 size = 2;
  string collection = 3;
  uint32 volume_id = 4;
}

message ReadVolumeFileStatusRequest {
  uint32 volume_id = 1;
}
message ReadVolumeFileStatusResponse {
  uint32 volume_id = 1;
  uint64 idx_file_timestamp_seconds = 2;
  uint64 idx_file_size = 3;
  uint64 dat_file_timestamp_seconds = 4;
  uint64 dat_file_size = 5;
  uint64 file_count = 6;
  uint32 compaction_revision = 7;
  string collection = 8;
  string disk_type = 9;
  VolumeInfo volume_info = 10;
  uint32 version = 11;
}

message DiskStatus {
  string dir = 1;
  uint64 all = 2;
  uint64 used = 3;
  uint64 free = 4;
  float percent_free = 5;
  float percent_used = 6;
  string disk_type = 7;
}

message MemStatus {
  int32 goroutines = 1;
  uint64 all = 2;
  uint64 used = 3;
  uint64 free = 4;
  uint64 self = 5;
  uint64 heap = 6;
  uint64 stack = 7;
}

// tiered storage on volume servers
message RemoteFile {
  string backend_type = 1;
  string backend_id = 2;
  string key = 3;
  uint64 offset = 4;
  uint64 file_size = 5;
  uint64 modified_time = 6;
  string extension = 7;
}
message VolumeInfo {
  repeated RemoteFile files = 1;
  uint32 version = 2;
  string replication = 3;
  uint32 bytes_offset = 4;
  int64 dat_file_size = 5; // store the original dat file size
  uint64 expire_at_sec = 6; // expiration time of ec volume
  bool read_only = 7;
  EcShardConfig ec_shard_config = 8; // EC shard configuration (optional, null = use default 10+4)
}

// EcShardConfig specifies erasure coding shard configuration
message EcShardConfig {
  uint32 data_shards = 1; // Number of data shards (e.g., 10)
  uint32 parity_shards = 2; // Number of parity shards (e.g., 4)
}
message OldVersionVolumeInfo {
|
||||
repeated RemoteFile files = 1;
|
||||
uint32 version = 2;
|
||||
string replication = 3;
|
||||
uint32 BytesOffset = 4;
|
||||
int64 dat_file_size = 5; // store the original dat file size
|
||||
uint64 DestroyTime = 6; // expiration time of ec volume
|
||||
bool read_only = 7;
|
||||
}
|
||||
|
||||
// tiered storage
|
||||
message VolumeTierMoveDatToRemoteRequest {
|
||||
uint32 volume_id = 1;
|
||||
string collection = 2;
|
||||
string destination_backend_name = 3;
|
||||
bool keep_local_dat_file = 4;
|
||||
}
|
||||
message VolumeTierMoveDatToRemoteResponse {
|
||||
int64 processed = 1;
|
||||
float processedPercentage = 2;
|
||||
}
|
||||
|
||||
message VolumeTierMoveDatFromRemoteRequest {
|
||||
uint32 volume_id = 1;
|
||||
string collection = 2;
|
||||
bool keep_remote_dat_file = 3;
|
||||
}
|
||||
message VolumeTierMoveDatFromRemoteResponse {
|
||||
int64 processed = 1;
|
||||
float processedPercentage = 2;
|
||||
}
|
||||
|
||||
message VolumeServerStatusRequest {
|
||||
|
||||
}
|
||||
message VolumeServerStatusResponse {
|
||||
repeated DiskStatus disk_statuses = 1;
|
||||
MemStatus memory_status = 2;
|
||||
string version = 3;
|
||||
string data_center = 4;
|
||||
string rack = 5;
|
||||
VolumeServerState state = 6;
|
||||
}
|
||||
|
||||
message VolumeServerLeaveRequest {
|
||||
}
|
||||
message VolumeServerLeaveResponse {
|
||||
}
|
||||
|
||||
// remote storage
|
||||
message FetchAndWriteNeedleRequest {
|
||||
uint32 volume_id = 1;
|
||||
uint64 needle_id = 2;
|
||||
uint32 cookie = 3;
|
||||
int64 offset = 4;
|
||||
int64 size = 5;
|
||||
message Replica {
|
||||
string url = 1;
|
||||
string public_url = 2;
|
||||
int32 grpc_port = 3;
|
||||
}
|
||||
repeated Replica replicas = 6;
|
||||
string auth = 7;
|
||||
// remote conf
|
||||
remote_pb.RemoteConf remote_conf = 15;
|
||||
remote_pb.RemoteStorageLocation remote_location = 16;
|
||||
}
|
||||
message FetchAndWriteNeedleResponse {
|
||||
string e_tag = 1;
|
||||
}
|
||||
|
||||
enum VolumeScrubMode {
|
||||
UNKNOWN = 0;
|
||||
INDEX = 1;
|
||||
FULL = 2;
|
||||
LOCAL = 3;
|
||||
}
|
||||
|
||||
message ScrubVolumeRequest {
|
||||
VolumeScrubMode mode = 1;
|
||||
// optional list of volume IDs to scrub. if empty, all volumes for the server are scrubbed.
|
||||
repeated uint32 volume_ids = 2;
|
||||
bool mark_broken_volumes_readonly = 3;
|
||||
}
|
||||
message ScrubVolumeResponse {
|
||||
uint64 total_volumes = 1;
|
||||
uint64 total_files = 2;
|
||||
repeated uint32 broken_volume_ids = 3;
|
||||
repeated string details = 4;
|
||||
}
|
||||
|
||||
message ScrubEcVolumeRequest {
|
||||
VolumeScrubMode mode = 1;
|
||||
// optional list of volume IDs to scrub. if empty, all EC volumes for the server are scrubbed.
|
||||
repeated uint32 volume_ids = 2;
|
||||
}
|
||||
message ScrubEcVolumeResponse {
|
||||
uint64 total_volumes = 1;
|
||||
uint64 total_files = 2;
|
||||
repeated uint32 broken_volume_ids = 3;
|
||||
repeated EcShardInfo broken_shard_infos = 4;
|
||||
repeated string details = 5;
|
||||
}
|
||||
|
||||
// select on volume servers
|
||||
message QueryRequest {
|
||||
repeated string selections = 1;
|
||||
repeated string from_file_ids = 2;
|
||||
message Filter {
|
||||
string field = 1;
|
||||
string operand = 2;
|
||||
string value = 3;
|
||||
}
|
||||
Filter filter = 3;
|
||||
|
||||
message InputSerialization {
|
||||
// NONE | GZIP | BZIP2
|
||||
string compression_type = 1;
|
||||
message CSVInput {
|
||||
string file_header_info = 1; // Valid values: NONE | USE | IGNORE
|
||||
string record_delimiter = 2; // Default: \n
|
||||
string field_delimiter = 3; // Default: ,
|
||||
string quote_character = 4; // Default: "
|
||||
string quote_escape_character = 5; // Default: "
|
||||
string comments = 6; // Default: #
|
||||
// If true, records might contain record delimiters within quote characters
|
||||
bool allow_quoted_record_delimiter = 7; // default False.
|
||||
}
|
||||
message JSONInput {
|
||||
string type = 1; // Valid values: DOCUMENT | LINES
|
||||
}
|
||||
message ParquetInput {
|
||||
}
|
||||
|
||||
CSVInput csv_input = 2;
|
||||
JSONInput json_input = 3;
|
||||
ParquetInput parquet_input = 4;
|
||||
}
|
||||
InputSerialization input_serialization = 4;
|
||||
|
||||
message OutputSerialization {
|
||||
message CSVOutput {
|
||||
string quote_fields = 1; // Valid values: ALWAYS | ASNEEDED
|
||||
string record_delimiter = 2; // Default: \n
|
||||
string field_delimiter = 3; // Default: ,
|
||||
string quote_character = 4; // Default: "
|
||||
string quote_escape_character = 5; // Default: "
|
||||
}
|
||||
message JSONOutput {
|
||||
string record_delimiter = 1;
|
||||
}
|
||||
|
||||
CSVOutput csv_output = 2;
|
||||
JSONOutput json_output = 3;
|
||||
}
|
||||
|
||||
OutputSerialization output_serialization = 5;
|
||||
}
|
||||
message QueriedStripe {
|
||||
bytes records = 1;
|
||||
}
|
||||
|
||||
message VolumeNeedleStatusRequest {
|
||||
uint32 volume_id = 1;
|
||||
uint64 needle_id = 2;
|
||||
}
|
||||
message VolumeNeedleStatusResponse {
|
||||
uint64 needle_id = 1;
|
||||
uint32 cookie = 2;
|
||||
uint32 size = 3;
|
||||
uint64 last_modified = 4;
|
||||
uint32 crc = 5;
|
||||
string ttl = 6;
|
||||
}
|
||||
|
||||
message PingRequest {
|
||||
string target = 1; // default to ping itself
|
||||
string target_type = 2;
|
||||
}
|
||||
message PingResponse {
|
||||
int64 start_time_ns = 1;
|
||||
int64 remote_time_ns = 2;
|
||||
int64 stop_time_ns = 3;
|
||||
}
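The DiskStatus message above carries both raw byte counters and derived float percentages (the changelist notes matching Go's "disk pct precision"). As a rough sketch only, not the server's actual code, the percent fields can be derived from the raw counters like this, with the f32 rounding that a proto `float` implies (the `fill_percentages` helper and zero-guard are illustrative assumptions):

```rust
/// Hypothetical mirror of the DiskStatus proto fields (sketch only).
struct DiskStatus {
    all: u64,
    used: u64,
    free: u64,
    percent_free: f32,
    percent_used: f32,
}

/// Derive the percentage fields from the raw byte counters.
fn fill_percentages(all: u64, used: u64, free: u64) -> DiskStatus {
    // Guard against division by zero on an empty or unprobed disk.
    let (pf, pu) = if all == 0 {
        (0.0, 0.0)
    } else {
        (
            // Compute in f64, then narrow to the proto's float (f32).
            (free as f64 * 100.0 / all as f64) as f32,
            (used as f64 * 100.0 / all as f64) as f32,
        )
    };
    DiskStatus { all, used, free, percent_free: pf, percent_used: pu }
}

fn main() {
    let s = fill_percentages(1000, 250, 750);
    assert_eq!(s.percent_used, 25.0);
    assert_eq!(s.percent_free, 75.0);
    println!("used={}% free={}%", s.percent_used, s.percent_free);
}
```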

seaweed-volume/src/config.rs (new file, 1697 lines)
File diff suppressed because it is too large

seaweed-volume/src/images.rs (new file, 275 lines)
@@ -0,0 +1,275 @@
//! JPEG EXIF orientation auto-fix, matching Go's `FixJpgOrientation`.
//!
//! Reads the EXIF orientation tag from JPEG data and rotates/flips the image
//! to normalize it to orientation 1 (top-left). If EXIF parsing fails or
//! orientation is already normal, returns the original data unchanged.

use std::io::Cursor;

use image::{DynamicImage, GenericImageView, ImageFormat, RgbaImage};

/// EXIF orientation tag values.
/// See: <http://sylvana.net/jpegcrop/exif_orientation.html>
const TOP_LEFT_SIDE: u32 = 1;
const TOP_RIGHT_SIDE: u32 = 2;
const BOTTOM_RIGHT_SIDE: u32 = 3;
const BOTTOM_LEFT_SIDE: u32 = 4;
const LEFT_SIDE_TOP: u32 = 5;
const RIGHT_SIDE_TOP: u32 = 6;
const RIGHT_SIDE_BOTTOM: u32 = 7;
const LEFT_SIDE_BOTTOM: u32 = 8;

/// Fix JPEG orientation based on EXIF data.
///
/// Reads the EXIF orientation tag and applies the appropriate rotation/flip
/// to normalize the image to orientation 1 (top-left). Re-encodes as JPEG.
///
/// Returns the original data unchanged if:
/// - EXIF data cannot be parsed
/// - No orientation tag is present
/// - Orientation is already 1 (normal)
/// - Image decoding or re-encoding fails
pub fn fix_jpg_orientation(data: &[u8]) -> Vec<u8> {
    // Parse EXIF data
    let orientation = match read_exif_orientation(data) {
        Some(o) => o,
        None => return data.to_vec(),
    };

    // Orientation 1 means normal; no transformation needed
    if orientation == TOP_LEFT_SIDE {
        return data.to_vec();
    }

    // Determine rotation angle and flip mode
    let (angle, flip_horizontal) = match orientation {
        TOP_RIGHT_SIDE => (0, true),
        BOTTOM_RIGHT_SIDE => (180, false),
        BOTTOM_LEFT_SIDE => (180, true),
        LEFT_SIDE_TOP => (-90, true),
        RIGHT_SIDE_TOP => (-90, false),
        RIGHT_SIDE_BOTTOM => (90, true),
        LEFT_SIDE_BOTTOM => (90, false),
        _ => return data.to_vec(),
    };

    // Decode the image
    let src_image = match image::load_from_memory_with_format(data, ImageFormat::Jpeg) {
        Ok(img) => img,
        Err(_) => return data.to_vec(),
    };

    // Apply rotation then flip (matching Go's flip(rotate(img, angle), flipMode))
    let transformed = flip_horizontal_if(rotate(src_image, angle), flip_horizontal);

    // Re-encode as JPEG
    let mut buf = Cursor::new(Vec::new());
    match transformed.write_to(&mut buf, ImageFormat::Jpeg) {
        Ok(_) => buf.into_inner(),
        Err(_) => data.to_vec(),
    }
}

/// Read the EXIF orientation tag from JPEG data.
/// Returns None if EXIF cannot be parsed or the orientation tag is not present.
fn read_exif_orientation(data: &[u8]) -> Option<u32> {
    let exif_reader = exif::Reader::new();
    let mut cursor = Cursor::new(data);
    let exif_data = exif_reader.read_from_container(&mut cursor).ok()?;

    let orientation_field = exif_data.get_field(exif::Tag::Orientation, exif::In::PRIMARY)?;
    match orientation_field.value {
        exif::Value::Short(ref v) if !v.is_empty() => Some(v[0] as u32),
        _ => orientation_field.value.get_uint(0),
    }
}

/// Rotate an image by the given angle (counter-clockwise, in degrees).
/// Matches Go's rotate function.
fn rotate(img: DynamicImage, angle: i32) -> DynamicImage {
    let (width, height) = img.dimensions();

    match angle {
        90 => {
            // 90 degrees counter-clockwise
            let new_w = height;
            let new_h = width;
            let mut out = RgbaImage::new(new_w, new_h);
            for y in 0..new_h {
                for x in 0..new_w {
                    out.put_pixel(x, y, img.get_pixel(new_h - 1 - y, x));
                }
            }
            DynamicImage::ImageRgba8(out)
        }
        -90 => {
            // 90 degrees clockwise (or 270 counter-clockwise)
            let new_w = height;
            let new_h = width;
            let mut out = RgbaImage::new(new_w, new_h);
            for y in 0..new_h {
                for x in 0..new_w {
                    out.put_pixel(x, y, img.get_pixel(y, new_w - 1 - x));
                }
            }
            DynamicImage::ImageRgba8(out)
        }
        180 | -180 => {
            let mut out = RgbaImage::new(width, height);
            for y in 0..height {
                for x in 0..width {
                    out.put_pixel(x, y, img.get_pixel(width - 1 - x, height - 1 - y));
                }
            }
            DynamicImage::ImageRgba8(out)
        }
        _ => img,
    }
}

/// Flip the image horizontally if requested.
/// In Go, flipMode 2 == FlipHorizontal. We simplify since only horizontal flip is used.
fn flip_horizontal_if(img: DynamicImage, do_flip: bool) -> DynamicImage {
    if !do_flip {
        return img;
    }
    let (width, height) = img.dimensions();
    let mut out = RgbaImage::new(width, height);
    for y in 0..height {
        for x in 0..width {
            out.put_pixel(x, y, img.get_pixel(width - 1 - x, y));
        }
    }
    DynamicImage::ImageRgba8(out)
}

/// Returns true if the given MIME type or file path extension indicates a JPEG file.
pub fn is_jpeg(mime_type: &str, path: &str) -> bool {
    if mime_type == "image/jpeg" {
        return true;
    }
    let lower = path.to_lowercase();
    lower.ends_with(".jpg") || lower.ends_with(".jpeg")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_non_jpeg_data_returned_unchanged() {
        let data = b"not a jpeg file at all";
        let result = fix_jpg_orientation(data);
        assert_eq!(result, data);
    }

    #[test]
    fn test_jpeg_without_exif_returned_unchanged() {
        // Create a minimal JPEG without EXIF data
        let img = DynamicImage::ImageRgba8(RgbaImage::new(2, 2));
        let mut buf = Cursor::new(Vec::new());
        img.write_to(&mut buf, ImageFormat::Jpeg).unwrap();
        let jpeg_data = buf.into_inner();

        let result = fix_jpg_orientation(&jpeg_data);
        // Should return the data unchanged (no EXIF orientation tag).
        // Just verify it's still a valid JPEG.
        assert!(!result.is_empty());
        assert_eq!(&result[0..2], &[0xFF, 0xD8]); // JPEG magic bytes
    }

    #[test]
    fn test_is_jpeg() {
        assert!(is_jpeg("image/jpeg", ""));
        assert!(is_jpeg("", "/3,abc.jpg"));
        assert!(is_jpeg("", "/3,abc.JPEG"));
        assert!(is_jpeg("application/octet-stream", "/3,abc.JPG"));
        assert!(!is_jpeg("image/png", "/3,abc.png"));
        assert!(!is_jpeg("", "/3,abc.png"));
    }

    #[test]
    fn test_rotate_180() {
        // Create a 2x2 image with distinct pixel colors
        let mut img = RgbaImage::new(2, 2);
        img.put_pixel(0, 0, image::Rgba([255, 0, 0, 255])); // red top-left
        img.put_pixel(1, 0, image::Rgba([0, 255, 0, 255])); // green top-right
        img.put_pixel(0, 1, image::Rgba([0, 0, 255, 255])); // blue bottom-left
        img.put_pixel(1, 1, image::Rgba([255, 255, 0, 255])); // yellow bottom-right
        let dynamic = DynamicImage::ImageRgba8(img);

        let rotated = rotate(dynamic, 180);
        let (w, h) = rotated.dimensions();
        assert_eq!((w, h), (2, 2));
        // After 180 rotation: top-left should be yellow, top-right should be blue
        assert_eq!(rotated.get_pixel(0, 0), image::Rgba([255, 255, 0, 255]));
        assert_eq!(rotated.get_pixel(1, 0), image::Rgba([0, 0, 255, 255]));
        assert_eq!(rotated.get_pixel(0, 1), image::Rgba([0, 255, 0, 255]));
        assert_eq!(rotated.get_pixel(1, 1), image::Rgba([255, 0, 0, 255]));
    }

    #[test]
    fn test_rotate_90_ccw() {
        // Create a 3x2 image (width=3, height=2)
        let mut img = RgbaImage::new(3, 2);
        img.put_pixel(0, 0, image::Rgba([1, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([2, 0, 0, 255]));
        img.put_pixel(2, 0, image::Rgba([3, 0, 0, 255]));
        img.put_pixel(0, 1, image::Rgba([4, 0, 0, 255]));
        img.put_pixel(1, 1, image::Rgba([5, 0, 0, 255]));
        img.put_pixel(2, 1, image::Rgba([6, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let rotated = rotate(dynamic, 90);
        let (w, h) = rotated.dimensions();
        // 90 CCW: width=3, height=2 -> new_w=2, new_h=3
        assert_eq!((w, h), (2, 3));
        // Top-right (2,0) should move to top-left (0,0) in CCW 90
        assert_eq!(rotated.get_pixel(0, 0)[0], 3);
        assert_eq!(rotated.get_pixel(1, 0)[0], 6);
    }

    #[test]
    fn test_rotate_neg90_cw() {
        // Create a 3x2 image
        let mut img = RgbaImage::new(3, 2);
        img.put_pixel(0, 0, image::Rgba([1, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([2, 0, 0, 255]));
        img.put_pixel(2, 0, image::Rgba([3, 0, 0, 255]));
        img.put_pixel(0, 1, image::Rgba([4, 0, 0, 255]));
        img.put_pixel(1, 1, image::Rgba([5, 0, 0, 255]));
        img.put_pixel(2, 1, image::Rgba([6, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let rotated = rotate(dynamic, -90);
        let (w, h) = rotated.dimensions();
        assert_eq!((w, h), (2, 3));
        // -90 (CW 90): top-left (0,0) should go to top-right
        assert_eq!(rotated.get_pixel(0, 0)[0], 4);
        assert_eq!(rotated.get_pixel(1, 0)[0], 1);
    }

    #[test]
    fn test_flip_horizontal() {
        let mut img = RgbaImage::new(2, 1);
        img.put_pixel(0, 0, image::Rgba([10, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([20, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let flipped = flip_horizontal_if(dynamic, true);
        assert_eq!(flipped.get_pixel(0, 0)[0], 20);
        assert_eq!(flipped.get_pixel(1, 0)[0], 10);
    }

    #[test]
    fn test_flip_horizontal_noop() {
        let mut img = RgbaImage::new(2, 1);
        img.put_pixel(0, 0, image::Rgba([10, 0, 0, 255]));
        img.put_pixel(1, 0, image::Rgba([20, 0, 0, 255]));
        let dynamic = DynamicImage::ImageRgba8(img);

        let not_flipped = flip_horizontal_if(dynamic, false);
        assert_eq!(not_flipped.get_pixel(0, 0)[0], 10);
        assert_eq!(not_flipped.get_pixel(1, 0)[0], 20);
    }
}
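The 90-degree branches of `rotate` are pure index remaps. The same CCW-90 mapping (destination (x, y) pulls from source (new_h - 1 - y, x)) can be checked without the `image` crate on a plain row-major byte grid; this std-only sketch is illustrative, not part of the module:

```rust
// Rotate a row-major w x h grid 90 degrees counter-clockwise.
// Mirrors the index mapping used by `rotate` for the 90-degree case.
fn rotate90_ccw(grid: &[u8], w: usize, h: usize) -> Vec<u8> {
    let (new_w, new_h) = (h, w);
    let mut out = vec![0u8; grid.len()];
    for y in 0..new_h {
        for x in 0..new_w {
            // Destination (x, y) pulls from source pixel (sx, sy) = (new_h - 1 - y, x),
            // i.e. row-major source index sy * w + sx.
            out[y * new_w + x] = grid[x * w + (new_h - 1 - y)];
        }
    }
    out
}

fn main() {
    // 3x2 grid:
    // 1 2 3
    // 4 5 6
    let grid = [1, 2, 3, 4, 5, 6];
    let rotated = rotate90_ccw(&grid, 3, 2);
    // 90 CCW brings the rightmost column to the top row:
    // 3 6
    // 2 5
    // 1 4
    assert_eq!(rotated, vec![3, 6, 2, 5, 1, 4]);
    println!("{:?}", rotated);
}
```

This matches the `test_rotate_90_ccw` expectation above: the source's top-right pixel (value 3) lands at the rotated image's top-left.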

seaweed-volume/src/lib.rs (new file, 27 lines)
@@ -0,0 +1,27 @@
pub mod config;
pub mod images;
pub mod metrics;
pub mod remote_storage;
pub mod security;
pub mod server;
pub mod storage;
pub mod version;

/// Generated protobuf modules.
pub mod pb {
    pub const FILE_DESCRIPTOR_SET: &[u8] =
        tonic::include_file_descriptor_set!("seaweed_descriptor");

    pub mod remote_pb {
        tonic::include_proto!("remote_pb");
    }
    pub mod volume_server_pb {
        tonic::include_proto!("volume_server_pb");
    }
    pub mod master_pb {
        tonic::include_proto!("master_pb");
    }
    pub mod filer_pb {
        tonic::include_proto!("filer_pb");
    }
}

seaweed-volume/src/main.rs (new file, 1051 lines)
File diff suppressed because it is too large

seaweed-volume/src/metrics.rs (new file, 448 lines)
@@ -0,0 +1,448 @@
//! Prometheus metrics for the volume server.
//!
//! Mirrors the Go SeaweedFS volume server metrics.

use prometheus::{
    self, Encoder, GaugeVec, HistogramOpts, HistogramVec, IntCounterVec, IntGauge, IntGaugeVec,
    Opts, Registry, TextEncoder,
};
use std::sync::Once;

use crate::version;

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct PushGatewayConfig {
    pub address: String,
    pub interval_seconds: u32,
}

lazy_static::lazy_static! {
    pub static ref REGISTRY: Registry = Registry::new();

    // ---- Request metrics (Go: VolumeServerRequestCounter, VolumeServerRequestHistogram) ----

    /// Request counter with labels `type` (HTTP method) and `code` (HTTP status).
    pub static ref REQUEST_COUNTER: IntCounterVec = IntCounterVec::new(
        Opts::new("SeaweedFS_volumeServer_request_total", "Volume server requests"),
        &["type", "code"],
    ).expect("metric can be created");

    /// Request duration histogram with label `type` (HTTP method).
    pub static ref REQUEST_DURATION: HistogramVec = HistogramVec::new(
        HistogramOpts::new(
            "SeaweedFS_volumeServer_request_seconds",
            "Volume server request duration in seconds",
        ).buckets(exponential_buckets(0.0001, 2.0, 24)),
        &["type"],
    ).expect("metric can be created");

    // ---- Handler counters (Go: VolumeServerHandlerCounter) ----

    /// Handler-level operation counter with label `type`.
    pub static ref HANDLER_COUNTER: IntCounterVec = IntCounterVec::new(
        Opts::new("SeaweedFS_volumeServer_handler_total", "Volume server handler counters"),
        &["type"],
    ).expect("metric can be created");

    // ---- Vacuuming metrics (Go: VolumeServerVacuuming*) ----

    /// Vacuuming compact counter with label `success` (true/false).
    pub static ref VACUUMING_COMPACT_COUNTER: IntCounterVec = IntCounterVec::new(
        Opts::new("SeaweedFS_volumeServer_vacuuming_compact_count", "Counter of volume vacuuming Compact counter"),
        &["success"],
    ).expect("metric can be created");

    /// Vacuuming commit counter with label `success` (true/false).
    pub static ref VACUUMING_COMMIT_COUNTER: IntCounterVec = IntCounterVec::new(
        Opts::new("SeaweedFS_volumeServer_vacuuming_commit_count", "Counter of volume vacuuming commit counter"),
        &["success"],
    ).expect("metric can be created");

    /// Vacuuming duration histogram with label `type` (compact/commit).
    pub static ref VACUUMING_HISTOGRAM: HistogramVec = HistogramVec::new(
        HistogramOpts::new(
            "SeaweedFS_volumeServer_vacuuming_seconds",
            "Volume vacuuming duration in seconds",
        ).buckets(exponential_buckets(0.0001, 2.0, 24)),
        &["type"],
    ).expect("metric can be created");

    // ---- Volume gauges (Go: VolumeServerVolumeGauge, VolumeServerReadOnlyVolumeGauge) ----

    /// Volumes per collection and type (volume/ec_shards).
    pub static ref VOLUME_GAUGE: GaugeVec = GaugeVec::new(
        Opts::new("SeaweedFS_volumeServer_volumes", "Number of volumes"),
        &["collection", "type"],
    ).expect("metric can be created");

    /// Read-only volumes per collection and type.
    pub static ref READ_ONLY_VOLUME_GAUGE: GaugeVec = GaugeVec::new(
        Opts::new("SeaweedFS_volumeServer_read_only_volumes", "Number of read-only volumes."),
        &["collection", "type"],
    ).expect("metric can be created");

    /// Maximum number of volumes this server can hold.
    pub static ref MAX_VOLUMES: IntGauge = IntGauge::new(
        "SeaweedFS_volumeServer_max_volumes",
        "Maximum number of volumes",
    ).expect("metric can be created");

    // ---- Disk size gauges (Go: VolumeServerDiskSizeGauge) ----

    /// Actual disk size used by volumes per collection and type (normal/deleted_bytes/ec).
    pub static ref DISK_SIZE_GAUGE: GaugeVec = GaugeVec::new(
        Opts::new("SeaweedFS_volumeServer_total_disk_size", "Actual disk size used by volumes"),
        &["collection", "type"],
    ).expect("metric can be created");

    // ---- Resource gauges (Go: VolumeServerResourceGauge) ----

    /// Disk resource usage per directory and type (all/used/free/avail).
    pub static ref RESOURCE_GAUGE: GaugeVec = GaugeVec::new(
        Opts::new("SeaweedFS_volumeServer_resource", "Server resource usage"),
        &["name", "type"],
    ).expect("metric can be created");

    // ---- In-flight gauges (Go: VolumeServerInFlightRequestsGauge, InFlightDownload/UploadSize) ----

    /// In-flight requests per HTTP method.
    pub static ref INFLIGHT_REQUESTS_GAUGE: IntGaugeVec = IntGaugeVec::new(
        Opts::new("SeaweedFS_volumeServer_in_flight_requests", "Current number of in-flight requests being handled by volume server."),
        &["type"],
    ).expect("metric can be created");

    /// Concurrent download limit in bytes.
    pub static ref CONCURRENT_DOWNLOAD_LIMIT: IntGauge = IntGauge::new(
        "SeaweedFS_volumeServer_concurrent_download_limit",
        "Limit for total concurrent download size in bytes",
    ).expect("metric can be created");

    /// Concurrent upload limit in bytes.
    pub static ref CONCURRENT_UPLOAD_LIMIT: IntGauge = IntGauge::new(
        "SeaweedFS_volumeServer_concurrent_upload_limit",
        "Limit for total concurrent upload size in bytes",
    ).expect("metric can be created");

    /// Current in-flight download bytes.
    pub static ref INFLIGHT_DOWNLOAD_SIZE: IntGauge = IntGauge::new(
        "SeaweedFS_volumeServer_in_flight_download_size",
        "In flight total download size.",
    ).expect("metric can be created");

    /// Current in-flight upload bytes.
    pub static ref INFLIGHT_UPLOAD_SIZE: IntGauge = IntGauge::new(
        "SeaweedFS_volumeServer_in_flight_upload_size",
        "In flight total upload size.",
    ).expect("metric can be created");

    // ---- Legacy aliases for backward compat with existing code ----

    /// Total number of volumes on this server (flat gauge).
    pub static ref VOLUMES_TOTAL: IntGauge = IntGauge::new(
        "volume_server_volumes_total",
        "Total number of volumes",
    ).expect("metric can be created");

    /// Disk size in bytes per directory.
    pub static ref DISK_SIZE_BYTES: IntGaugeVec = IntGaugeVec::new(
        Opts::new("volume_server_disk_size_bytes", "Disk size in bytes"),
        &["dir"],
    ).expect("metric can be created");

    /// Disk free bytes per directory.
    pub static ref DISK_FREE_BYTES: IntGaugeVec = IntGaugeVec::new(
        Opts::new("volume_server_disk_free_bytes", "Disk free space in bytes"),
        &["dir"],
    ).expect("metric can be created");

    /// Current number of in-flight requests (flat gauge).
    pub static ref INFLIGHT_REQUESTS: IntGauge = IntGauge::new(
        "volume_server_inflight_requests",
        "Current number of in-flight requests",
    ).expect("metric can be created");

    /// Total number of files stored across all volumes.
    pub static ref VOLUME_FILE_COUNT: IntGauge = IntGauge::new(
        "volume_server_volume_file_count",
        "Total number of files stored across all volumes",
    ).expect("metric can be created");

    // ---- Build info (Go: BuildInfo) ----

    /// Build information gauge, always set to 1. Matches Go:
    /// Namespace="SeaweedFS", Subsystem="build", Name="info",
    /// labels: version, commit, sizelimit, goos, goarch.
    pub static ref BUILD_INFO: GaugeVec = GaugeVec::new(
        Opts::new("SeaweedFS_build_info", "A metric with a constant '1' value labeled by version, commit, sizelimit, goos, and goarch from which SeaweedFS was built."),
        &["version", "commit", "sizelimit", "goos", "goarch"],
    ).expect("metric can be created");
}

/// Generate exponential bucket boundaries for histograms.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    let mut buckets = Vec::with_capacity(count);
    let mut val = start;
    for _ in 0..count {
        buckets.push(val);
        val *= factor;
    }
    buckets
}
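The request and vacuuming histograms both use `exponential_buckets(0.0001, 2.0, 24)`, matching what Prometheus Go's `prometheus.ExponentialBuckets(0.0001, 2, 24)` would produce. A quick std-only check of the boundary sequence this yields (same generator as the function above, restated for illustration):

```rust
// Same bucket generation as `exponential_buckets` above:
// start, start*factor, start*factor^2, ... for `count` boundaries.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    let mut buckets = Vec::with_capacity(count);
    let mut val = start;
    for _ in 0..count {
        buckets.push(val);
        val *= factor;
    }
    buckets
}

fn main() {
    let b = exponential_buckets(0.0001, 2.0, 24);
    assert_eq!(b.len(), 24);
    // First boundaries: 0.1 ms, 0.2 ms, 0.4 ms, ...
    assert!((b[0] - 0.0001).abs() < 1e-12);
    assert!((b[1] - 0.0002).abs() < 1e-12);
    // Last boundary: 0.0001 * 2^23, roughly 838.9 seconds.
    assert!((b[23] - 0.0001 * (1u64 << 23) as f64).abs() < 1e-9);
    println!("last bucket = {:.1}s", b[23]);
}
```

So the 24 buckets span sub-millisecond requests up to multi-minute operations, which is why the same shape serves both request latency and vacuuming duration.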

// Handler counter type constants (matches Go's metrics_names.go).
pub const WRITE_TO_LOCAL_DISK: &str = "writeToLocalDisk";
pub const WRITE_TO_REPLICAS: &str = "writeToReplicas";
pub const DOWNLOAD_LIMIT_COND: &str = "downloadLimitCondition";
pub const UPLOAD_LIMIT_COND: &str = "uploadLimitCondition";
pub const READ_PROXY_REQ: &str = "readProxyRequest";
pub const READ_REDIRECT_REQ: &str = "readRedirectRequest";
pub const EMPTY_READ_PROXY_LOC: &str = "emptyReadProxyLocaction";
pub const FAILED_READ_PROXY_REQ: &str = "failedReadProxyRequest";

// Error metric name constants.
pub const ERROR_SIZE_MISMATCH_OFFSET_SIZE: &str = "errorSizeMismatchOffsetSize";
pub const ERROR_SIZE_MISMATCH: &str = "errorSizeMismatch";
pub const ERROR_CRC: &str = "errorCRC";
pub const ERROR_INDEX_OUT_OF_RANGE: &str = "errorIndexOutOfRange";
pub const ERROR_GET_NOT_FOUND: &str = "errorGetNotFound";
pub const ERROR_GET_INTERNAL: &str = "errorGetInternal";
pub const ERROR_WRITE_TO_LOCAL_DISK: &str = "errorWriteToLocalDisk";
pub const ERROR_UNMARSHAL_PAIRS: &str = "errorUnmarshalPairs";
pub const ERROR_WRITE_TO_REPLICAS: &str = "errorWriteToReplicas";

// Go volume heartbeat metric label values.
pub const READ_ONLY_LABEL_IS_READ_ONLY: &str = "IsReadOnly";
pub const READ_ONLY_LABEL_NO_WRITE_OR_DELETE: &str = "noWriteOrDelete";
pub const READ_ONLY_LABEL_NO_WRITE_CAN_DELETE: &str = "noWriteCanDelete";
pub const READ_ONLY_LABEL_IS_DISK_SPACE_LOW: &str = "isDiskSpaceLow";
pub const DISK_SIZE_LABEL_NORMAL: &str = "normal";
pub const DISK_SIZE_LABEL_DELETED_BYTES: &str = "deleted_bytes";
pub const DISK_SIZE_LABEL_EC: &str = "ec";

static REGISTER_METRICS: Once = Once::new();

/// Register all metrics with the custom registry.
/// Call this once at startup.
pub fn register_metrics() {
    REGISTER_METRICS.call_once(|| {
        let metrics: Vec<Box<dyn prometheus::core::Collector>> = vec![
            // New Go-compatible metrics
            Box::new(REQUEST_COUNTER.clone()),
            Box::new(REQUEST_DURATION.clone()),
            Box::new(HANDLER_COUNTER.clone()),
            Box::new(VACUUMING_COMPACT_COUNTER.clone()),
            Box::new(VACUUMING_COMMIT_COUNTER.clone()),
            Box::new(VACUUMING_HISTOGRAM.clone()),
            Box::new(VOLUME_GAUGE.clone()),
            Box::new(READ_ONLY_VOLUME_GAUGE.clone()),
            Box::new(MAX_VOLUMES.clone()),
            Box::new(DISK_SIZE_GAUGE.clone()),
            Box::new(RESOURCE_GAUGE.clone()),
            Box::new(INFLIGHT_REQUESTS_GAUGE.clone()),
            Box::new(CONCURRENT_DOWNLOAD_LIMIT.clone()),
            Box::new(CONCURRENT_UPLOAD_LIMIT.clone()),
            Box::new(INFLIGHT_DOWNLOAD_SIZE.clone()),
            Box::new(INFLIGHT_UPLOAD_SIZE.clone()),
            // Legacy metrics
            Box::new(VOLUMES_TOTAL.clone()),
            Box::new(DISK_SIZE_BYTES.clone()),
            Box::new(DISK_FREE_BYTES.clone()),
            Box::new(INFLIGHT_REQUESTS.clone()),
            Box::new(VOLUME_FILE_COUNT.clone()),
            // Build info
            Box::new(BUILD_INFO.clone()),
        ];
        for m in metrics {
            REGISTRY.register(m).expect("metric registered");
        }

        // Set build info gauge to 1 with version/commit/sizelimit/os/arch labels (matches Go).
        BUILD_INFO
            .with_label_values(&[
                version::version(),
                version::commit(),
                version::size_limit(),
                std::env::consts::OS,
                std::env::consts::ARCH,
            ])
            .set(1.0);
    });
}

/// Gather all metrics and encode them in Prometheus text exposition format.
pub fn gather_metrics() -> String {
    let encoder = TextEncoder::new();
    let metric_families = REGISTRY.gather();
    let mut buffer = Vec::new();
    encoder
        .encode(&metric_families, &mut buffer)
        .expect("encoding metrics");
    String::from_utf8(buffer).expect("metrics are valid UTF-8")
}

pub fn delete_collection_metrics(collection: &str) {
    // Mirrors Go's DeletePartialMatch(prometheus.Labels{"collection": collection})
    // which removes ALL metric entries matching the collection label, regardless
    // of other label values (like "type"). We gather the metric families to discover
    // all type values dynamically, matching Go's partial-match behavior.
    delete_partial_match_collection(&VOLUME_GAUGE, collection);
    delete_partial_match_collection(&READ_ONLY_VOLUME_GAUGE, collection);
    delete_partial_match_collection(&DISK_SIZE_GAUGE, collection);
}

/// Remove all metric entries from a GaugeVec where the "collection" label matches.
/// This emulates Go's `DeletePartialMatch(prometheus.Labels{"collection": collection})`.
fn delete_partial_match_collection(gauge: &GaugeVec, collection: &str) {
    use prometheus::core::Collector;
    let families = gauge.collect();
    for family in &families {
        for metric in family.get_metric() {
            let labels = metric.get_label();
            let mut matches_collection = false;
            let mut type_value = None;
            for label in labels {
                if label.get_name() == "collection" && label.get_value() == collection {
                    matches_collection = true;
                }
                if label.get_name() == "type" {
                    type_value = Some(label.get_value().to_string());
                }
            }
            if matches_collection {
                if let Some(ref tv) = type_value {
                    let _ = gauge.remove_label_values(&[collection, tv]);
                }
            }
        }
    }
}

pub fn build_pushgateway_url(address: &str, job: &str, instance: &str) -> String {
    let base = if address.starts_with("http://") || address.starts_with("https://") {
        address.to_string()
    } else {
        format!("http://{}", address)
    };
    let base = base.trim_end_matches('/');
    format!("{}/metrics/job/{}/instance/{}", base, job, instance)
}
|
||||
|
||||
pub async fn push_metrics_once(
|
||||
client: &reqwest::Client,
|
||||
address: &str,
|
||||
job: &str,
|
||||
instance: &str,
|
||||
) -> Result<(), String> {
|
||||
let url = build_pushgateway_url(address, job, instance);
|
||||
let response = client
|
||||
.put(&url)
|
||||
.header(
|
||||
reqwest::header::CONTENT_TYPE,
|
||||
"text/plain; version=0.0.4; charset=utf-8",
|
||||
)
|
||||
.body(gather_metrics())
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("push metrics request failed: {}", e))?;
|
||||
|
||||
if response.status().is_success() {
|
||||
Ok(())
|
||||
} else {
|
||||
Err(format!(
|
||||
"push metrics failed with status {}",
|
||||
response.status()
|
||||
))
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use axum::{routing::put, Router};
|
||||
use std::sync::{Arc, Mutex};
|
||||
|
||||
#[test]
|
||||
fn test_gather_metrics_returns_text() {
|
||||
register_metrics();
|
||||
REQUEST_COUNTER.with_label_values(&["GET", "200"]).inc();
|
||||
let output = gather_metrics();
|
||||
assert!(output.contains("SeaweedFS_volumeServer_request_total"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_build_pushgateway_url() {
|
||||
assert_eq!(
|
||||
build_pushgateway_url("localhost:9091", "volumeServer", "test-instance"),
|
||||
"http://localhost:9091/metrics/job/volumeServer/instance/test-instance"
|
||||
);
|
||||
assert_eq!(
|
||||
build_pushgateway_url("https://push.example", "volumeServer", "node-a"),
|
||||
"https://push.example/metrics/job/volumeServer/instance/node-a"
|
||||
);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_push_metrics_once() {
|
||||
register_metrics();
|
||||
|
||||
let captured = Arc::new(Mutex::new(None::<String>));
|
||||
let captured_clone = captured.clone();
|
||||
|
||||
let app = Router::new().route(
|
||||
"/metrics/job/volumeServer/instance/test-instance",
|
||||
put(move |body: String| {
|
||||
let captured = captured_clone.clone();
|
||||
async move {
|
||||
*captured.lock().unwrap() = Some(body);
|
||||
"ok"
|
||||
}
|
||||
}),
|
||||
);
|
||||
|
||||
let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
|
||||
let addr = listener.local_addr().unwrap();
|
||||
let server = tokio::spawn(async move {
|
||||
axum::serve(listener, app).await.unwrap();
|
||||
});
|
||||
|
||||
let client = reqwest::Client::new();
|
||||
push_metrics_once(
|
||||
&client,
|
||||
&format!("127.0.0.1:{}", addr.port()),
|
||||
"volumeServer",
|
||||
"test-instance",
|
||||
)
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
let body = captured.lock().unwrap().clone().unwrap();
|
||||
assert!(body.contains("SeaweedFS_volumeServer_request_total"));
|
||||
|
||||
server.abort();
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_delete_collection_metrics_removes_collection_labelsets() {
|
||||
register_metrics();
|
||||
|
||||
VOLUME_GAUGE.with_label_values(&["pics", "volume"]).set(2.0);
|
||||
VOLUME_GAUGE.with_label_values(&["pics", "ec_shards"]).set(3.0);
|
||||
READ_ONLY_VOLUME_GAUGE
|
||||
.with_label_values(&["pics", "volume"])
|
||||
.set(1.0);
|
||||
DISK_SIZE_GAUGE
|
||||
.with_label_values(&["pics", "normal"])
|
||||
.set(10.0);
|
||||
DISK_SIZE_GAUGE
|
||||
.with_label_values(&["pics", "deleted_bytes"])
|
||||
.set(4.0);
|
||||
|
||||
delete_collection_metrics("pics");
|
||||
|
||||
let output = gather_metrics();
|
||||
assert!(!output.contains("collection=\"pics\",type=\"volume\""));
|
||||
assert!(!output.contains("collection=\"pics\",type=\"ec_shards\""));
|
||||
assert!(!output.contains("collection=\"pics\",type=\"normal\""));
|
||||
assert!(!output.contains("collection=\"pics\",type=\"deleted_bytes\""));
|
||||
}
|
||||
}
|
||||
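The address normalization behind `build_pushgateway_url` can be sketched standalone. This is a hypothetical free-standing copy for illustration (the name `pushgateway_url` is not part of the crate): a bare `host:port` gains an `http://` scheme, a trailing slash is trimmed, and the result uses the Pushgateway grouping path `/metrics/job/<job>/instance/<instance>`.

```rust
// Hypothetical standalone sketch mirroring build_pushgateway_url.
fn pushgateway_url(address: &str, job: &str, instance: &str) -> String {
    // Default to http:// when no scheme is given.
    let base = if address.starts_with("http://") || address.starts_with("https://") {
        address.to_string()
    } else {
        format!("http://{}", address)
    };
    // Trim any trailing slash before appending the grouping path.
    format!(
        "{}/metrics/job/{}/instance/{}",
        base.trim_end_matches('/'),
        job,
        instance
    )
}

fn main() {
    // Bare address: scheme is added.
    assert_eq!(
        pushgateway_url("localhost:9091", "volumeServer", "n1"),
        "http://localhost:9091/metrics/job/volumeServer/instance/n1"
    );
    // Explicit https is preserved; the trailing slash is trimmed.
    assert_eq!(
        pushgateway_url("https://push.example/", "volumeServer", "n1"),
        "https://push.example/metrics/job/volumeServer/instance/n1"
    );
    println!("ok");
}
```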
seaweed-volume/src/remote_storage/mod.rs (new file, 157 lines)
@@ -0,0 +1,157 @@
//! Remote storage backends for tiered storage support.
//!
//! Provides a trait-based abstraction over cloud storage providers (S3, GCS, Azure, etc.)
//! and a registry to create clients from protobuf RemoteConf messages.

pub mod s3;
pub mod s3_tier;

use crate::pb::remote_pb::{RemoteConf, RemoteStorageLocation};

/// Error type for remote storage operations.
#[derive(Debug, thiserror::Error)]
pub enum RemoteStorageError {
    #[error("remote storage type {0} not found")]
    TypeNotFound(String),
    #[error("remote object not found: {0}")]
    ObjectNotFound(String),
    #[error("remote storage error: {0}")]
    Other(String),
    #[error("io error: {0}")]
    Io(#[from] std::io::Error),
}

/// Metadata about a remote file entry.
#[derive(Debug, Clone)]
pub struct RemoteEntry {
    pub size: i64,
    pub last_modified_at: i64, // Unix seconds
    pub e_tag: String,
    pub storage_name: String,
}

/// Trait for remote storage clients. Matches Go's RemoteStorageClient interface.
#[async_trait::async_trait]
pub trait RemoteStorageClient: Send + Sync {
    /// Read (part of) a file from remote storage.
    async fn read_file(
        &self,
        loc: &RemoteStorageLocation,
        offset: i64,
        size: i64,
    ) -> Result<Vec<u8>, RemoteStorageError>;

    /// Write a file to remote storage.
    async fn write_file(
        &self,
        loc: &RemoteStorageLocation,
        data: &[u8],
    ) -> Result<RemoteEntry, RemoteStorageError>;

    /// Get metadata for a file in remote storage.
    async fn stat_file(
        &self,
        loc: &RemoteStorageLocation,
    ) -> Result<RemoteEntry, RemoteStorageError>;

    /// Delete a file from remote storage.
    async fn delete_file(&self, loc: &RemoteStorageLocation) -> Result<(), RemoteStorageError>;

    /// List all buckets.
    async fn list_buckets(&self) -> Result<Vec<String>, RemoteStorageError>;

    /// The RemoteConf used to create this client.
    fn remote_conf(&self) -> &RemoteConf;
}

/// Create a new remote storage client from a RemoteConf.
pub fn make_remote_storage_client(
    conf: &RemoteConf,
) -> Result<Box<dyn RemoteStorageClient>, RemoteStorageError> {
    match conf.r#type.as_str() {
        // All S3-compatible backends use the same client with different credentials
        "s3" | "wasabi" | "backblaze" | "aliyun" | "tencent" | "baidu" | "filebase" | "storj"
        | "contabo" => {
            let (access_key, secret_key, endpoint, region) = extract_s3_credentials(conf);
            Ok(Box::new(s3::S3RemoteStorageClient::new(
                conf.clone(),
                &access_key,
                &secret_key,
                &region,
                &endpoint,
                conf.s3_force_path_style,
            )))
        }
        other => Err(RemoteStorageError::TypeNotFound(other.to_string())),
    }
}

/// Extract S3-compatible credentials from a RemoteConf based on its type.
fn extract_s3_credentials(conf: &RemoteConf) -> (String, String, String, String) {
    match conf.r#type.as_str() {
        "s3" => (
            conf.s3_access_key.clone(),
            conf.s3_secret_key.clone(),
            conf.s3_endpoint.clone(),
            if conf.s3_region.is_empty() {
                "us-east-1".to_string()
            } else {
                conf.s3_region.clone()
            },
        ),
        "wasabi" => (
            conf.wasabi_access_key.clone(),
            conf.wasabi_secret_key.clone(),
            conf.wasabi_endpoint.clone(),
            conf.wasabi_region.clone(),
        ),
        "backblaze" => (
            conf.backblaze_key_id.clone(),
            conf.backblaze_application_key.clone(),
            conf.backblaze_endpoint.clone(),
            conf.backblaze_region.clone(),
        ),
        "aliyun" => (
            conf.aliyun_access_key.clone(),
            conf.aliyun_secret_key.clone(),
            conf.aliyun_endpoint.clone(),
            conf.aliyun_region.clone(),
        ),
        "tencent" => (
            conf.tencent_secret_id.clone(),
            conf.tencent_secret_key.clone(),
            conf.tencent_endpoint.clone(),
            String::new(),
        ),
        "baidu" => (
            conf.baidu_access_key.clone(),
            conf.baidu_secret_key.clone(),
            conf.baidu_endpoint.clone(),
            conf.baidu_region.clone(),
        ),
        "filebase" => (
            conf.filebase_access_key.clone(),
            conf.filebase_secret_key.clone(),
            conf.filebase_endpoint.clone(),
            String::new(),
        ),
        "storj" => (
            conf.storj_access_key.clone(),
            conf.storj_secret_key.clone(),
            conf.storj_endpoint.clone(),
            String::new(),
        ),
        "contabo" => (
            conf.contabo_access_key.clone(),
            conf.contabo_secret_key.clone(),
            conf.contabo_endpoint.clone(),
            conf.contabo_region.clone(),
        ),
        _ => (
            conf.s3_access_key.clone(),
            conf.s3_secret_key.clone(),
            conf.s3_endpoint.clone(),
            conf.s3_region.clone(),
        ),
    }
}
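The dispatch rule in `make_remote_storage_client` can be illustrated with a hypothetical predicate (the name `is_s3_compatible` is not part of the crate): the nine listed type strings all share the S3-compatible client, and any other type produces `TypeNotFound`.

```rust
// Hypothetical helper illustrating make_remote_storage_client's dispatch:
// these type strings all map to the shared S3-compatible client.
fn is_s3_compatible(storage_type: &str) -> bool {
    matches!(
        storage_type,
        "s3" | "wasabi" | "backblaze" | "aliyun" | "tencent" | "baidu"
            | "filebase" | "storj" | "contabo"
    )
}

fn main() {
    assert!(is_s3_compatible("wasabi"));
    // Anything else would yield RemoteStorageError::TypeNotFound.
    assert!(!is_s3_compatible("gcs"));
    println!("ok");
}
```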
seaweed-volume/src/remote_storage/s3.rs (new file, 186 lines)
@@ -0,0 +1,186 @@
//! S3-compatible remote storage client.
//!
//! Works with AWS S3, MinIO, SeaweedFS S3, and all S3-compatible providers.

use aws_sdk_s3::config::{BehaviorVersion, Credentials, Region};
use aws_sdk_s3::primitives::ByteStream;
use aws_sdk_s3::Client;

use super::{RemoteEntry, RemoteStorageClient, RemoteStorageError};
use crate::pb::remote_pb::{RemoteConf, RemoteStorageLocation};

/// S3-compatible remote storage client.
pub struct S3RemoteStorageClient {
    client: Client,
    conf: RemoteConf,
}

impl S3RemoteStorageClient {
    /// Create a new S3 client from credentials and endpoint configuration.
    pub fn new(
        conf: RemoteConf,
        access_key: &str,
        secret_key: &str,
        region: &str,
        endpoint: &str,
        force_path_style: bool,
    ) -> Self {
        let region = if region.is_empty() {
            "us-east-1"
        } else {
            region
        };

        let credentials = Credentials::new(
            access_key,
            secret_key,
            None, // session token
            None, // expiry
            "seaweedfs-volume",
        );

        let mut s3_config = aws_sdk_s3::Config::builder()
            .behavior_version(BehaviorVersion::latest())
            .region(Region::new(region.to_string()))
            .credentials_provider(credentials)
            .force_path_style(force_path_style);

        if !endpoint.is_empty() {
            s3_config = s3_config.endpoint_url(endpoint);
        }

        let client = Client::from_conf(s3_config.build());

        S3RemoteStorageClient { client, conf }
    }
}

#[async_trait::async_trait]
impl RemoteStorageClient for S3RemoteStorageClient {
    async fn read_file(
        &self,
        loc: &RemoteStorageLocation,
        offset: i64,
        size: i64,
    ) -> Result<Vec<u8>, RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        let mut req = self.client.get_object().bucket(&loc.bucket).key(key);

        // Set byte range if specified
        if size > 0 {
            let end = offset + size - 1;
            req = req.range(format!("bytes={}-{}", offset, end));
        } else if offset > 0 {
            req = req.range(format!("bytes={}-", offset));
        }

        let resp = req.send().await.map_err(|e| {
            let msg = format!("{}", e);
            if msg.contains("NoSuchKey") || msg.contains("404") {
                RemoteStorageError::ObjectNotFound(format!("{}/{}", loc.bucket, key))
            } else {
                RemoteStorageError::Other(format!("s3 get object: {}", e))
            }
        })?;

        let data = resp
            .body
            .collect()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 read body: {}", e)))?;

        Ok(data.into_bytes().to_vec())
    }

    async fn write_file(
        &self,
        loc: &RemoteStorageLocation,
        data: &[u8],
    ) -> Result<RemoteEntry, RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        let resp = self
            .client
            .put_object()
            .bucket(&loc.bucket)
            .key(key)
            .body(ByteStream::from(data.to_vec()))
            .send()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 put object: {}", e)))?;

        Ok(RemoteEntry {
            size: data.len() as i64,
            last_modified_at: std::time::SystemTime::now()
                .duration_since(std::time::UNIX_EPOCH)
                .unwrap_or_default()
                .as_secs() as i64,
            e_tag: resp.e_tag().unwrap_or_default().to_string(),
            storage_name: loc.name.clone(),
        })
    }

    async fn stat_file(
        &self,
        loc: &RemoteStorageLocation,
    ) -> Result<RemoteEntry, RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        let resp = self
            .client
            .head_object()
            .bucket(&loc.bucket)
            .key(key)
            .send()
            .await
            .map_err(|e| {
                let msg = format!("{}", e);
                if msg.contains("404") || msg.contains("NotFound") {
                    RemoteStorageError::ObjectNotFound(format!("{}/{}", loc.bucket, key))
                } else {
                    RemoteStorageError::Other(format!("s3 head object: {}", e))
                }
            })?;

        Ok(RemoteEntry {
            size: resp.content_length().unwrap_or(0),
            last_modified_at: resp.last_modified().map(|t| t.secs()).unwrap_or(0),
            e_tag: resp.e_tag().unwrap_or_default().to_string(),
            storage_name: loc.name.clone(),
        })
    }

    async fn delete_file(&self, loc: &RemoteStorageLocation) -> Result<(), RemoteStorageError> {
        let key = loc.path.trim_start_matches('/');

        self.client
            .delete_object()
            .bucket(&loc.bucket)
            .key(key)
            .send()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 delete object: {}", e)))?;

        Ok(())
    }

    async fn list_buckets(&self) -> Result<Vec<String>, RemoteStorageError> {
        let resp = self
            .client
            .list_buckets()
            .send()
            .await
            .map_err(|e| RemoteStorageError::Other(format!("s3 list buckets: {}", e)))?;

        Ok(resp
            .buckets()
            .iter()
            .filter_map(|b| b.name().map(String::from))
            .collect())
    }

    fn remote_conf(&self) -> &RemoteConf {
        &self.conf
    }
}
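The Range header rule in `read_file` can be sketched as a hypothetical standalone helper (the name `range_header` is for illustration only): a positive size yields an inclusive `bytes=start-end` range, a positive offset with no size yields an open-ended suffix read, and otherwise no Range header is sent (a full-object GET).

```rust
// Hypothetical sketch of read_file's byte-range logic above.
fn range_header(offset: i64, size: i64) -> Option<String> {
    if size > 0 {
        // HTTP ranges are inclusive on both ends, hence the -1.
        Some(format!("bytes={}-{}", offset, offset + size - 1))
    } else if offset > 0 {
        // Open-ended read from offset to the end of the object.
        Some(format!("bytes={}-", offset))
    } else {
        // No Range header: fetch the whole object.
        None
    }
}

fn main() {
    assert_eq!(range_header(0, 10).as_deref(), Some("bytes=0-9"));
    assert_eq!(range_header(5, 0).as_deref(), Some("bytes=5-"));
    assert_eq!(range_header(0, 0), None);
    println!("ok");
}
```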
seaweed-volume/src/remote_storage/s3_tier.rs (new file, 514 lines)
@@ -0,0 +1,514 @@
|
||||
//! S3-compatible tiered storage backend for volume .dat file upload/download.
|
||||
//!
|
||||
//! Provides multipart upload and concurrent download with progress callbacks,
|
||||
//! matching the Go SeaweedFS S3 backend behavior.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::future::Future;
|
||||
use std::sync::{Arc, OnceLock, RwLock};
|
||||
|
||||
use aws_sdk_s3::config::{BehaviorVersion, Credentials, Region};
|
||||
use aws_sdk_s3::types::{CompletedMultipartUpload, CompletedPart};
|
||||
use aws_sdk_s3::Client;
|
||||
use tokio::io::{AsyncReadExt, AsyncSeekExt, AsyncWriteExt};
|
||||
use tokio::sync::Semaphore;
|
||||
|
||||
/// Concurrency limit for multipart upload/download (matches Go's s3manager).
|
||||
const CONCURRENCY: usize = 5;
|
||||
|
||||
/// Configuration for an S3 tier backend.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct S3TierConfig {
|
||||
pub access_key: String,
|
||||
pub secret_key: String,
|
||||
pub region: String,
|
||||
pub bucket: String,
|
||||
pub endpoint: String,
|
||||
pub storage_class: String,
|
||||
pub force_path_style: bool,
|
||||
}
|
||||
|
||||
/// S3 tier backend for uploading/downloading volume .dat files.
|
||||
pub struct S3TierBackend {
|
||||
client: Client,
|
||||
pub bucket: String,
|
||||
pub storage_class: String,
|
||||
}
|
||||
|
||||
impl S3TierBackend {
|
||||
/// Create a new S3 tier backend from configuration.
|
||||
pub fn new(config: &S3TierConfig) -> Self {
|
||||
let region = if config.region.is_empty() {
|
||||
"us-east-1"
|
||||
} else {
|
||||
&config.region
|
||||
};
|
||||
|
||||
let credentials = Credentials::new(
|
||||
&config.access_key,
|
||||
&config.secret_key,
|
||||
None,
|
||||
None,
|
||||
"seaweedfs-volume-tier",
|
||||
);
|
||||
|
||||
let mut s3_config = aws_sdk_s3::Config::builder()
|
||||
.behavior_version(BehaviorVersion::latest())
|
||||
.region(Region::new(region.to_string()))
|
||||
.credentials_provider(credentials)
|
||||
.force_path_style(config.force_path_style);
|
||||
|
||||
if !config.endpoint.is_empty() {
|
||||
s3_config = s3_config.endpoint_url(&config.endpoint);
|
||||
}
|
||||
|
||||
let client = Client::from_conf(s3_config.build());
|
||||
|
||||
S3TierBackend {
|
||||
client,
|
||||
bucket: config.bucket.clone(),
|
||||
storage_class: if config.storage_class.is_empty() {
|
||||
"STANDARD_IA".to_string()
|
||||
} else {
|
||||
config.storage_class.clone()
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
/// Upload a local file to S3 using multipart upload with concurrent parts
|
||||
/// and progress reporting.
|
||||
///
|
||||
/// Returns (s3_key, file_size) on success.
|
||||
/// The progress callback receives (bytes_uploaded, percentage).
|
||||
/// Uses 64MB part size and 5 concurrent uploads (matches Go s3manager).
|
||||
pub async fn upload_file<F>(
|
||||
&self,
|
||||
file_path: &str,
|
||||
progress_fn: F,
|
||||
) -> Result<(String, u64), String>
|
||||
where
|
||||
F: FnMut(i64, f32) + Send + Sync + 'static,
|
||||
{
|
||||
let key = uuid::Uuid::new_v4().to_string();
|
||||
|
||||
let metadata = tokio::fs::metadata(file_path)
|
||||
.await
|
||||
.map_err(|e| format!("failed to stat file {}: {}", file_path, e))?;
|
||||
let file_size = metadata.len();
|
||||
|
||||
// Calculate part size: start at 64MB, scale up for very large files (matches Go)
|
||||
let mut part_size: u64 = 64 * 1024 * 1024;
|
||||
while part_size * 1000 < file_size {
|
||||
part_size *= 4;
|
||||
}
|
||||
|
||||
// Initiate multipart upload
|
||||
let create_resp = self
|
||||
.client
|
||||
.create_multipart_upload()
|
||||
.bucket(&self.bucket)
|
||||
.key(&key)
|
||||
.storage_class(
|
||||
self.storage_class
|
||||
.parse()
|
||||
.unwrap_or(aws_sdk_s3::types::StorageClass::StandardIa),
|
||||
)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to create multipart upload: {}", e))?;
|
||||
|
||||
let upload_id = create_resp
|
||||
.upload_id()
|
||||
.ok_or_else(|| "no upload_id in multipart upload response".to_string())?
|
||||
.to_string();
|
||||
|
||||
// Build list of (part_number, offset, size) for all parts
|
||||
let mut parts_plan: Vec<(i32, u64, usize)> = Vec::new();
|
||||
let mut offset: u64 = 0;
|
||||
let mut part_number: i32 = 1;
|
||||
while offset < file_size {
|
||||
let remaining = file_size - offset;
|
||||
let this_part_size = std::cmp::min(part_size, remaining) as usize;
|
||||
parts_plan.push((part_number, offset, this_part_size));
|
||||
offset += this_part_size as u64;
|
||||
part_number += 1;
|
||||
}
|
||||
|
||||
// Upload parts concurrently with a semaphore limiting to CONCURRENCY
|
||||
let semaphore = Arc::new(Semaphore::new(CONCURRENCY));
|
||||
let client = &self.client;
|
||||
let bucket = &self.bucket;
|
||||
let file_path_owned = file_path.to_string();
|
||||
let progress = Arc::new(std::sync::Mutex::new((0u64, progress_fn)));
|
||||
|
||||
let mut handles = Vec::with_capacity(parts_plan.len());
|
||||
for (pn, off, size) in parts_plan {
|
||||
let sem = semaphore.clone();
|
||||
let client = client.clone();
|
||||
let bucket = bucket.clone();
|
||||
let key = key.clone();
|
||||
let upload_id = upload_id.clone();
|
||||
let fp = file_path_owned.clone();
|
||||
let progress = progress.clone();
|
||||
|
||||
handles.push(tokio::spawn(async move {
|
||||
let _permit = sem
|
||||
.acquire()
|
||||
.await
|
||||
.map_err(|e| format!("semaphore error: {}", e))?;
|
||||
|
||||
// Read this part's data from the file at the correct offset
|
||||
let mut file = tokio::fs::File::open(&fp)
|
||||
.await
|
||||
.map_err(|e| format!("failed to open file {}: {}", fp, e))?;
|
||||
file.seek(std::io::SeekFrom::Start(off))
|
||||
.await
|
||||
.map_err(|e| format!("failed to seek to offset {}: {}", off, e))?;
|
||||
let mut buf = vec![0u8; size];
|
||||
file.read_exact(&mut buf)
|
||||
.await
|
||||
.map_err(|e| format!("failed to read file at offset {}: {}", off, e))?;
|
||||
|
||||
let upload_part_resp = client
|
||||
.upload_part()
|
||||
.bucket(&bucket)
|
||||
.key(&key)
|
||||
.upload_id(&upload_id)
|
||||
.part_number(pn)
|
||||
.body(buf.into())
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| {
|
||||
format!("failed to upload part {} at offset {}: {}", pn, off, e)
|
||||
})?;
|
||||
|
||||
let e_tag = upload_part_resp.e_tag().unwrap_or_default().to_string();
|
||||
|
||||
// Report progress
|
||||
{
|
||||
let mut guard = progress.lock().unwrap();
|
||||
guard.0 += size as u64;
|
||||
let uploaded = guard.0;
|
||||
let pct = if file_size > 0 {
|
||||
(uploaded as f32 * 100.0) / file_size as f32
|
||||
} else {
|
||||
100.0
|
||||
};
|
||||
(guard.1)(uploaded as i64, pct);
|
||||
}
|
||||
|
||||
Ok::<_, String>(
|
||||
CompletedPart::builder()
|
||||
.e_tag(e_tag)
|
||||
.part_number(pn)
|
||||
.build(),
|
||||
)
|
||||
}));
|
||||
}
|
||||
|
||||
// Collect results, preserving part order
|
||||
let mut completed_parts = Vec::with_capacity(handles.len());
|
||||
for handle in handles {
|
||||
let part = handle
|
||||
.await
|
||||
.map_err(|e| format!("upload task panicked: {}", e))??;
|
||||
completed_parts.push(part);
|
||||
}
|
||||
|
||||
// Complete multipart upload
|
||||
let completed_upload = CompletedMultipartUpload::builder()
|
||||
.set_parts(Some(completed_parts))
|
||||
.build();
|
||||
|
||||
self.client
|
||||
.complete_multipart_upload()
|
||||
.bucket(&self.bucket)
|
||||
.key(&key)
|
||||
.upload_id(&upload_id)
|
||||
.multipart_upload(completed_upload)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to complete multipart upload: {}", e))?;
|
||||
|
||||
Ok((key, file_size))
|
||||
}
|
||||
|
||||
/// Download a file from S3 to a local path with concurrent range requests
|
||||
/// and progress reporting.
|
||||
///
|
||||
/// Returns the file size on success.
|
||||
/// Uses 64MB part size and 5 concurrent downloads (matches Go s3manager).
|
||||
pub async fn download_file<F>(
|
||||
&self,
|
||||
dest_path: &str,
|
||||
key: &str,
|
||||
progress_fn: F,
|
||||
) -> Result<u64, String>
|
||||
where
|
||||
F: FnMut(i64, f32) + Send + Sync + 'static,
|
||||
{
|
||||
// Get file size first
|
||||
let head_resp = self
|
||||
.client
|
||||
.head_object()
|
||||
.bucket(&self.bucket)
|
||||
.key(key)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to head object {}: {}", key, e))?;
|
||||
|
||||
let file_size = head_resp.content_length().unwrap_or(0) as u64;
|
||||
|
||||
// Pre-allocate file to full size so concurrent WriteAt-style writes work
|
||||
{
|
||||
let file = tokio::fs::OpenOptions::new()
|
||||
.write(true)
|
||||
.create(true)
|
||||
.truncate(true)
|
||||
.open(dest_path)
|
||||
.await
|
||||
.map_err(|e| format!("failed to open dest file {}: {}", dest_path, e))?;
|
||||
file.set_len(file_size)
|
||||
.await
|
||||
.map_err(|e| format!("failed to set file length: {}", e))?;
|
||||
}
|
||||
|
||||
let part_size: u64 = 64 * 1024 * 1024;
|
||||
|
||||
// Build list of (offset, size) for all parts
|
||||
let mut parts_plan: Vec<(u64, u64)> = Vec::new();
|
||||
let mut offset: u64 = 0;
|
||||
while offset < file_size {
|
||||
let remaining = file_size - offset;
|
||||
let this_part_size = std::cmp::min(part_size, remaining);
|
||||
parts_plan.push((offset, this_part_size));
|
||||
offset += this_part_size;
|
||||
}
|
||||
|
||||
// Download parts concurrently with a semaphore limiting to CONCURRENCY
|
||||
let semaphore = Arc::new(Semaphore::new(CONCURRENCY));
|
||||
let client = &self.client;
|
||||
let bucket = &self.bucket;
|
||||
let dest_path_owned = dest_path.to_string();
|
||||
let key_owned = key.to_string();
|
||||
let progress = Arc::new(std::sync::Mutex::new((0u64, progress_fn)));
|
||||
|
||||
let mut handles = Vec::with_capacity(parts_plan.len());
|
||||
for (off, size) in parts_plan {
|
||||
let sem = semaphore.clone();
|
||||
let client = client.clone();
|
||||
let bucket = bucket.clone();
|
||||
let key = key_owned.clone();
|
||||
let dp = dest_path_owned.clone();
|
||||
let progress = progress.clone();
|
||||
|
||||
handles.push(tokio::spawn(async move {
|
||||
let _permit = sem
|
||||
.acquire()
|
||||
.await
|
||||
.map_err(|e| format!("semaphore error: {}", e))?;
|
||||
|
||||
let end = off + size - 1;
|
||||
let range = format!("bytes={}-{}", off, end);
|
||||
|
||||
let get_resp = client
|
||||
.get_object()
|
||||
.bucket(&bucket)
|
||||
.key(&key)
|
||||
.range(&range)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to get object {} range {}: {}", key, range, e))?;
|
||||
|
||||
let body = get_resp
|
||||
.body
|
||||
.collect()
|
||||
.await
|
||||
.map_err(|e| format!("failed to read body: {}", e))?;
|
||||
let bytes = body.into_bytes();
|
||||
|
||||
// Write at the correct offset (like Go's WriteAt)
|
||||
let mut file = tokio::fs::OpenOptions::new()
|
||||
.write(true)
|
||||
.open(&dp)
|
||||
.await
|
||||
.map_err(|e| format!("failed to open dest file {}: {}", dp, e))?;
|
||||
file.seek(std::io::SeekFrom::Start(off))
|
||||
.await
|
||||
.map_err(|e| format!("failed to seek to offset {}: {}", off, e))?;
|
||||
file.write_all(&bytes)
|
||||
.await
|
||||
.map_err(|e| format!("failed to write to {}: {}", dp, e))?;
|
||||
|
||||
// Report progress
|
||||
{
|
||||
let mut guard = progress.lock().unwrap();
|
||||
guard.0 += bytes.len() as u64;
|
||||
let downloaded = guard.0;
|
||||
let pct = if file_size > 0 {
|
||||
(downloaded as f32 * 100.0) / file_size as f32
|
||||
} else {
|
||||
100.0
|
||||
};
|
||||
(guard.1)(downloaded as i64, pct);
|
||||
}
|
||||
|
||||
Ok::<_, String>(())
|
||||
}));
|
||||
}
|
||||
|
||||
// Wait for all download tasks
|
||||
for handle in handles {
|
||||
handle
|
||||
.await
|
||||
.map_err(|e| format!("download task panicked: {}", e))??;
|
||||
}
|
||||
|
||||
Ok(file_size)
|
||||
}
|
||||
|
||||
pub async fn read_range(&self, key: &str, offset: u64, size: usize) -> Result<Vec<u8>, String> {
|
||||
let end = offset + (size as u64).saturating_sub(1);
|
||||
let range = format!("bytes={}-{}", offset, end);
|
||||
let resp = self
|
||||
.client
|
||||
.get_object()
|
||||
.bucket(&self.bucket)
|
||||
.key(key)
|
||||
.range(&range)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to get object {} range {}: {}", key, range, e))?;
|
||||
|
||||
let body = resp
|
||||
.body
|
||||
.collect()
|
||||
.await
|
||||
.map_err(|e| format!("failed to read object {} body: {}", key, e))?;
|
||||
Ok(body.into_bytes().to_vec())
|
||||
}
|
||||
|
||||
/// Delete a file from S3.
|
||||
pub async fn delete_file(&self, key: &str) -> Result<(), String> {
|
||||
self.client
|
||||
.delete_object()
|
||||
.bucket(&self.bucket)
|
||||
.key(key)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to delete object {}: {}", key, e))?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub fn delete_file_blocking(&self, key: &str) -> Result<(), String> {
|
||||
let client = self.client.clone();
|
||||
let bucket = self.bucket.clone();
|
||||
let key = key.to_string();
|
||||
block_on_tier_future(async move {
|
||||
client
|
||||
.delete_object()
|
||||
.bucket(&bucket)
|
||||
.key(&key)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to delete object {}: {}", key, e))?;
|
||||
Ok(())
|
||||
})
|
||||
}
|
||||
|
||||
pub fn read_range_blocking(
|
||||
&self,
|
||||
key: &str,
|
||||
offset: u64,
|
||||
size: usize,
|
||||
) -> Result<Vec<u8>, String> {
|
||||
let client = self.client.clone();
|
||||
let bucket = self.bucket.clone();
|
||||
let key = key.to_string();
|
||||
block_on_tier_future(async move {
|
||||
let end = offset + (size as u64).saturating_sub(1);
|
||||
let range = format!("bytes={}-{}", offset, end);
|
||||
let resp = client
|
||||
.get_object()
|
||||
.bucket(&bucket)
|
||||
.key(&key)
|
||||
.range(&range)
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| format!("failed to get object {} range {}: {}", key, range, e))?;
|
||||
|
||||
let body = resp
|
||||
.body
|
||||
.collect()
|
||||
.await
|
||||
.map_err(|e| format!("failed to read object {} body: {}", key, e))?;
|
||||
Ok(body.into_bytes().to_vec())
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
/// Parse a backend name like "s3" or "s3.default" into (backend_type, backend_id).
/// Matches Go's `BackendNameToTypeId`.
pub fn backend_name_to_type_id(backend_name: &str) -> (String, String) {
    let parts: Vec<&str> = backend_name.split('.').collect();
    match parts.len() {
        1 => (backend_name.to_string(), "default".to_string()),
        2 => (parts[0].to_string(), parts[1].to_string()),
        _ => (String::new(), String::new()),
    }
}

/// A registry of configured S3 tier backends, keyed by backend name (e.g., "s3.default").
#[derive(Default)]
pub struct S3TierRegistry {
    backends: HashMap<String, Arc<S3TierBackend>>,
}

impl S3TierRegistry {
    pub fn new() -> Self {
        Self {
            backends: HashMap::new(),
        }
    }

    /// Register a backend with the given name.
    pub fn register(&mut self, name: String, backend: S3TierBackend) {
        self.backends.insert(name, Arc::new(backend));
    }

    /// Look up a backend by name.
    pub fn get(&self, name: &str) -> Option<Arc<S3TierBackend>> {
        self.backends.get(name).cloned()
    }

    /// List all registered backend names.
    pub fn names(&self) -> Vec<String> {
        self.backends.keys().cloned().collect()
    }

    /// Remove all registered backends.
    pub fn clear(&mut self) {
        self.backends.clear();
    }
}

static GLOBAL_S3_TIER_REGISTRY: OnceLock<RwLock<S3TierRegistry>> = OnceLock::new();

pub fn global_s3_tier_registry() -> &'static RwLock<S3TierRegistry> {
    GLOBAL_S3_TIER_REGISTRY.get_or_init(|| RwLock::new(S3TierRegistry::new()))
}

/// Run a tier future to completion on a dedicated thread with its own
/// single-threaded tokio runtime, so callers can stay in blocking code.
fn block_on_tier_future<F, T>(future: F) -> Result<T, String>
where
    F: Future<Output = Result<T, String>> + Send + 'static,
    T: Send + 'static,
{
    std::thread::spawn(move || {
        let runtime = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .map_err(|e| format!("failed to build tokio runtime: {}", e))?;
        runtime.block_on(future)
    })
    .join()
    .map_err(|_| "tier runtime thread panicked".to_string())?
}
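The backend-name parsing rule above is worth pinning down with concrete cases. A standalone sketch (the function body is duplicated here purely for illustration, so the snippet compiles on its own):

```rust
// Standalone copy of the name-parsing rule:
//   "s3"         -> ("s3", "default")   (no dot: backend id defaults)
//   "s3.archive" -> ("s3", "archive")   (one dot: explicit backend id)
//   anything else -> ("", "")           (two or more dots: rejected)
fn backend_name_to_type_id(backend_name: &str) -> (String, String) {
    let parts: Vec<&str> = backend_name.split('.').collect();
    match parts.len() {
        1 => (backend_name.to_string(), "default".to_string()),
        2 => (parts[0].to_string(), parts[1].to_string()),
        _ => (String::new(), String::new()),
    }
}

fn main() {
    assert_eq!(
        backend_name_to_type_id("s3"),
        ("s3".to_string(), "default".to_string())
    );
    assert_eq!(
        backend_name_to_type_id("s3.archive"),
        ("s3".to_string(), "archive".to_string())
    );
    // More than one dot is treated as invalid, matching Go's BackendNameToTypeId.
    assert_eq!(
        backend_name_to_type_id("a.b.c"),
        (String::new(), String::new())
    );
}
```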

seaweed-volume/src/security.rs (new file, 481 lines)
@@ -0,0 +1,481 @@

//! Security: JWT validation and IP whitelist checking.
//!
//! Matches Go's security/guard.go and security/jwt.go.
//! - Guard: combines whitelist IP checking with JWT token validation
//! - JWT: HS256 HMAC signing with file-id claims

pub mod tls;

use std::collections::HashSet;
use std::net::IpAddr;
use std::time::{SystemTime, UNIX_EPOCH};

use jsonwebtoken::{decode, encode, Algorithm, DecodingKey, EncodingKey, Header, Validation};
use serde::{Deserialize, Serialize};

// ============================================================================
// JWT Claims
// ============================================================================

/// Claims for volume server file access tokens.
/// Matches Go's `SeaweedFileIdClaims`.
#[derive(Debug, Serialize, Deserialize)]
pub struct FileIdClaims {
    /// File ID this token grants access to (e.g., "3,01637037d6").
    #[serde(skip_serializing_if = "Option::is_none")]
    pub fid: Option<String>,

    /// Expiration time (Unix timestamp).
    #[serde(skip_serializing_if = "Option::is_none")]
    pub exp: Option<u64>,

    /// Not before (Unix timestamp).
    #[serde(skip_serializing_if = "Option::is_none")]
    pub nbf: Option<u64>,
}

/// Signing key wrapper (empty = security disabled).
#[derive(Clone)]
pub struct SigningKey(pub Vec<u8>);

impl SigningKey {
    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }

    pub fn from_string(s: &str) -> Self {
        SigningKey(s.as_bytes().to_vec())
    }
}

/// Generate a JWT token for file access.
pub fn gen_jwt(
    signing_key: &SigningKey,
    expires_after_sec: i64,
    file_id: &str,
) -> Result<String, JwtError> {
    if signing_key.is_empty() {
        return Err(JwtError::NoSigningKey);
    }

    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs();

    let claims = FileIdClaims {
        fid: Some(file_id.to_string()),
        exp: if expires_after_sec > 0 {
            Some(now + expires_after_sec as u64)
        } else {
            None
        },
        nbf: None,
    };

    let token = encode(
        &Header::new(Algorithm::HS256),
        &claims,
        &EncodingKey::from_secret(&signing_key.0),
    )?;

    Ok(token)
}

/// Decode and validate a JWT token.
pub fn decode_jwt(signing_key: &SigningKey, token: &str) -> Result<FileIdClaims, JwtError> {
    if signing_key.is_empty() {
        return Err(JwtError::NoSigningKey);
    }

    let mut validation = Validation::new(Algorithm::HS256);
    // Match Go behavior: tokens without exp are accepted (Go's jwt-go does not require exp),
    // but if exp IS present, it must be valid (not expired).
    validation.required_spec_claims.clear();
    validation.validate_exp = true;
    // Go's jwt-go/v5 validates nbf when present.
    validation.validate_nbf = true;
    validation.leeway = 0;

    let data = decode::<FileIdClaims>(
        token,
        &DecodingKey::from_secret(&signing_key.0),
        &validation,
    )?;

    Ok(data.claims)
}

// ============================================================================
// Guard
// ============================================================================

/// Security guard: IP whitelist + JWT token validation.
pub struct Guard {
    whitelist_ips: HashSet<String>,
    whitelist_cidrs: Vec<(IpAddr, u8)>, // (network, prefix_len)
    pub signing_key: SigningKey,
    pub expires_after_sec: i64,
    pub read_signing_key: SigningKey,
    pub read_expires_after_sec: i64,
    /// Combined flag: true when the whitelist is non-empty OR a signing key is present.
    /// Matches Go's `isWriteActive = !isEmptyWhiteList || len(SigningKey) != 0`.
    is_write_active: bool,
}

impl Guard {
    pub fn new(
        whitelist: &[String],
        signing_key: SigningKey,
        expires_after_sec: i64,
        read_signing_key: SigningKey,
        read_expires_after_sec: i64,
    ) -> Self {
        let mut guard = Guard {
            whitelist_ips: HashSet::new(),
            whitelist_cidrs: Vec::new(),
            signing_key,
            expires_after_sec,
            read_signing_key,
            read_expires_after_sec,
            is_write_active: false,
        };
        guard.update_whitelist(whitelist);
        guard
    }

    /// Update the IP whitelist.
    pub fn update_whitelist(&mut self, entries: &[String]) {
        self.whitelist_ips.clear();
        self.whitelist_cidrs.clear();

        for entry in entries {
            let entry = entry.trim();
            if entry.is_empty() {
                continue;
            }
            if entry.contains('/') {
                // CIDR range
                if let Some((ip, prefix)) = parse_cidr(entry) {
                    self.whitelist_cidrs.push((ip, prefix));
                } else {
                    tracing::error!("Parse CIDR {} in whitelist failed", entry);
                }
            } else {
                // Exact IP/hostname
                self.whitelist_ips.insert(entry.to_string());
            }
        }

        // Match Go: isWriteActive = !isEmptyWhiteList || len(SigningKey) != 0
        let is_empty_whitelist = self.whitelist_ips.is_empty() && self.whitelist_cidrs.is_empty();
        self.is_write_active = !is_empty_whitelist || !self.signing_key.is_empty();
    }

    /// Check if a remote IP is in the whitelist.
    /// Returns true if write security is inactive (no whitelist and no signing key),
    /// if the whitelist is empty, or if the IP matches.
    pub fn check_whitelist(&self, remote_addr: &str) -> bool {
        if !self.is_write_active {
            return true;
        }
        if self.whitelist_ips.is_empty() && self.whitelist_cidrs.is_empty() {
            return true;
        }

        let host = extract_host(remote_addr);

        // Check exact match
        if self.whitelist_ips.contains(&host) {
            return true;
        }

        // Check CIDR ranges
        if let Ok(ip) = host.parse::<IpAddr>() {
            for &(ref network, prefix_len) in &self.whitelist_cidrs {
                if ip_in_cidr(&ip, network, prefix_len) {
                    return true;
                }
            }
        }

        false
    }

    /// Check if a read signing key is configured.
    pub fn has_read_signing_key(&self) -> bool {
        !self.read_signing_key.is_empty()
    }

    /// Validate a request's JWT token.
    /// `is_write` determines which signing key to use.
    /// Returns Ok(()) if valid, or if security is disabled.
    pub fn check_jwt(&self, token: Option<&str>, is_write: bool) -> Result<(), JwtError> {
        let key = if is_write {
            &self.signing_key
        } else {
            &self.read_signing_key
        };

        if key.is_empty() {
            return Ok(()); // Security disabled for this operation type
        }

        let token = token.ok_or(JwtError::MissingToken)?;
        decode_jwt(key, token)?;
        Ok(())
    }

    /// Check JWT and validate that the file ID claim matches.
    pub fn check_jwt_for_file(
        &self,
        token: Option<&str>,
        expected_fid: &str,
        is_write: bool,
    ) -> Result<(), JwtError> {
        let key = if is_write {
            &self.signing_key
        } else {
            &self.read_signing_key
        };

        if key.is_empty() {
            return Ok(());
        }

        let token = token.ok_or(JwtError::MissingToken)?;
        let claims = decode_jwt(key, token)?;

        match claims.fid {
            None => {
                return Err(JwtError::MissingFileIdClaim);
            }
            Some(ref fid) if fid != expected_fid => {
                return Err(JwtError::FileIdMismatch {
                    expected: expected_fid.to_string(),
                    got: fid.to_string(),
                });
            }
            _ => {}
        }

        Ok(())
    }
}

// ============================================================================
// Helpers
// ============================================================================

/// Extract the host from "host:port" or "[::1]:port" format.
fn extract_host(addr: &str) -> String {
    // Handle IPv6 with brackets
    if addr.starts_with('[') {
        if let Some(end) = addr.find(']') {
            return addr[1..end].to_string();
        }
    }
    // Handle host:port
    if let Some(pos) = addr.rfind(':') {
        return addr[..pos].to_string();
    }
    addr.to_string()
}

/// Parse CIDR notation "192.168.1.0/24" into (IpAddr, prefix_len).
fn parse_cidr(cidr: &str) -> Option<(IpAddr, u8)> {
    let parts: Vec<&str> = cidr.split('/').collect();
    if parts.len() != 2 {
        return None;
    }
    let ip: IpAddr = parts[0].parse().ok()?;
    let prefix: u8 = parts[1].parse().ok()?;
    Some((ip, prefix))
}

/// Check if an IP is within a CIDR range.
fn ip_in_cidr(ip: &IpAddr, network: &IpAddr, prefix_len: u8) -> bool {
    match (ip, network) {
        (IpAddr::V4(ip), IpAddr::V4(net)) => {
            let ip_bits = u32::from(*ip);
            let net_bits = u32::from(*net);
            let mask = if prefix_len == 0 {
                0
            } else if prefix_len >= 32 {
                u32::MAX
            } else {
                u32::MAX << (32 - prefix_len)
            };
            (ip_bits & mask) == (net_bits & mask)
        }
        (IpAddr::V6(ip), IpAddr::V6(net)) => {
            let ip_bits = u128::from(*ip);
            let net_bits = u128::from(*net);
            let mask = if prefix_len == 0 {
                0
            } else if prefix_len >= 128 {
                u128::MAX
            } else {
                u128::MAX << (128 - prefix_len)
            };
            (ip_bits & mask) == (net_bits & mask)
        }
        _ => false, // V4/V6 mismatch
    }
}

// ============================================================================
// Errors
// ============================================================================

#[derive(Debug, thiserror::Error)]
pub enum JwtError {
    #[error("no signing key configured")]
    NoSigningKey,

    #[error("missing JWT token")]
    MissingToken,

    #[error("JWT error: {0}")]
    Jwt(#[from] jsonwebtoken::errors::Error),

    #[error("JWT token missing required fid claim")]
    MissingFileIdClaim,

    #[error("file ID mismatch: expected {expected}, got {got}")]
    FileIdMismatch { expected: String, got: String },
}

// ============================================================================
// Tests
// ============================================================================

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_jwt_round_trip() {
        let key = SigningKey::from_string("test-secret-key");
        let token = gen_jwt(&key, 3600, "3,01637037d6").unwrap();
        let claims = decode_jwt(&key, &token).unwrap();
        assert_eq!(claims.fid, Some("3,01637037d6".to_string()));
    }

    #[test]
    fn test_jwt_no_signing_key() {
        let key = SigningKey(vec![]);
        assert!(gen_jwt(&key, 3600, "1,abc").is_err());
    }

    #[test]
    fn test_jwt_invalid_token() {
        let key = SigningKey::from_string("secret");
        let result = decode_jwt(&key, "invalid.token.here");
        assert!(result.is_err());
    }

    #[test]
    fn test_jwt_wrong_key() {
        let key1 = SigningKey::from_string("secret1");
        let key2 = SigningKey::from_string("secret2");
        let token = gen_jwt(&key1, 3600, "1,abc").unwrap();
        assert!(decode_jwt(&key2, &token).is_err());
    }

    #[test]
    fn test_guard_empty_whitelist() {
        let guard = Guard::new(&[], SigningKey(vec![]), 0, SigningKey(vec![]), 0);
        assert!(guard.check_whitelist("192.168.1.1:8080"));
    }

    #[test]
    fn test_guard_whitelist_exact() {
        let guard = Guard::new(
            &["192.168.1.1".to_string(), "10.0.0.1".to_string()],
            SigningKey(vec![]),
            0,
            SigningKey(vec![]),
            0,
        );
        assert!(guard.check_whitelist("192.168.1.1:8080"));
        assert!(guard.check_whitelist("10.0.0.1:1234"));
        assert!(!guard.check_whitelist("172.16.0.1:8080"));
    }

    #[test]
    fn test_guard_whitelist_cidr() {
        let guard = Guard::new(
            &["10.0.0.0/8".to_string()],
            SigningKey(vec![]),
            0,
            SigningKey(vec![]),
            0,
        );
        assert!(guard.check_whitelist("10.1.2.3:8080"));
        assert!(guard.check_whitelist("10.255.255.255:80"));
        assert!(!guard.check_whitelist("11.0.0.1:80"));
    }

    #[test]
    fn test_guard_check_jwt_disabled() {
        let guard = Guard::new(&[], SigningKey(vec![]), 0, SigningKey(vec![]), 0);
        // No signing key = security disabled
        assert!(guard.check_jwt(None, true).is_ok());
        assert!(guard.check_jwt(None, false).is_ok());
    }

    #[test]
    fn test_guard_check_jwt_enabled() {
        let key = SigningKey::from_string("write-secret");
        let read_key = SigningKey::from_string("read-secret");
        let guard = Guard::new(&[], key.clone(), 3600, read_key.clone(), 3600);

        // Missing token
        assert!(guard.check_jwt(None, true).is_err());

        // Valid write token
        let token = gen_jwt(&key, 3600, "1,abc").unwrap();
        assert!(guard.check_jwt(Some(&token), true).is_ok());

        // Write token for read should fail (different key)
        assert!(guard.check_jwt(Some(&token), false).is_err());

        // Valid read token
        let read_token = gen_jwt(&read_key, 3600, "1,abc").unwrap();
        assert!(guard.check_jwt(Some(&read_token), false).is_ok());
    }

    #[test]
    fn test_guard_check_jwt_file_id() {
        let key = SigningKey::from_string("secret");
        let guard = Guard::new(&[], key.clone(), 3600, SigningKey(vec![]), 0);

        let token = gen_jwt(&key, 3600, "3,01637037d6").unwrap();

        // Correct file ID
        assert!(guard
            .check_jwt_for_file(Some(&token), "3,01637037d6", true)
            .is_ok());

        // Wrong file ID
        let err = guard.check_jwt_for_file(Some(&token), "4,deadbeef", true);
        assert!(matches!(err, Err(JwtError::FileIdMismatch { .. })));
    }

    #[test]
    fn test_extract_host() {
        assert_eq!(extract_host("192.168.1.1:8080"), "192.168.1.1");
        assert_eq!(extract_host("[::1]:8080"), "::1");
        assert_eq!(extract_host("localhost"), "localhost");
    }

    #[test]
    fn test_ip_in_cidr() {
        let net: IpAddr = "10.0.0.0".parse().unwrap();
        let ip1: IpAddr = "10.1.2.3".parse().unwrap();
        let ip2: IpAddr = "11.0.0.1".parse().unwrap();
        assert!(ip_in_cidr(&ip1, &net, 8));
        assert!(!ip_in_cidr(&ip2, &net, 8));
    }
}
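The CIDR check in `ip_in_cidr` reduces to a bit-mask comparison; the IPv4 half can be sketched standalone to make the mask math concrete (this duplicates the mask logic for illustration only):

```rust
use std::net::Ipv4Addr;

// IPv4 CIDR membership: keep only the top `prefix_len` bits and compare.
// A /8 prefix compares just the first octet; /0 matches everything.
fn ipv4_in_cidr(ip: Ipv4Addr, net: Ipv4Addr, prefix_len: u8) -> bool {
    let mask = if prefix_len == 0 {
        0
    } else if prefix_len >= 32 {
        u32::MAX
    } else {
        // Shifting by the full width is UB in Rust, hence the branches above.
        u32::MAX << (32 - prefix_len)
    };
    (u32::from(ip) & mask) == (u32::from(net) & mask)
}

fn main() {
    let net = Ipv4Addr::new(10, 0, 0, 0);
    assert!(ipv4_in_cidr(Ipv4Addr::new(10, 1, 2, 3), net, 8));
    assert!(!ipv4_in_cidr(Ipv4Addr::new(11, 0, 0, 1), net, 8));
    // /0 matches any address.
    assert!(ipv4_in_cidr(Ipv4Addr::new(203, 0, 113, 7), net, 0));
}
```

The explicit `prefix_len == 0` and `>= 32` branches mirror the production code: `u32::MAX << 32` would panic in debug builds, so the edge prefixes are handled before shifting.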

seaweed-volume/src/security/tls.rs (new file, 437 lines)
@@ -0,0 +1,437 @@

use std::collections::HashSet;
use std::fmt;
use std::sync::Arc;

use rustls::client::danger::HandshakeSignatureValid;
use rustls::crypto::aws_lc_rs;
use rustls::crypto::CryptoProvider;
use rustls::pki_types::UnixTime;
use rustls::pki_types::{CertificateDer, PrivateKeyDer};
use rustls::server::danger::{ClientCertVerified, ClientCertVerifier};
use rustls::server::WebPkiClientVerifier;
use rustls::{
    CipherSuite, DigitallySignedStruct, DistinguishedName, RootCertStore, ServerConfig,
    SignatureScheme, SupportedCipherSuite, SupportedProtocolVersion,
};
use x509_parser::prelude::{FromDer, X509Certificate};

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct TlsPolicy {
    pub min_version: String,
    pub max_version: String,
    pub cipher_suites: String,
}

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct GrpcClientAuthPolicy {
    pub allowed_common_names: Vec<String>,
    pub allowed_wildcard_domain: String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TlsPolicyError(String);

impl fmt::Display for TlsPolicyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl std::error::Error for TlsPolicyError {}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum GoTlsVersion {
    Ssl3,
    Tls10,
    Tls11,
    Tls12,
    Tls13,
}

#[derive(Debug)]
struct CommonNameVerifier {
    inner: Arc<dyn ClientCertVerifier>,
    allowed_common_names: HashSet<String>,
    allowed_wildcard_domain: String,
}

impl ClientCertVerifier for CommonNameVerifier {
    fn offer_client_auth(&self) -> bool {
        self.inner.offer_client_auth()
    }

    fn client_auth_mandatory(&self) -> bool {
        self.inner.client_auth_mandatory()
    }

    fn root_hint_subjects(&self) -> &[DistinguishedName] {
        self.inner.root_hint_subjects()
    }

    fn verify_client_cert(
        &self,
        end_entity: &CertificateDer<'_>,
        intermediates: &[CertificateDer<'_>],
        now: UnixTime,
    ) -> Result<ClientCertVerified, rustls::Error> {
        self.inner
            .verify_client_cert(end_entity, intermediates, now)?;
        let common_name = parse_common_name(end_entity).map_err(|e| {
            rustls::Error::General(format!(
                "parse client certificate common name failed: {}",
                e
            ))
        })?;
        if common_name_is_allowed(
            &common_name,
            &self.allowed_common_names,
            &self.allowed_wildcard_domain,
        ) {
            return Ok(ClientCertVerified::assertion());
        }
        Err(rustls::Error::General(format!(
            "Authenticate: invalid subject client common name: {}",
            common_name
        )))
    }

    fn verify_tls12_signature(
        &self,
        message: &[u8],
        cert: &CertificateDer<'_>,
        dss: &DigitallySignedStruct,
    ) -> Result<HandshakeSignatureValid, rustls::Error> {
        self.inner.verify_tls12_signature(message, cert, dss)
    }

    fn verify_tls13_signature(
        &self,
        message: &[u8],
        cert: &CertificateDer<'_>,
        dss: &DigitallySignedStruct,
    ) -> Result<HandshakeSignatureValid, rustls::Error> {
        self.inner.verify_tls13_signature(message, cert, dss)
    }

    fn supported_verify_schemes(&self) -> Vec<SignatureScheme> {
        self.inner.supported_verify_schemes()
    }
}

pub fn build_rustls_server_config(
    cert_path: &str,
    key_path: &str,
    ca_path: &str,
    policy: &TlsPolicy,
) -> Result<ServerConfig, TlsPolicyError> {
    build_rustls_server_config_with_client_auth(cert_path, key_path, ca_path, policy, None)
}

pub fn build_rustls_server_config_with_grpc_client_auth(
    cert_path: &str,
    key_path: &str,
    ca_path: &str,
    policy: &TlsPolicy,
    client_auth_policy: &GrpcClientAuthPolicy,
) -> Result<ServerConfig, TlsPolicyError> {
    build_rustls_server_config_with_client_auth(
        cert_path,
        key_path,
        ca_path,
        policy,
        Some(client_auth_policy),
    )
}

fn build_rustls_server_config_with_client_auth(
    cert_path: &str,
    key_path: &str,
    ca_path: &str,
    policy: &TlsPolicy,
    client_auth_policy: Option<&GrpcClientAuthPolicy>,
) -> Result<ServerConfig, TlsPolicyError> {
    let cert_chain = read_cert_chain(cert_path)?;
    let private_key = read_private_key(key_path)?;
    let provider = build_crypto_provider(policy)?;
    let versions = build_supported_versions(policy)?;

    let builder = ServerConfig::builder_with_provider(provider.clone())
        .with_protocol_versions(&versions)
        .map_err(|e| TlsPolicyError(format!("invalid TLS version policy: {}", e)))?;

    let builder = if ca_path.is_empty() {
        builder.with_no_client_auth()
    } else {
        let roots = read_root_store(ca_path)?;
        let verifier =
            WebPkiClientVerifier::builder_with_provider(Arc::new(roots), provider.clone())
                .build()
                .map_err(|e| TlsPolicyError(format!("build client verifier failed: {}", e)))?;
        let verifier: Arc<dyn ClientCertVerifier> = if let Some(client_auth_policy) =
            client_auth_policy.filter(|policy| {
                !policy.allowed_common_names.is_empty()
                    || !policy.allowed_wildcard_domain.is_empty()
            }) {
            Arc::new(CommonNameVerifier {
                inner: verifier,
                allowed_common_names: client_auth_policy
                    .allowed_common_names
                    .iter()
                    .cloned()
                    .collect(),
                allowed_wildcard_domain: client_auth_policy.allowed_wildcard_domain.clone(),
            })
        } else {
            verifier
        };
        builder.with_client_cert_verifier(verifier)
    };

    builder
        .with_single_cert(cert_chain, private_key)
        .map_err(|e| TlsPolicyError(format!("build rustls server config failed: {}", e)))
}

fn read_cert_chain(cert_path: &str) -> Result<Vec<CertificateDer<'static>>, TlsPolicyError> {
    let cert_pem = std::fs::read(cert_path).map_err(|e| {
        TlsPolicyError(format!(
            "Failed to read TLS cert file '{}': {}",
            cert_path, e
        ))
    })?;
    rustls_pemfile::certs(&mut &cert_pem[..])
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| {
            TlsPolicyError(format!(
                "Failed to parse TLS cert PEM '{}': {}",
                cert_path, e
            ))
        })
}

fn read_private_key(key_path: &str) -> Result<PrivateKeyDer<'static>, TlsPolicyError> {
    let key_pem = std::fs::read(key_path).map_err(|e| {
        TlsPolicyError(format!("Failed to read TLS key file '{}': {}", key_path, e))
    })?;
    rustls_pemfile::private_key(&mut &key_pem[..])
        .map_err(|e| TlsPolicyError(format!("Failed to parse TLS key PEM '{}': {}", key_path, e)))?
        .ok_or_else(|| TlsPolicyError(format!("No private key found in '{}'", key_path)))
}

fn read_root_store(ca_path: &str) -> Result<RootCertStore, TlsPolicyError> {
    let ca_pem = std::fs::read(ca_path)
        .map_err(|e| TlsPolicyError(format!("Failed to read TLS CA file '{}': {}", ca_path, e)))?;
    let ca_certs = rustls_pemfile::certs(&mut &ca_pem[..])
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| TlsPolicyError(format!("Failed to parse TLS CA PEM '{}': {}", ca_path, e)))?;
    let mut roots = RootCertStore::empty();
    for cert in ca_certs {
        roots
            .add(cert)
            .map_err(|e| TlsPolicyError(format!("Failed to add CA cert '{}': {}", ca_path, e)))?;
    }
    Ok(roots)
}

fn build_crypto_provider(policy: &TlsPolicy) -> Result<Arc<CryptoProvider>, TlsPolicyError> {
    let mut provider = aws_lc_rs::default_provider();
    let cipher_suites = parse_cipher_suites(&provider.cipher_suites, &policy.cipher_suites)?;
    if !cipher_suites.is_empty() {
        provider.cipher_suites = cipher_suites;
    }
    Ok(Arc::new(provider))
}

pub fn build_supported_versions(
    policy: &TlsPolicy,
) -> Result<Vec<&'static SupportedProtocolVersion>, TlsPolicyError> {
    let min_version = parse_go_tls_version(&policy.min_version)?;
    let max_version = parse_go_tls_version(&policy.max_version)?;
    let versions = [&rustls::version::TLS13, &rustls::version::TLS12]
        .into_iter()
        .filter(|version| {
            let current = go_tls_version_for_supported(version);
            min_version.map(|min| current >= min).unwrap_or(true)
                && max_version.map(|max| current <= max).unwrap_or(true)
        })
        .collect::<Vec<_>>();

    if versions.is_empty() {
        return Err(TlsPolicyError(format!(
            "TLS version range min='{}' max='{}' is unsupported by rustls",
            policy.min_version, policy.max_version
        )));
    }

    Ok(versions)
}

fn parse_go_tls_version(value: &str) -> Result<Option<GoTlsVersion>, TlsPolicyError> {
    match value.trim() {
        "" => Ok(None),
        "SSLv3" => Ok(Some(GoTlsVersion::Ssl3)),
        "TLS 1.0" => Ok(Some(GoTlsVersion::Tls10)),
        "TLS 1.1" => Ok(Some(GoTlsVersion::Tls11)),
        "TLS 1.2" => Ok(Some(GoTlsVersion::Tls12)),
        "TLS 1.3" => Ok(Some(GoTlsVersion::Tls13)),
        other => Err(TlsPolicyError(format!("invalid TLS version {}", other))),
    }
}

fn parse_cipher_suites(
    available: &[SupportedCipherSuite],
    value: &str,
) -> Result<Vec<SupportedCipherSuite>, TlsPolicyError> {
    let trimmed = value.trim();
    if trimmed.is_empty() {
        return Ok(Vec::new());
    }

    trimmed
        .split(',')
        .map(|name| {
            let suite = parse_cipher_suite_name(name.trim())?;
            available
                .iter()
                .copied()
                .find(|candidate| candidate.suite() == suite)
                .ok_or_else(|| {
                    TlsPolicyError(format!(
                        "TLS cipher suite '{}' is unsupported by the Rust implementation",
                        name.trim()
                    ))
                })
        })
        .collect()
}

fn parse_cipher_suite_name(value: &str) -> Result<CipherSuite, TlsPolicyError> {
    match value {
        "TLS_AES_128_GCM_SHA256" => Ok(CipherSuite::TLS13_AES_128_GCM_SHA256),
        "TLS_AES_256_GCM_SHA384" => Ok(CipherSuite::TLS13_AES_256_GCM_SHA384),
        "TLS_CHACHA20_POLY1305_SHA256" => Ok(CipherSuite::TLS13_CHACHA20_POLY1305_SHA256),
        "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256)
        }
        "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384" => {
            Ok(CipherSuite::TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384)
        }
        "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256)
        }
        "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256)
        }
        "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384" => {
            Ok(CipherSuite::TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
        }
        "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256" => {
            Ok(CipherSuite::TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256)
        }
        other => Err(TlsPolicyError(format!(
            "TLS cipher suite '{}' is unsupported by the Rust implementation",
            other
        ))),
    }
}

fn parse_common_name(cert: &CertificateDer<'_>) -> Result<String, TlsPolicyError> {
    let (_, certificate) = X509Certificate::from_der(cert.as_ref())
        .map_err(|e| TlsPolicyError(format!("parse X.509 certificate failed: {}", e)))?;
    let common_name = certificate
        .subject()
        .iter_common_name()
        .next()
        .and_then(|common_name| common_name.as_str().ok())
        .map(str::to_string);
    match common_name {
        Some(common_name) => Ok(common_name),
        None => Ok(String::new()),
    }
}

fn common_name_is_allowed(
    common_name: &str,
    allowed_common_names: &HashSet<String>,
    allowed_wildcard_domain: &str,
) -> bool {
    (!allowed_wildcard_domain.is_empty() && common_name.ends_with(allowed_wildcard_domain))
        || allowed_common_names.contains(common_name)
}

fn go_tls_version_for_supported(version: &SupportedProtocolVersion) -> GoTlsVersion {
    match version.version {
        rustls::ProtocolVersion::TLSv1_2 => GoTlsVersion::Tls12,
        rustls::ProtocolVersion::TLSv1_3 => GoTlsVersion::Tls13,
        _ => unreachable!("rustls only exposes TLS 1.2 and 1.3"),
    }
}

#[cfg(test)]
mod tests {
    use super::{build_supported_versions, common_name_is_allowed, parse_cipher_suites, TlsPolicy};
    use rustls::crypto::aws_lc_rs;
    use std::collections::HashSet;

    #[test]
    fn test_build_supported_versions_defaults_to_tls12_and_tls13() {
        let versions = build_supported_versions(&TlsPolicy::default()).unwrap();
        assert_eq!(
            versions,
            vec![&rustls::version::TLS13, &rustls::version::TLS12]
        );
    }

    #[test]
    fn test_build_supported_versions_filters_to_tls13() {
        let versions = build_supported_versions(&TlsPolicy {
            min_version: "TLS 1.3".to_string(),
            max_version: "TLS 1.3".to_string(),
            cipher_suites: String::new(),
        })
        .unwrap();
        assert_eq!(versions, vec![&rustls::version::TLS13]);
    }

    #[test]
    fn test_build_supported_versions_rejects_unsupported_legacy_range() {
        let err = build_supported_versions(&TlsPolicy {
            min_version: "TLS 1.0".to_string(),
            max_version: "TLS 1.1".to_string(),
            cipher_suites: String::new(),
        })
        .unwrap_err();
        assert!(err.to_string().contains("unsupported by rustls"));
    }

    #[test]
    fn test_parse_cipher_suites_accepts_go_names() {
        let cipher_suites = parse_cipher_suites(
            &aws_lc_rs::default_provider().cipher_suites,
            "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_AES_128_GCM_SHA256",
        )
        .unwrap();
        assert_eq!(cipher_suites.len(), 2);
    }

    #[test]
    fn test_common_name_is_allowed_matches_exact_and_wildcard() {
        let allowed_common_names =
            HashSet::from([String::from("volume-a.internal"), String::from("worker-7")]);
        assert!(common_name_is_allowed(
            "volume-a.internal",
            &allowed_common_names,
            "",
        ));
        assert!(common_name_is_allowed(
            "node.prod.example.com",
            &allowed_common_names,
            ".example.com",
        ));
        assert!(!common_name_is_allowed(
            "node.prod.other.net",
            &allowed_common_names,
            ".example.com",
        ));
    }
}
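The common-name policy above is a two-clause rule: exact membership in the allow set, or suffix match against a configured wildcard domain. A standalone sketch (duplicating the one-line predicate for illustration):

```rust
use std::collections::HashSet;

// Exact match against the allow set, OR suffix match against the wildcard
// domain when one is configured (empty wildcard disables the suffix clause).
fn common_name_is_allowed(cn: &str, allowed: &HashSet<String>, wildcard: &str) -> bool {
    (!wildcard.is_empty() && cn.ends_with(wildcard)) || allowed.contains(cn)
}

fn main() {
    let allowed: HashSet<String> = ["worker-7".to_string()].into_iter().collect();
    // Exact name passes even with no wildcard configured.
    assert!(common_name_is_allowed("worker-7", &allowed, ""));
    // Wildcard clause admits any CN ending in the configured domain.
    assert!(common_name_is_allowed("node.example.com", &allowed, ".example.com"));
    // Neither clause matches here.
    assert!(!common_name_is_allowed("node.other.net", &allowed, ".example.com"));
}
```

Note the wildcard clause is a plain `ends_with`, so `.example.com` also admits `evilexample.com` only if it literally ends with the dot-prefixed string; the leading dot in the configured domain is what prevents that.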
|
||||
seaweed-volume/src/server/debug.rs (new file, 159 lines)
@@ -0,0 +1,159 @@
use axum::body::Body;
use axum::extract::Query;
use axum::http::{header, StatusCode};
use axum::response::{IntoResponse, Response};
use axum::routing::{any, get};
use axum::Router;
use pprof::protos::Message;
use serde::Deserialize;

#[derive(Deserialize, Default)]
struct ProfileQuery {
    seconds: Option<u64>,
}

pub fn build_debug_router() -> Router {
    Router::new()
        .route("/debug/pprof/", get(pprof_index_handler))
        .route("/debug/pprof/cmdline", get(pprof_cmdline_handler))
        .route("/debug/pprof/profile", get(pprof_profile_handler))
        .route("/debug/pprof/symbol", any(pprof_symbol_handler))
        .route("/debug/pprof/trace", get(pprof_trace_handler))
}

async fn pprof_index_handler() -> Response {
    let body = concat!(
        "<html><head><title>/debug/pprof/</title></head><body>",
        "<a href=\"cmdline\">cmdline</a><br>",
        "<a href=\"profile\">profile</a><br>",
        "<a href=\"symbol\">symbol</a><br>",
        "<a href=\"trace\">trace</a><br>",
        "</body></html>",
    );
    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "text/html; charset=utf-8")],
        body,
    )
        .into_response()
}

async fn pprof_cmdline_handler() -> Response {
    let body = std::env::args().collect::<Vec<_>>().join("\0");
    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "text/plain; charset=utf-8")],
        body,
    )
        .into_response()
}

async fn pprof_profile_handler(Query(query): Query<ProfileQuery>) -> Response {
    let seconds = query.seconds.unwrap_or(30).clamp(1, 300);
    let guard = match pprof::ProfilerGuard::new(100) {
        Ok(guard) => guard,
        Err(e) => {
            return (
                StatusCode::INTERNAL_SERVER_ERROR,
                format!("failed to start profiler: {}", e),
            )
                .into_response();
        }
    };

    tokio::time::sleep(std::time::Duration::from_secs(seconds)).await;

    let report = match guard.report().build() {
        Ok(report) => report,
        Err(e) => {
            return (
                StatusCode::INTERNAL_SERVER_ERROR,
                format!("failed to build profile report: {}", e),
            )
                .into_response();
        }
    };

    let profile = match report.pprof() {
        Ok(profile) => profile,
        Err(e) => {
            return (
                StatusCode::INTERNAL_SERVER_ERROR,
                format!("failed to encode profile: {}", e),
            )
                .into_response();
        }
    };

    let mut bytes = Vec::new();
    if let Err(e) = profile.encode(&mut bytes) {
        return (
            StatusCode::INTERNAL_SERVER_ERROR,
            format!("failed to serialize profile: {}", e),
        )
            .into_response();
    }

    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "application/octet-stream")],
        bytes,
    )
        .into_response()
}

async fn pprof_symbol_handler() -> Response {
    (
        StatusCode::OK,
        [(header::CONTENT_TYPE, "text/plain; charset=utf-8")],
        "num_symbols: 0\n",
    )
        .into_response()
}

async fn pprof_trace_handler(Query(query): Query<ProfileQuery>) -> Response {
    let seconds = query.seconds.unwrap_or(1).clamp(1, 30);
    tokio::time::sleep(std::time::Duration::from_secs(seconds)).await;
    Response::builder()
        .status(StatusCode::OK)
        .header(header::CONTENT_TYPE, "application/octet-stream")
        .body(Body::from(Vec::<u8>::new()))
        .unwrap()
}

#[cfg(test)]
mod tests {
    use super::*;
    use axum::http::Request;
    use tower::ServiceExt;

    #[tokio::test]
    async fn test_debug_index_route() {
        let app = build_debug_router();
        let response = app
            .oneshot(
                Request::builder()
                    .uri("/debug/pprof/")
                    .body(Body::empty())
                    .unwrap(),
            )
            .await
            .unwrap();
        assert_eq!(response.status(), StatusCode::OK);
    }

    #[tokio::test]
    async fn test_debug_cmdline_route() {
        let app = build_debug_router();
        let response = app
            .oneshot(
                Request::builder()
                    .uri("/debug/pprof/cmdline")
                    .body(Body::empty())
                    .unwrap(),
            )
            .await
            .unwrap();
        assert_eq!(response.status(), StatusCode::OK);
    }
}
BIN seaweed-volume/src/server/favicon.ico (new binary file, 70 B; binary file not shown)
seaweed-volume/src/server/grpc_client.rs (new file, 206 lines)
@@ -0,0 +1,206 @@
use std::error::Error;
use std::fmt;
use std::time::Duration;

use hyper::http::Uri;
use tonic::transport::{Certificate, Channel, ClientTlsConfig, Endpoint, Identity};

use crate::config::VolumeServerConfig;

pub const GRPC_MAX_MESSAGE_SIZE: usize = 1 << 30;
const GRPC_KEEPALIVE_INTERVAL: Duration = Duration::from_secs(60);
const GRPC_KEEPALIVE_TIMEOUT: Duration = Duration::from_secs(20);
const GRPC_INITIAL_WINDOW_SIZE: u32 = 16 * 1024 * 1024;

#[derive(Clone, Debug)]
pub struct OutgoingGrpcTlsConfig {
    cert_pem: String,
    key_pem: String,
    ca_pem: String,
}

#[derive(Debug)]
pub struct GrpcClientError(String);

impl fmt::Display for GrpcClientError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl Error for GrpcClientError {}

pub fn load_outgoing_grpc_tls(
    config: &VolumeServerConfig,
) -> Result<Option<OutgoingGrpcTlsConfig>, GrpcClientError> {
    if config.grpc_cert_file.is_empty()
        || config.grpc_key_file.is_empty()
        || config.grpc_ca_file.is_empty()
    {
        return Ok(None);
    }

    let cert_pem = std::fs::read_to_string(&config.grpc_cert_file).map_err(|e| {
        GrpcClientError(format!(
            "Failed to read outgoing gRPC cert '{}': {}",
            config.grpc_cert_file, e
        ))
    })?;
    let key_pem = std::fs::read_to_string(&config.grpc_key_file).map_err(|e| {
        GrpcClientError(format!(
            "Failed to read outgoing gRPC key '{}': {}",
            config.grpc_key_file, e
        ))
    })?;
    let ca_pem = std::fs::read_to_string(&config.grpc_ca_file).map_err(|e| {
        GrpcClientError(format!(
            "Failed to read outgoing gRPC CA '{}': {}",
            config.grpc_ca_file, e
        ))
    })?;

    Ok(Some(OutgoingGrpcTlsConfig {
        cert_pem,
        key_pem,
        ca_pem,
    }))
}

pub fn grpc_endpoint_uri(grpc_host_port: &str, tls: Option<&OutgoingGrpcTlsConfig>) -> String {
    let scheme = if tls.is_some() { "https" } else { "http" };
    format!("{}://{}", scheme, grpc_host_port)
}

pub fn build_grpc_endpoint(
    grpc_host_port: &str,
    tls: Option<&OutgoingGrpcTlsConfig>,
) -> Result<Endpoint, GrpcClientError> {
    let uri = grpc_endpoint_uri(grpc_host_port, tls);
    let mut endpoint = Channel::from_shared(uri.clone())
        .map_err(|e| GrpcClientError(format!("invalid gRPC endpoint {}: {}", uri, e)))?
        .http2_keep_alive_interval(GRPC_KEEPALIVE_INTERVAL)
        .keep_alive_timeout(GRPC_KEEPALIVE_TIMEOUT)
        .keep_alive_while_idle(false)
        .initial_stream_window_size(Some(GRPC_INITIAL_WINDOW_SIZE))
        .initial_connection_window_size(Some(GRPC_INITIAL_WINDOW_SIZE))
        .http2_adaptive_window(false);

    if let Some(tls) = tls {
        let parsed = uri
            .parse::<Uri>()
            .map_err(|e| GrpcClientError(format!("invalid gRPC endpoint {}: {}", uri, e)))?;
        let host = parsed
            .host()
            .ok_or_else(|| GrpcClientError(format!("missing host in gRPC endpoint {}", uri)))?;
        let tls_config = ClientTlsConfig::new()
            .identity(Identity::from_pem(
                tls.cert_pem.clone(),
                tls.key_pem.clone(),
            ))
            .ca_certificate(Certificate::from_pem(tls.ca_pem.clone()))
            .domain_name(host.to_string());
        endpoint = endpoint.tls_config(tls_config).map_err(|e| {
            GrpcClientError(format!("configure gRPC TLS for {} failed: {}", uri, e))
        })?;
    }

    Ok(endpoint)
}

#[cfg(test)]
mod tests {
    use super::{build_grpc_endpoint, grpc_endpoint_uri, load_outgoing_grpc_tls};
    use crate::config::{NeedleMapKind, ReadMode, VolumeServerConfig};
    use crate::security::tls::TlsPolicy;

    fn sample_config() -> VolumeServerConfig {
        VolumeServerConfig {
            port: 8080,
            grpc_port: 18080,
            public_port: 8080,
            ip: "127.0.0.1".to_string(),
            bind_ip: String::new(),
            public_url: "127.0.0.1:8080".to_string(),
            id: String::new(),
            masters: vec![],
            pre_stop_seconds: 0,
            idle_timeout: 0,
            data_center: String::new(),
            rack: String::new(),
            index_type: NeedleMapKind::InMemory,
            disk_type: String::new(),
            folders: vec![],
            folder_max_limits: vec![],
            folder_tags: vec![],
            min_free_spaces: vec![],
            disk_types: vec![],
            idx_folder: String::new(),
            white_list: vec![],
            fix_jpg_orientation: false,
            read_mode: ReadMode::Local,
            cpu_profile: String::new(),
            mem_profile: String::new(),
            compaction_byte_per_second: 0,
            maintenance_byte_per_second: 0,
            file_size_limit_bytes: 0,
            concurrent_upload_limit: 0,
            concurrent_download_limit: 0,
            inflight_upload_data_timeout: std::time::Duration::from_secs(0),
            inflight_download_data_timeout: std::time::Duration::from_secs(0),
            has_slow_read: false,
            read_buffer_size_mb: 0,
            ldb_timeout: 0,
            pprof: false,
            metrics_port: 0,
            metrics_ip: String::new(),
            debug: false,
            debug_port: 0,
            ui_enabled: false,
            jwt_signing_key: vec![],
            jwt_signing_expires_seconds: 0,
            jwt_read_signing_key: vec![],
            jwt_read_signing_expires_seconds: 0,
            https_cert_file: String::new(),
            https_key_file: String::new(),
            https_ca_file: String::new(),
            https_client_enabled: false,
            https_client_cert_file: String::new(),
            https_client_key_file: String::new(),
            https_client_ca_file: String::new(),
            grpc_cert_file: String::new(),
            grpc_key_file: String::new(),
            grpc_ca_file: String::new(),
            grpc_allowed_wildcard_domain: String::new(),
            grpc_volume_allowed_common_names: vec![],
            tls_policy: TlsPolicy::default(),
            enable_write_queue: false,
            security_file: String::new(),
        }
    }

    #[test]
    fn test_grpc_endpoint_uri_uses_https_when_tls_enabled() {
        let tls = super::OutgoingGrpcTlsConfig {
            cert_pem: "cert".to_string(),
            key_pem: "key".to_string(),
            ca_pem: "ca".to_string(),
        };
        assert_eq!(
            grpc_endpoint_uri("master.example.com:19333", Some(&tls)),
            "https://master.example.com:19333"
        );
    }

    #[test]
    fn test_load_outgoing_grpc_tls_requires_cert_key_and_ca() {
        let mut config = sample_config();
        config.grpc_cert_file = "/tmp/client.pem".to_string();
        assert!(load_outgoing_grpc_tls(&config).unwrap().is_none());
    }

    #[test]
    fn test_build_grpc_endpoint_without_tls_uses_http_scheme() {
        let endpoint = build_grpc_endpoint("127.0.0.1:19333", None).unwrap();
        assert_eq!(endpoint.uri().scheme_str(), Some("http"));
    }
}
seaweed-volume/src/server/grpc_server.rs (new file, 4536 lines)
File diff suppressed because it is too large.
seaweed-volume/src/server/handlers.rs (new file, 3913 lines)
File diff suppressed because it is too large.
seaweed-volume/src/server/heartbeat.rs (new file, 1576 lines)
File diff suppressed because it is too large.
seaweed-volume/src/server/memory_status.rs (new file, 102 lines)
@@ -0,0 +1,102 @@
use crate::pb::volume_server_pb;

pub fn collect_mem_status() -> volume_server_pb::MemStatus {
    #[allow(unused_mut)]
    let mut mem = volume_server_pb::MemStatus {
        goroutines: 1,
        ..Default::default()
    };

    #[cfg(target_os = "linux")]
    {
        if let Some((all, free)) = get_system_memory_linux() {
            mem.all = all;
            mem.free = free;
            mem.used = all.saturating_sub(free);
        }

        if let Some(status) = read_process_status_linux() {
            if status.threads > 0 {
                mem.goroutines = status.threads as i32;
            }
            if let Some(rss) = status.rss {
                mem.self_ = rss;
            }
            if let Some(heap) = status.data.or(status.rss) {
                mem.heap = heap;
            }
            if let Some(stack) = status.stack {
                mem.stack = stack;
            }
        }
    }

    mem
}

#[cfg(target_os = "linux")]
fn get_system_memory_linux() -> Option<(u64, u64)> {
    unsafe {
        let mut info: libc::sysinfo = std::mem::zeroed();
        if libc::sysinfo(&mut info) == 0 {
            let unit = info.mem_unit as u64;
            let total = info.totalram as u64 * unit;
            let free = info.freeram as u64 * unit;
            return Some((total, free));
        }
    }
    None
}

#[cfg(target_os = "linux")]
#[derive(Default)]
struct ProcessStatus {
    threads: u64,
    rss: Option<u64>,
    data: Option<u64>,
    stack: Option<u64>,
}

#[cfg(target_os = "linux")]
fn read_process_status_linux() -> Option<ProcessStatus> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    let mut out = ProcessStatus::default();

    for line in status.lines() {
        if let Some(value) = line.strip_prefix("Threads:") {
            out.threads = value.trim().parse().ok()?;
            continue;
        }
        if let Some(value) = parse_proc_status_kib_field(line, "VmRSS:") {
            out.rss = Some(value);
            continue;
        }
        if let Some(value) = parse_proc_status_kib_field(line, "VmData:") {
            out.data = Some(value);
            continue;
        }
        if let Some(value) = parse_proc_status_kib_field(line, "VmStk:") {
            out.stack = Some(value);
        }
    }

    Some(out)
}

#[cfg(target_os = "linux")]
fn parse_proc_status_kib_field(line: &str, prefix: &str) -> Option<u64> {
    let raw = line.strip_prefix(prefix)?.trim();
    let value = raw.strip_suffix(" kB").unwrap_or(raw).trim();
    value.parse::<u64>().ok().map(|kib| kib * 1024)
}

#[cfg(test)]
mod tests {
    use super::collect_mem_status;

    #[test]
    fn test_collect_mem_status_reports_live_process_state() {
        let mem = collect_mem_status();
        assert!(mem.goroutines > 0);
    }
}
seaweed-volume/src/server/mod.rs (new file, 12 lines)
@@ -0,0 +1,12 @@
pub mod debug;
pub mod grpc_client;
pub mod grpc_server;
pub mod handlers;
pub mod heartbeat;
pub mod memory_status;
pub mod profiling;
pub mod request_id;
pub mod server_stats;
pub mod ui;
pub mod volume_server;
pub mod write_queue;
seaweed-volume/src/server/profiling.rs (new file, 187 lines)
@@ -0,0 +1,187 @@
use std::fs::File;
use std::io::Write;
use std::path::PathBuf;

use pprof::protos::Message;

use crate::config::VolumeServerConfig;

const GO_CPU_PROFILE_FREQUENCY: i32 = 100;
const GO_PPROF_BLOCKLIST: [&str; 4] = ["libc", "libgcc", "pthread", "vdso"];

pub struct CpuProfileSession {
    output_path: PathBuf,
    guard: pprof::ProfilerGuard<'static>,
}

impl CpuProfileSession {
    pub fn start(config: &VolumeServerConfig) -> Result<Option<Self>, String> {
        if config.cpu_profile.is_empty() {
            if !config.mem_profile.is_empty() && !config.pprof {
                tracing::warn!(
                    "--memprofile is not yet supported in the Rust volume server; ignoring '{}'",
                    config.mem_profile
                );
            }
            return Ok(None);
        }

        if config.pprof {
            tracing::info!(
                "--pprof is enabled; ignoring --cpuprofile '{}' and --memprofile '{}'",
                config.cpu_profile,
                config.mem_profile
            );
            return Ok(None);
        }

        if !config.mem_profile.is_empty() {
            tracing::warn!(
                "--memprofile is not yet supported in the Rust volume server; only --cpuprofile '{}' will be written",
                config.cpu_profile
            );
        }

        let guard = pprof::ProfilerGuardBuilder::default()
            .frequency(GO_CPU_PROFILE_FREQUENCY)
            .blocklist(&GO_PPROF_BLOCKLIST)
            .build()
            .map_err(|e| {
                format!(
                    "Failed to start CPU profiler '{}': {}",
                    config.cpu_profile, e
                )
            })?;

        Ok(Some(Self {
            output_path: PathBuf::from(&config.cpu_profile),
            guard,
        }))
    }

    pub fn finish(self) -> Result<(), String> {
        let report = self
            .guard
            .report()
            .build()
            .map_err(|e| format!("Failed to build CPU profile report: {}", e))?;
        let profile = report
            .pprof()
            .map_err(|e| format!("Failed to encode CPU profile report: {}", e))?;

        let mut bytes = Vec::new();
        profile
            .encode(&mut bytes)
            .map_err(|e| format!("Failed to serialize CPU profile report: {}", e))?;

        let mut file = File::create(&self.output_path).map_err(|e| {
            format!(
                "Failed to create CPU profile '{}': {}",
                self.output_path.display(),
                e
            )
        })?;
        file.write_all(&bytes).map_err(|e| {
            format!(
                "Failed to write CPU profile '{}': {}",
                self.output_path.display(),
                e
            )
        })?;
        file.flush().map_err(|e| {
            format!(
                "Failed to flush CPU profile '{}': {}",
                self.output_path.display(),
                e
            )
        })?;

        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use super::CpuProfileSession;
    use crate::config::{NeedleMapKind, ReadMode, VolumeServerConfig};
    use crate::security::tls::TlsPolicy;

    fn sample_config() -> VolumeServerConfig {
        VolumeServerConfig {
            port: 8080,
            grpc_port: 18080,
            public_port: 8080,
            ip: "127.0.0.1".to_string(),
            bind_ip: "127.0.0.1".to_string(),
            public_url: "127.0.0.1:8080".to_string(),
            id: "127.0.0.1:8080".to_string(),
            masters: vec![],
            pre_stop_seconds: 0,
            idle_timeout: 0,
            data_center: String::new(),
            rack: String::new(),
            index_type: NeedleMapKind::InMemory,
            disk_type: String::new(),
            folders: vec!["/tmp".to_string()],
            folder_max_limits: vec![8],
            folder_tags: vec![vec![]],
            min_free_spaces: vec![],
            disk_types: vec![String::new()],
            idx_folder: String::new(),
            white_list: vec![],
            fix_jpg_orientation: false,
            read_mode: ReadMode::Local,
            cpu_profile: String::new(),
            mem_profile: String::new(),
            compaction_byte_per_second: 0,
            maintenance_byte_per_second: 0,
            file_size_limit_bytes: 0,
            concurrent_upload_limit: 0,
            concurrent_download_limit: 0,
            inflight_upload_data_timeout: std::time::Duration::from_secs(0),
            inflight_download_data_timeout: std::time::Duration::from_secs(0),
            has_slow_read: false,
            read_buffer_size_mb: 4,
            ldb_timeout: 0,
            pprof: false,
            metrics_port: 0,
            metrics_ip: String::new(),
            debug: false,
            debug_port: 0,
            ui_enabled: false,
            jwt_signing_key: vec![],
            jwt_signing_expires_seconds: 0,
            jwt_read_signing_key: vec![],
            jwt_read_signing_expires_seconds: 0,
            https_cert_file: String::new(),
            https_key_file: String::new(),
            https_ca_file: String::new(),
            https_client_enabled: false,
            https_client_cert_file: String::new(),
            https_client_key_file: String::new(),
            https_client_ca_file: String::new(),
            grpc_cert_file: String::new(),
            grpc_key_file: String::new(),
            grpc_ca_file: String::new(),
            grpc_allowed_wildcard_domain: String::new(),
            grpc_volume_allowed_common_names: vec![],
            tls_policy: TlsPolicy::default(),
            enable_write_queue: false,
            security_file: String::new(),
        }
    }

    #[test]
    fn test_cpu_profile_session_skips_when_disabled() {
        let config = sample_config();
        assert!(CpuProfileSession::start(&config).unwrap().is_none());
    }

    #[test]
    fn test_cpu_profile_session_skips_when_pprof_enabled() {
        let mut config = sample_config();
        config.cpu_profile = "/tmp/cpu.pb".to_string();
        config.pprof = true;
        assert!(CpuProfileSession::start(&config).unwrap().is_none());
    }
}
seaweed-volume/src/server/request_id.rs (new file, 137 lines)
@@ -0,0 +1,137 @@
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use hyper::http::{self, HeaderValue};
use tonic::metadata::MetadataValue;
use tonic::{Request, Status};
use tower::{Layer, Service};

tokio::task_local! {
    static CURRENT_REQUEST_ID: String;
}

#[derive(Clone, Debug, Default)]
pub struct GrpcRequestIdLayer;

#[derive(Clone, Debug)]
pub struct GrpcRequestIdService<S> {
    inner: S,
}

impl<S> Layer<S> for GrpcRequestIdLayer {
    type Service = GrpcRequestIdService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        GrpcRequestIdService { inner }
    }
}

impl<S, B> Service<http::Request<B>> for GrpcRequestIdService<S>
where
    S: Service<http::Request<B>, Response = http::Response<tonic::body::BoxBody>> + Send + 'static,
    S::Future: Send + 'static,
    B: Send + 'static,
{
    type Response = http::Response<tonic::body::BoxBody>;
    type Error = S::Error;
    type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>> + Send>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, mut request: http::Request<B>) -> Self::Future {
        let request_id = match request.headers().get("x-amz-request-id") {
            Some(value) => match value.to_str() {
                Ok(value) if !value.is_empty() => value.to_owned(),
                _ => generate_grpc_request_id(),
            },
            None => generate_grpc_request_id(),
        };

        if let Ok(value) = HeaderValue::from_str(&request_id) {
            request.headers_mut().insert("x-amz-request-id", value);
        }

        let future = self.inner.call(request);

        Box::pin(async move {
            let mut response: http::Response<tonic::body::BoxBody> =
                scope_request_id(request_id.clone(), future).await?;
            if let Ok(value) = HeaderValue::from_str(&request_id) {
                response.headers_mut().insert("x-amz-request-id", value);
            }
            Ok(response)
        })
    }
}

pub async fn scope_request_id<F, T>(request_id: String, future: F) -> T
where
    F: Future<Output = T>,
{
    CURRENT_REQUEST_ID.scope(request_id, future).await
}

pub fn current_request_id() -> Option<String> {
    CURRENT_REQUEST_ID.try_with(Clone::clone).ok()
}

pub fn outgoing_request_id_interceptor(mut request: Request<()>) -> Result<Request<()>, Status> {
    if let Some(request_id) = current_request_id() {
        let value = MetadataValue::try_from(request_id.as_str())
            .map_err(|_| Status::internal("invalid scoped request id"))?;
        request.metadata_mut().insert("x-amz-request-id", value);
    }
    Ok(request)
}

pub fn generate_http_request_id() -> String {
    use rand::Rng;

    let nanos = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap_or_default()
        .as_nanos() as u64;
    let rand_val: u32 = rand::thread_rng().gen();
    format!("{:X}{:08X}", nanos, rand_val)
}

fn generate_grpc_request_id() -> String {
    uuid::Uuid::new_v4().to_string()
}

#[cfg(test)]
mod tests {
    use super::{current_request_id, outgoing_request_id_interceptor, scope_request_id};
    use tonic::Request;

    #[tokio::test]
    async fn test_scope_request_id_exposes_current_value() {
        let request_id = "req-123".to_string();
        let current = scope_request_id(
            request_id.clone(),
            async move { current_request_id().unwrap() },
        )
        .await;
        assert_eq!(current, request_id);
    }

    #[tokio::test]
    async fn test_outgoing_request_id_interceptor_propagates_scope() {
        let request = scope_request_id("req-456".to_string(), async move {
            outgoing_request_id_interceptor(Request::new(())).unwrap()
        })
        .await;
        assert_eq!(
            request
                .metadata()
                .get("x-amz-request-id")
                .unwrap()
                .to_str()
                .unwrap(),
            "req-456"
        );
    }
}
seaweed-volume/src/server/server_stats.rs (new file, 248 lines)
@@ -0,0 +1,248 @@
|
||||
use chrono::{Datelike, Local, Timelike};
|
||||
use serde::Serialize;
|
||||
use std::sync::{LazyLock, Mutex};
|
||||
use std::time::Instant;
|
||||
|
||||
static START_TIME: LazyLock<Instant> = LazyLock::new(Instant::now);
|
||||
static SERVER_STATS: LazyLock<ServerStats> = LazyLock::new(ServerStats::default);
|
||||
|
||||
#[derive(Default)]
|
||||
pub struct ServerStats {
|
||||
inner: Mutex<ServerStatsInner>,
|
||||
}
|
||||
|
||||
#[derive(Default)]
|
||||
struct ServerStatsInner {
|
||||
requests: DurationCounter,
|
||||
connections: DurationCounter,
|
||||
assign_requests: DurationCounter,
|
||||
read_requests: DurationCounter,
|
||||
write_requests: DurationCounter,
|
||||
delete_requests: DurationCounter,
|
||||
bytes_in: DurationCounter,
|
||||
bytes_out: DurationCounter,
|
||||
}
|
||||
|
||||
#[derive(Clone, Serialize)]
|
||||
#[serde(rename_all = "PascalCase")]
|
||||
pub struct ServerStatsSnapshot {
|
||||
pub requests: DurationCounterSnapshot,
|
||||
pub connections: DurationCounterSnapshot,
|
||||
pub assign_requests: DurationCounterSnapshot,
|
||||
pub read_requests: DurationCounterSnapshot,
|
||||
pub write_requests: DurationCounterSnapshot,
|
||||
pub delete_requests: DurationCounterSnapshot,
|
||||
pub bytes_in: DurationCounterSnapshot,
|
||||
pub bytes_out: DurationCounterSnapshot,
|
||||
}
|
||||
|
||||
#[derive(Clone, Serialize)]
|
||||
#[serde(rename_all = "PascalCase")]
|
||||
pub struct DurationCounterSnapshot {
|
||||
pub minute_counter: RoundRobinCounterSnapshot,
|
||||
pub hour_counter: RoundRobinCounterSnapshot,
|
||||
pub day_counter: RoundRobinCounterSnapshot,
|
||||
pub week_counter: RoundRobinCounterSnapshot,
|
||||
}
|
||||
|
||||
#[derive(Clone, Serialize)]
|
||||
#[serde(rename_all = "PascalCase")]
|
||||
pub struct RoundRobinCounterSnapshot {
|
||||
pub last_index: i32,
|
||||
pub values: Vec<i64>,
|
||||
pub counts: Vec<i64>,
|
||||
}
|
||||
|
||||
#[derive(Clone)]
|
||||
struct DurationCounter {
|
||||
minute_counter: RoundRobinCounter,
|
||||
hour_counter: RoundRobinCounter,
|
||||
day_counter: RoundRobinCounter,
|
||||
week_counter: RoundRobinCounter,
|
||||
}
|
||||
|
||||
#[derive(Clone)]
|
||||
struct RoundRobinCounter {
|
||||
last_index: i32,
|
||||
values: Vec<i64>,
|
||||
counts: Vec<i64>,
|
||||
}
|
||||
|
||||
impl Default for DurationCounter {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
minute_counter: RoundRobinCounter::new(60),
|
||||
hour_counter: RoundRobinCounter::new(60),
|
||||
day_counter: RoundRobinCounter::new(24),
|
||||
week_counter: RoundRobinCounter::new(7),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl RoundRobinCounter {
|
||||
fn new(slots: usize) -> Self {
|
||||
Self {
|
||||
last_index: -1,
|
||||
values: vec![0; slots],
|
||||
counts: vec![0; slots],
|
||||
}
|
||||
}
|
||||
|
||||
fn add(&mut self, index: usize, val: i64) {
|
||||
if index >= self.values.len() {
|
||||
return;
|
||||
}
|
||||
while self.last_index != index as i32 {
|
||||
self.last_index = (self.last_index + 1).rem_euclid(self.values.len() as i32);
|
||||
self.values[self.last_index as usize] = 0;
|
||||
self.counts[self.last_index as usize] = 0;
|
||||
}
|
||||
self.values[index] += val;
|
||||
self.counts[index] += 1;
|
||||
}
|
||||
|
||||
fn snapshot(&self) -> RoundRobinCounterSnapshot {
|
||||
RoundRobinCounterSnapshot {
|
||||
last_index: self.last_index,
            values: self.values.clone(),
            counts: self.counts.clone(),
        }
    }
}

impl DurationCounter {
    fn add_now(&mut self, val: i64) {
        let now = Local::now();
        self.minute_counter.add(now.second() as usize, val);
        self.hour_counter.add(now.minute() as usize, val);
        self.day_counter.add(now.hour() as usize, val);
        self.week_counter
            .add(now.weekday().num_days_from_sunday() as usize, val);
    }

    fn snapshot(&self) -> DurationCounterSnapshot {
        DurationCounterSnapshot {
            minute_counter: self.minute_counter.snapshot(),
            hour_counter: self.hour_counter.snapshot(),
            day_counter: self.day_counter.snapshot(),
            week_counter: self.week_counter.snapshot(),
        }
    }
}

impl ServerStatsInner {
    fn snapshot(&self) -> ServerStatsSnapshot {
        ServerStatsSnapshot {
            requests: self.requests.snapshot(),
            connections: self.connections.snapshot(),
            assign_requests: self.assign_requests.snapshot(),
            read_requests: self.read_requests.snapshot(),
            write_requests: self.write_requests.snapshot(),
            delete_requests: self.delete_requests.snapshot(),
            bytes_in: self.bytes_in.snapshot(),
            bytes_out: self.bytes_out.snapshot(),
        }
    }
}

impl ServerStats {
    fn update<F>(&self, update: F)
    where
        F: FnOnce(&mut ServerStatsInner),
    {
        let mut inner = self.inner.lock().unwrap();
        update(&mut inner);
    }

    fn snapshot(&self) -> ServerStatsSnapshot {
        self.inner.lock().unwrap().snapshot()
    }
}

impl RoundRobinCounterSnapshot {
    pub fn to_list(&self) -> Vec<i64> {
        if self.values.is_empty() {
            return Vec::new();
        }
        let mut ret = Vec::with_capacity(self.values.len());
        let mut index = self.last_index;
        let mut step = self.values.len();
        while step > 0 {
            step -= 1;
            index += 1;
            if index >= self.values.len() as i32 {
                index = 0;
            }
            ret.push(self.values[index as usize]);
        }
        ret
    }
}
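The `to_list` walk above starts one slot past `last_index` and wraps around, so sparkline data comes out ordered oldest to newest. A minimal standalone sketch of the same traversal (the `ring_to_list` helper and its inputs are illustrative, not part of the codebase):

```rust
// Return ring-buffer contents oldest-first, starting one slot past
// `last_index` and wrapping, mirroring the `to_list` walk above.
fn ring_to_list(values: &[i64], last_index: i32) -> Vec<i64> {
    if values.is_empty() {
        return Vec::new();
    }
    let mut ret = Vec::with_capacity(values.len());
    let mut index = last_index;
    for _ in 0..values.len() {
        index += 1;
        if index >= values.len() as i32 {
            index = 0;
        }
        ret.push(values[index as usize]);
    }
    ret
}

fn main() {
    // Slot 2 holds the newest value, so output starts at slot 3 and wraps.
    assert_eq!(ring_to_list(&[10, 20, 30, 40], 2), vec![40, 10, 20, 30]);
    assert_eq!(ring_to_list(&[], 0), Vec::<i64>::new());
}
```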

pub fn init_process_start() {
    LazyLock::force(&START_TIME);
    LazyLock::force(&SERVER_STATS);
}

pub fn uptime_string() -> String {
    let secs = START_TIME.elapsed().as_secs();
    let hours = secs / 3600;
    let minutes = (secs % 3600) / 60;
    let seconds = secs % 60;
    let mut out = String::new();
    if hours > 0 {
        out.push_str(&format!("{}h", hours));
    }
    if hours > 0 || minutes > 0 {
        out.push_str(&format!("{}m", minutes));
    }
    out.push_str(&format!("{}s", seconds));
    out
}

pub fn snapshot() -> ServerStatsSnapshot {
    SERVER_STATS.snapshot()
}

pub fn record_request_open() {
    SERVER_STATS.update(|inner| inner.requests.add_now(1));
}

pub fn record_request_close() {
    SERVER_STATS.update(|inner| inner.requests.add_now(-1));
}

pub fn record_connection_open() {
    SERVER_STATS.update(|inner| inner.connections.add_now(1));
}

pub fn record_connection_close() {
    SERVER_STATS.update(|inner| inner.connections.add_now(-1));
}

pub fn record_read_request() {
    SERVER_STATS.update(|inner| inner.read_requests.add_now(1));
}

pub fn record_write_request() {
    SERVER_STATS.update(|inner| inner.write_requests.add_now(1));
}

pub fn record_delete_request() {
    SERVER_STATS.update(|inner| inner.delete_requests.add_now(1));
}

pub fn record_bytes_in(bytes: i64) {
    SERVER_STATS.update(|inner| inner.bytes_in.add_now(bytes));
}

pub fn record_bytes_out(bytes: i64) {
    SERVER_STATS.update(|inner| inner.bytes_out.add_now(bytes));
}

#[cfg(test)]
pub fn reset_for_tests() {
    LazyLock::force(&START_TIME);
    let mut inner = SERVER_STATS.inner.lock().unwrap();
    *inner = ServerStatsInner::default();
}
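`uptime_string` suppresses leading zero units: hours appear only when nonzero, and minutes appear once hours or minutes are nonzero, so a short uptime renders as `5s` rather than `0h0m5s`. A standalone sketch of the same rule over a raw seconds count (`format_uptime` is an illustrative name):

```rust
// Format elapsed seconds as "{h}h{m}m{s}s", omitting leading zero units
// the same way `uptime_string` does.
fn format_uptime(secs: u64) -> String {
    let hours = secs / 3600;
    let minutes = (secs % 3600) / 60;
    let seconds = secs % 60;
    let mut out = String::new();
    if hours > 0 {
        out.push_str(&format!("{}h", hours));
    }
    if hours > 0 || minutes > 0 {
        out.push_str(&format!("{}m", minutes));
    }
    out.push_str(&format!("{}s", seconds));
    out
}

fn main() {
    assert_eq!(format_uptime(5), "5s");
    assert_eq!(format_uptime(65), "1m5s");
    // Interior zeros are kept once a larger unit is shown.
    assert_eq!(format_uptime(3605), "1h0m5s");
}
```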

seaweed-volume/src/server/ui.rs (new file, 507 lines)
@@ -0,0 +1,507 @@
use std::fmt::Write as _;

use crate::server::server_stats;
use crate::server::volume_server::VolumeServerState;
use crate::storage::store::Store;

pub struct EmbeddedAsset {
    pub content_type: &'static str,
    pub bytes: &'static [u8],
}

struct UiDiskRow {
    dir: String,
    disk_type: String,
    all: u64,
    free: u64,
    used: u64,
}

struct UiVolumeRow {
    id: u32,
    collection: String,
    disk_type: String,
    size: u64,
    file_count: i64,
    delete_count: i64,
    deleted_byte_count: u64,
    ttl: String,
    read_only: bool,
    version: u32,
    remote_storage_name: String,
    remote_storage_key: String,
}

struct UiEcShardRow {
    shard_id: u8,
    size: u64,
}

struct UiEcVolumeRow {
    volume_id: u32,
    collection: String,
    size: u64,
    shards: Vec<UiEcShardRow>,
    created_at: String,
}

pub fn favicon_asset() -> EmbeddedAsset {
    EmbeddedAsset {
        content_type: "image/x-icon",
        bytes: include_bytes!(concat!(
            env!("CARGO_MANIFEST_DIR"),
            "/../weed/static/favicon.ico"
        )),
    }
}

pub fn lookup_static_asset(path: &str) -> Option<EmbeddedAsset> {
    let path = path.trim_start_matches('/');
    let asset = match path {
        "bootstrap/3.3.1/css/bootstrap.min.css" => EmbeddedAsset {
            content_type: "text/css; charset=utf-8",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/css/bootstrap.min.css"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.eot" => EmbeddedAsset {
            content_type: "application/vnd.ms-fontobject",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.eot"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.svg" => EmbeddedAsset {
            content_type: "image/svg+xml",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.svg"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.ttf" => EmbeddedAsset {
            content_type: "font/ttf",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.ttf"
            )),
        },
        "bootstrap/3.3.1/fonts/glyphicons-halflings-regular.woff" => EmbeddedAsset {
            content_type: "font/woff",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/bootstrap/3.3.1/fonts/glyphicons-halflings-regular.woff"
            )),
        },
        "images/folder.gif" => EmbeddedAsset {
            content_type: "image/gif",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/images/folder.gif"
            )),
        },
        "javascript/jquery-3.6.0.min.js" => EmbeddedAsset {
            content_type: "application/javascript; charset=utf-8",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/javascript/jquery-3.6.0.min.js"
            )),
        },
        "javascript/jquery-sparklines/2.1.2/jquery.sparkline.min.js" => EmbeddedAsset {
            content_type: "application/javascript; charset=utf-8",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/javascript/jquery-sparklines/2.1.2/jquery.sparkline.min.js"
            )),
        },
        "seaweed50x50.png" => EmbeddedAsset {
            content_type: "image/png",
            bytes: include_bytes!(concat!(
                env!("CARGO_MANIFEST_DIR"),
                "/../weed/static/seaweed50x50.png"
            )),
        },
        _ => return None,
    };
    Some(asset)
}

pub fn render_volume_server_html(state: &VolumeServerState) -> String {
    let counters = server_stats::snapshot();
    let (disk_rows, volume_rows, remote_volume_rows, ec_volume_rows) = {
        let store = state.store.read().unwrap();
        collect_ui_data(&store)
    };

    let masters = if state.master_urls.is_empty() {
        "[]".to_string()
    } else {
        format!("[{}]", state.master_urls.join(" "))
    };
    let uptime = server_stats::uptime_string();
    let read_week = join_i64(&counters.read_requests.week_counter.to_list());
    let read_day = join_i64(&counters.read_requests.day_counter.to_list());
    let read_hour = join_i64(&counters.read_requests.hour_counter.to_list());
    let read_minute = join_i64(&counters.read_requests.minute_counter.to_list());

    let mut disk_rows_html = String::new();
    for disk in &disk_rows {
        let _ = write!(
            disk_rows_html,
            "<tr><td>{}</td><td>{}</td><td>{}</td><td>{}</td><td>{:.2}%</td></tr>",
            escape_html(&disk.dir),
            escape_html(&disk.disk_type),
            bytes_to_human_readable(disk.all),
            bytes_to_human_readable(disk.free),
            percent_from(disk.all, disk.used),
        );
    }

    let mut volume_rows_html = String::new();
    for volume in &volume_rows {
        let _ = write!(
            volume_rows_html,
            "<tr><td><code>{}</code></td><td>{}</td><td>{}</td><td>{}</td><td>{}</td><td>{} / {}</td><td>{}</td><td>{}</td><td>{}</td></tr>",
            volume.id,
            escape_html(&volume.collection),
            escape_html(&volume.disk_type),
            bytes_to_human_readable(volume.size),
            volume.file_count,
            volume.delete_count,
            bytes_to_human_readable(volume.deleted_byte_count),
            escape_html(&volume.ttl),
            volume.read_only,
            volume.version,
        );
    }

    let remote_section = if remote_volume_rows.is_empty() {
        String::new()
    } else {
        let mut remote_rows_html = String::new();
        for volume in &remote_volume_rows {
            let _ = write!(
                remote_rows_html,
                "<tr><td><code>{}</code></td><td>{}</td><td>{}</td><td>{}</td><td>{} / {}</td><td>{}</td><td>{}</td></tr>",
                volume.id,
                escape_html(&volume.collection),
                bytes_to_human_readable(volume.size),
                volume.file_count,
                volume.delete_count,
                bytes_to_human_readable(volume.deleted_byte_count),
                escape_html(&volume.remote_storage_name),
                escape_html(&volume.remote_storage_key),
            );
        }
        format!(
            r#"<div class="row">
  <h2>Remote Volumes</h2>
  <table class="table table-striped">
    <thead>
      <tr>
        <th>Id</th>
        <th>Collection</th>
        <th>Size</th>
        <th>Files</th>
        <th>Trash</th>
        <th>Remote</th>
        <th>Key</th>
      </tr>
    </thead>
    <tbody>{}</tbody>
  </table>
</div>"#,
            remote_rows_html
        )
    };

    let ec_section = if ec_volume_rows.is_empty() {
        String::new()
    } else {
        let mut ec_rows_html = String::new();
        for ec in &ec_volume_rows {
            let mut shard_labels = String::new();
            for shard in &ec.shards {
                let _ = write!(
                    shard_labels,
                    "<span class=\"label label-info\" style=\"margin-right: 5px;\">{}: {}</span>",
                    shard.shard_id,
                    bytes_to_human_readable(shard.size)
                );
            }
            let _ = write!(
                ec_rows_html,
                "<tr><td><code>{}</code></td><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>",
                ec.volume_id,
                escape_html(&ec.collection),
                bytes_to_human_readable(ec.size),
                shard_labels,
                escape_html(&ec.created_at),
            );
        }
        format!(
            r#"<div class="row">
  <h2>Erasure Coding Shards</h2>
  <table class="table table-striped">
    <thead>
      <tr>
        <th>Id</th>
        <th>Collection</th>
        <th>Total Size</th>
        <th>Shard Details</th>
        <th>CreatedAt</th>
      </tr>
    </thead>
    <tbody>{}</tbody>
  </table>
</div>"#,
            ec_rows_html
        )
    };

    format!(
        r#"<!DOCTYPE html>
<html>
<head>
  <title>SeaweedFS {version}</title>
  <link rel="stylesheet" href="/seaweedfsstatic/bootstrap/3.3.1/css/bootstrap.min.css">
  <script type="text/javascript" src="/seaweedfsstatic/javascript/jquery-3.6.0.min.js"></script>
  <script type="text/javascript" src="/seaweedfsstatic/javascript/jquery-sparklines/2.1.2/jquery.sparkline.min.js"></script>
  <script type="text/javascript">
    $(function () {{
      var periods = ['second', 'minute', 'hour', 'day'];
      for (var i = 0; i < periods.length; i++) {{
        var period = periods[i];
        $('.inlinesparkline-' + period).sparkline('html', {{
          type: 'line',
          barColor: 'red',
          tooltipSuffix: ' request per ' + period
        }});
      }}
    }});
  </script>
  <style>
    #jqstooltip {{
      height: 28px !important;
      width: 150px !important;
    }}
  </style>
</head>
<body>
  <div class="container">
    <div class="page-header">
      <h1>
        <a href="https://github.com/seaweedfs/seaweedfs"><img src="/seaweedfsstatic/seaweed50x50.png" alt="SeaweedFS"></a>
        SeaweedFS <small>{version}</small>
      </h1>
    </div>

    <div class="row">
      <div class="col-sm-6">
        <h2>Disk Stats</h2>
        <table class="table table-striped">
          <thead>
            <tr>
              <th>Path</th>
              <th>Disk</th>
              <th>Total</th>
              <th>Free</th>
              <th>Usage</th>
            </tr>
          </thead>
          <tbody>{disk_rows_html}</tbody>
        </table>
      </div>

      <div class="col-sm-6">
        <h2>System Stats</h2>
        <table class="table table-condensed table-striped">
          <tr><th>Masters</th><td>{masters}</td></tr>
          <tr><th>Weekly # ReadRequests</th><td><span class="inlinesparkline-day">{read_week}</span></td></tr>
          <tr><th>Daily # ReadRequests</th><td><span class="inlinesparkline-hour">{read_day}</span></td></tr>
          <tr><th>Hourly # ReadRequests</th><td><span class="inlinesparkline-minute">{read_hour}</span></td></tr>
          <tr><th>Last Minute # ReadRequests</th><td><span class="inlinesparkline-second">{read_minute}</span></td></tr>
          <tr><th>Up Time</th><td>{uptime}</td></tr>
        </table>
      </div>
    </div>

    <div class="row">
      <h2>Volumes</h2>
      <table class="table table-striped">
        <thead>
          <tr>
            <th>Id</th>
            <th>Collection</th>
            <th>Disk</th>
            <th>Data Size</th>
            <th>Files</th>
            <th>Trash</th>
            <th>TTL</th>
            <th>ReadOnly</th>
            <th>Version</th>
          </tr>
        </thead>
        <tbody>{volume_rows_html}</tbody>
      </table>
    </div>

    {remote_section}
    {ec_section}
  </div>
</body>
</html>"#,
        version = escape_html(crate::version::version()),
        disk_rows_html = disk_rows_html,
        masters = escape_html(&masters),
        read_week = read_week,
        read_day = read_day,
        read_hour = read_hour,
        read_minute = read_minute,
        uptime = escape_html(&uptime),
        volume_rows_html = volume_rows_html,
        remote_section = remote_section,
        ec_section = ec_section,
    )
}

fn collect_ui_data(
    store: &Store,
) -> (
    Vec<UiDiskRow>,
    Vec<UiVolumeRow>,
    Vec<UiVolumeRow>,
    Vec<UiEcVolumeRow>,
) {
    let mut disk_rows = Vec::new();
    let mut volumes = Vec::new();
    let mut remote_volumes = Vec::new();
    let mut ec_volumes = Vec::new();

    for loc in &store.locations {
        let dir = absolute_display_path(&loc.directory);
        let (all, free) = crate::storage::disk_location::get_disk_stats(&dir);
        disk_rows.push(UiDiskRow {
            dir,
            disk_type: loc.disk_type.to_string(),
            all,
            free,
            used: all.saturating_sub(free),
        });

        for (_, volume) in loc.volumes() {
            let (remote_storage_name, remote_storage_key) = volume.remote_storage_name_key();
            let row = UiVolumeRow {
                id: volume.id.0,
                collection: volume.collection.clone(),
                disk_type: loc.disk_type.to_string(),
                size: volume.content_size(),
                file_count: volume.file_count(),
                delete_count: volume.deleted_count(),
                deleted_byte_count: volume.deleted_size(),
                ttl: volume.super_block.ttl.to_string(),
                read_only: volume.is_read_only(),
                version: volume.version().0 as u32,
                remote_storage_name,
                remote_storage_key,
            };
            if row.remote_storage_name.is_empty() {
                volumes.push(row);
            } else {
                remote_volumes.push(row);
            }
        }

        for (_, ec_volume) in loc.ec_volumes() {
            let mut shards = Vec::new();
            let mut total_size = 0u64;
            let mut created_at = String::from("-");
            for shard in ec_volume.shards.iter().flatten() {
                let shard_size = shard.file_size().max(0) as u64;
                total_size = total_size.saturating_add(shard_size);
                shards.push(UiEcShardRow {
                    shard_id: shard.shard_id,
                    size: shard_size,
                });
                if created_at == "-" {
                    if let Ok(metadata) = std::fs::metadata(shard.file_name()) {
                        if let Ok(modified) = metadata.modified() {
                            let ts: chrono::DateTime<chrono::Local> = modified.into();
                            created_at = ts.format("%Y-%m-%d %H:%M").to_string();
                        }
                    }
                }
            }
            let preferred_size = ec_volume.dat_file_size.max(0) as u64;
            ec_volumes.push(UiEcVolumeRow {
                volume_id: ec_volume.volume_id.0,
                collection: ec_volume.collection.clone(),
                size: preferred_size.max(total_size),
                shards,
                created_at,
            });
        }
    }

    disk_rows.sort_by(|left, right| left.dir.cmp(&right.dir));
    volumes.sort_by_key(|row| row.id);
    remote_volumes.sort_by_key(|row| row.id);
    ec_volumes.sort_by_key(|row| row.volume_id);

    (disk_rows, volumes, remote_volumes, ec_volumes)
}

fn absolute_display_path(path: &str) -> String {
    let p = std::path::Path::new(path);
    if p.is_absolute() {
        return path.to_string();
    }
    std::env::current_dir()
        .map(|cwd| cwd.join(p).to_string_lossy().to_string())
        .unwrap_or_else(|_| path.to_string())
}

fn join_i64(values: &[i64]) -> String {
    values
        .iter()
        .map(std::string::ToString::to_string)
        .collect::<Vec<_>>()
        .join(",")
}

fn percent_from(total: u64, part: u64) -> f64 {
    if total == 0 {
        return 0.0;
    }
    (part as f64 / total as f64) * 100.0
}

fn bytes_to_human_readable(bytes: u64) -> String {
    const UNIT: u64 = 1024;
    if bytes < UNIT {
        return format!("{} B", bytes);
    }

    let mut div = UNIT;
    let mut exp = 0usize;
    let mut n = bytes / UNIT;
    while n >= UNIT {
        div *= UNIT;
        n /= UNIT;
        exp += 1;
    }

    format!(
        "{:.2} {}iB",
        bytes as f64 / div as f64,
        ["K", "M", "G", "T", "P", "E"][exp]
    )
}
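`bytes_to_human_readable` divides by 1024 until the mantissa falls below one unit, then prints two decimals with a binary suffix. A standalone copy of the logic with a few spot checks (`human_readable` is an illustrative name):

```rust
// Binary-prefixed size formatting, mirroring `bytes_to_human_readable`.
fn human_readable(bytes: u64) -> String {
    const UNIT: u64 = 1024;
    if bytes < UNIT {
        return format!("{} B", bytes);
    }
    let mut div = UNIT;
    let mut exp = 0usize;
    let mut n = bytes / UNIT;
    while n >= UNIT {
        div *= UNIT;
        n /= UNIT;
        exp += 1;
    }
    format!(
        "{:.2} {}iB",
        bytes as f64 / div as f64,
        ["K", "M", "G", "T", "P", "E"][exp]
    )
}

fn main() {
    assert_eq!(human_readable(512), "512 B");
    assert_eq!(human_readable(1536), "1.50 KiB");
    assert_eq!(human_readable(1_048_576), "1.00 MiB");
}
```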

fn escape_html(input: &str) -> String {
    input
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}
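Replacing `&` first matters: if `&` were escaped after the angle brackets, the ampersands inside the freshly produced `&lt;`/`&gt;` entities would themselves be escaped again. A quick standalone check of the ordering (`escape` is an illustrative name):

```rust
// Minimal HTML escaping; '&' must be replaced first so the entities
// produced by the later replacements survive intact.
fn escape(input: &str) -> String {
    input
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

fn main() {
    assert_eq!(escape("a&b"), "a&amp;b");
    assert_eq!(escape("<b>"), "&lt;b&gt;");
    assert_eq!(escape("say \"hi\""), "say &quot;hi&quot;");
}
```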

seaweed-volume/src/server/volume_server.rs (new file, 394 lines)
@@ -0,0 +1,394 @@
//! VolumeServer: the main HTTP server for volume operations.
//!
//! Routes:
//! GET/HEAD /{vid},{fid} — read a file
//! POST/PUT /{vid},{fid} — write a file
//! DELETE /{vid},{fid} — delete a file
//! GET /status — server status
//! GET /healthz — health check
//!
//! Matches Go's server/volume_server.go.

use std::net::SocketAddr;
use std::sync::atomic::{AtomicBool, AtomicI64, AtomicU32, Ordering};
use std::sync::{Arc, RwLock};

use axum::{
    extract::{connect_info::ConnectInfo, Request, State},
    http::{header, HeaderValue, Method, StatusCode},
    middleware::{self, Next},
    response::{IntoResponse, Response},
    routing::{any, get},
    Router,
};

use crate::config::ReadMode;
use crate::security::Guard;
use crate::storage::store::Store;

use super::grpc_client::OutgoingGrpcTlsConfig;
use super::handlers;
use super::write_queue::WriteQueue;

#[derive(Clone, Debug, Default)]
pub struct RuntimeMetricsConfig {
    pub push_gateway: crate::metrics::PushGatewayConfig,
}

/// Shared state for the volume server.
pub struct VolumeServerState {
    pub store: RwLock<Store>,
    pub guard: RwLock<Guard>,
    pub is_stopping: RwLock<bool>,
    /// Maintenance mode flag.
    pub maintenance: AtomicBool,
    /// State version — incremented on each SetState call.
    pub state_version: AtomicU32,
    /// Throttling: concurrent upload/download limits (in bytes, 0 = disabled).
    pub concurrent_upload_limit: i64,
    pub concurrent_download_limit: i64,
    pub inflight_upload_data_timeout: std::time::Duration,
    pub inflight_download_data_timeout: std::time::Duration,
    /// Current in-flight upload/download bytes.
    pub inflight_upload_bytes: AtomicI64,
    pub inflight_download_bytes: AtomicI64,
    /// Notify waiters when inflight bytes decrease.
    pub upload_notify: tokio::sync::Notify,
    pub download_notify: tokio::sync::Notify,
    /// Data center name from config.
    pub data_center: String,
    /// Rack name from config.
    pub rack: String,
    /// File size limit in bytes (0 = no limit).
    pub file_size_limit_bytes: i64,
    /// Default IO rate limit for maintenance copy/replication work.
    pub maintenance_byte_per_second: i64,
    /// Whether the server is connected to master (heartbeat active).
    pub is_heartbeating: AtomicBool,
    /// Whether master addresses are configured.
    pub has_master: bool,
    /// Seconds to wait before shutting down servers (graceful drain).
    pub pre_stop_seconds: u32,
    /// Notify heartbeat to send an immediate update when volume state changes.
    pub volume_state_notify: tokio::sync::Notify,
    /// Optional batched write queue for improved throughput under load.
    pub write_queue: std::sync::OnceLock<WriteQueue>,
    /// Registry of S3 tier backends for tiered storage operations.
    pub s3_tier_registry: std::sync::RwLock<crate::remote_storage::s3_tier::S3TierRegistry>,
    /// Read mode: local, proxy, or redirect for non-local volumes.
    pub read_mode: ReadMode,
    /// First master address for volume lookups (e.g., "localhost:9333").
    pub master_url: String,
    /// Seed master addresses for UI rendering.
    pub master_urls: Vec<String>,
    /// This server's own address (ip:port) for filtering self from lookup results.
    pub self_url: String,
    /// HTTP client for proxy requests and master lookups.
    pub http_client: reqwest::Client,
    /// Scheme used for outgoing master and peer HTTP requests ("http" or "https").
    pub outgoing_http_scheme: String,
    /// Optional client TLS material for outgoing gRPC connections.
    pub outgoing_grpc_tls: Option<OutgoingGrpcTlsConfig>,
    /// Metrics push settings learned from master heartbeat responses.
    pub metrics_runtime: std::sync::RwLock<RuntimeMetricsConfig>,
    pub metrics_notify: tokio::sync::Notify,
    /// Whether JPEG uploads should be normalized using EXIF orientation.
    pub fix_jpg_orientation: bool,
    /// Read tuning flags for large-file streaming.
    pub has_slow_read: bool,
    pub read_buffer_size_bytes: usize,
    /// Path to security.toml — stored for SIGHUP reload.
    pub security_file: String,
    /// Original CLI whitelist entries — stored for SIGHUP reload.
    pub cli_white_list: Vec<String>,
    /// Path to state.pb file for persisting VolumeServerState across restarts.
    pub state_file_path: String,
}

impl VolumeServerState {
    /// Check if the server is in maintenance mode; return gRPC error if so.
    pub fn check_maintenance(&self) -> Result<(), tonic::Status> {
        if self.maintenance.load(Ordering::Relaxed) {
            let id = self.store.read().unwrap().id.clone();
            return Err(tonic::Status::unavailable(format!(
                "volume server {} is in maintenance mode",
                id
            )));
        }
        Ok(())
    }
}

pub fn build_metrics_router() -> Router {
    Router::new().route("/metrics", get(handlers::metrics_handler))
}

pub fn normalize_outgoing_http_url(scheme: &str, raw_target: &str) -> Result<String, String> {
    if raw_target.starts_with("http://") || raw_target.starts_with("https://") {
        let mut url = reqwest::Url::parse(raw_target)
            .map_err(|e| format!("invalid url {}: {}", raw_target, e))?;
        url.set_scheme(scheme)
            .map_err(|_| format!("invalid scheme {}", scheme))?;
        return Ok(url.to_string());
    }
    Ok(format!("{}://{}", scheme, raw_target))
}
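`normalize_outgoing_http_url` leans on `reqwest::Url` for parsing and validation; the rule itself is simple: an already-qualified URL gets its scheme swapped, a bare `host:port` target gets the configured scheme prepended. A stdlib-only sketch of that rule (illustrative `normalize` name, no URL validation):

```rust
// Stdlib-only sketch of the outgoing-URL normalization rule:
// swap the scheme on a qualified URL, prepend it otherwise.
fn normalize(scheme: &str, raw: &str) -> String {
    if let Some(rest) = raw
        .strip_prefix("http://")
        .or_else(|| raw.strip_prefix("https://"))
    {
        format!("{}://{}", scheme, rest)
    } else {
        format!("{}://{}", scheme, raw)
    }
}

fn main() {
    assert_eq!(normalize("https", "localhost:8080"), "https://localhost:8080");
    assert_eq!(normalize("https", "http://peer:1234/x"), "https://peer:1234/x");
    assert_eq!(normalize("http", "https://peer:1234"), "http://peer:1234");
}
```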

fn request_remote_addr(request: &Request) -> Option<SocketAddr> {
    request
        .extensions()
        .get::<ConnectInfo<SocketAddr>>()
        .map(|info| info.0)
}

fn request_is_whitelisted(state: &VolumeServerState, request: &Request) -> bool {
    request_remote_addr(request)
        .map(|remote_addr| {
            state
                .guard
                .read()
                .unwrap()
                .check_whitelist(&remote_addr.to_string())
        })
        .unwrap_or(true)
}

/// Middleware: set Server header, echo x-amz-request-id, set CORS if Origin present.
async fn common_headers_middleware(request: Request, next: Next) -> Response {
    let origin = request.headers().get("origin").cloned();
    let request_id = super::request_id::generate_http_request_id();

    let mut response = super::request_id::scope_request_id(
        request_id.clone(),
        async move { next.run(request).await },
    )
    .await;

    let headers = response.headers_mut();
    if let Ok(val) = HeaderValue::from_str(crate::version::server_header()) {
        headers.insert("Server", val);
    }

    if let Ok(val) = HeaderValue::from_str(&request_id) {
        headers.insert("X-Request-Id", val.clone());
        headers.insert("x-amz-request-id", val);
    }

    if origin.is_some() {
        headers.insert("Access-Control-Allow-Origin", HeaderValue::from_static("*"));
        headers.insert(
            "Access-Control-Allow-Credentials",
            HeaderValue::from_static("true"),
        );
    }

    response
}

/// Admin store handler — dispatches based on HTTP method.
/// Matches Go's privateStoreHandler: GET/HEAD → read, POST/PUT → write,
/// DELETE → delete, OPTIONS → CORS headers, anything else → 400.
async fn admin_store_handler(state: State<Arc<VolumeServerState>>, request: Request) -> Response {
    let start = std::time::Instant::now();
    let method = request.method().clone();
    let mut method_str = method.as_str().to_string();
    let request_bytes = request
        .headers()
        .get(header::CONTENT_LENGTH)
        .and_then(|value| value.to_str().ok())
        .and_then(|value| value.parse::<i64>().ok())
        .filter(|value| *value > 0)
        .unwrap_or(0);
    super::server_stats::record_request_open();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .inc();
    let whitelist_rejected = matches!(method, Method::POST | Method::PUT | Method::DELETE)
        && !request_is_whitelisted(&state, &request);
    let response = match method.clone() {
        _ if whitelist_rejected => StatusCode::UNAUTHORIZED.into_response(),
        Method::GET | Method::HEAD => {
            super::server_stats::record_read_request();
            handlers::get_or_head_handler_from_request(state, request).await
        }
        Method::POST | Method::PUT => {
            super::server_stats::record_write_request();
            if request_bytes > 0 {
                super::server_stats::record_bytes_in(request_bytes);
            }
            handlers::post_handler(state, request).await
        }
        Method::DELETE => {
            super::server_stats::record_delete_request();
            handlers::delete_handler(state, request).await
        }
        Method::OPTIONS => {
            super::server_stats::record_read_request();
            admin_options_response()
        }
        _ => {
            let method_name = request.method().to_string();
            let query = request.uri().query().map(|q| q.to_string());
            method_str = "INVALID".to_string();
            handlers::json_error_with_query(
                StatusCode::BAD_REQUEST,
                format!("unsupported method {}", method_name),
                query.as_deref(),
            )
        }
    };
    if method == Method::GET {
        if let Some(response_bytes) = response
            .headers()
            .get(header::CONTENT_LENGTH)
            .and_then(|value| value.to_str().ok())
            .and_then(|value| value.parse::<i64>().ok())
            .filter(|value| *value > 0)
        {
            super::server_stats::record_bytes_out(response_bytes);
        }
    }
    super::server_stats::record_request_close();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .dec();
    crate::metrics::REQUEST_COUNTER
        .with_label_values(&[&method_str, response.status().as_str()])
        .inc();
    crate::metrics::REQUEST_DURATION
        .with_label_values(&[&method_str])
        .observe(start.elapsed().as_secs_f64());
    response
}
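Both store handlers read `Content-Length` through the same `Option` chain: header value to `&str` to `i64`, keeping only positive values and defaulting to zero. A standalone sketch of that chain (`positive_content_length` is an illustrative name; the real code starts from a `HeaderValue` rather than an `Option<&str>`):

```rust
// Parse a Content-Length string into a positive i64, defaulting to 0 for
// missing, unparsable, or non-positive values, as the handlers above do.
fn positive_content_length(raw: Option<&str>) -> i64 {
    raw.and_then(|value| value.parse::<i64>().ok())
        .filter(|value| *value > 0)
        .unwrap_or(0)
}

fn main() {
    assert_eq!(positive_content_length(Some("1024")), 1024);
    assert_eq!(positive_content_length(Some("-5")), 0);
    assert_eq!(positive_content_length(Some("junk")), 0);
    assert_eq!(positive_content_length(None), 0);
}
```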

/// Public store handler — dispatches based on HTTP method.
/// Matches Go's publicReadOnlyHandler: GET/HEAD → read, OPTIONS → CORS,
/// anything else → 200 (passthrough no-op).
async fn public_store_handler(state: State<Arc<VolumeServerState>>, request: Request) -> Response {
    let start = std::time::Instant::now();
    let method = request.method().clone();
    let method_str = method.as_str().to_string();
    super::server_stats::record_request_open();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .inc();
    let response = match method.clone() {
        Method::GET | Method::HEAD => {
            super::server_stats::record_read_request();
            handlers::get_or_head_handler_from_request(state, request).await
        }
        Method::OPTIONS => {
            super::server_stats::record_read_request();
            public_options_response()
        }
        _ => StatusCode::OK.into_response(),
    };
    if method == Method::GET {
        if let Some(response_bytes) = response
            .headers()
            .get(header::CONTENT_LENGTH)
            .and_then(|value| value.to_str().ok())
            .and_then(|value| value.parse::<i64>().ok())
            .filter(|value| *value > 0)
        {
            super::server_stats::record_bytes_out(response_bytes);
        }
    }
    super::server_stats::record_request_close();
    crate::metrics::INFLIGHT_REQUESTS_GAUGE
        .with_label_values(&[&method_str])
        .dec();
    crate::metrics::REQUEST_COUNTER
        .with_label_values(&[&method_str, response.status().as_str()])
        .inc();
    crate::metrics::REQUEST_DURATION
        .with_label_values(&[&method_str])
        .observe(start.elapsed().as_secs_f64());
    response
}

/// Build OPTIONS response for admin port.
fn admin_options_response() -> Response {
    let mut response = StatusCode::OK.into_response();
    let headers = response.headers_mut();
    headers.insert(
        "Access-Control-Allow-Methods",
        HeaderValue::from_static("PUT, POST, GET, DELETE, OPTIONS"),
    );
    headers.insert(
        "Access-Control-Allow-Headers",
        HeaderValue::from_static("*"),
    );
    response
}

/// Build OPTIONS response for public port.
fn public_options_response() -> Response {
    let mut response = StatusCode::OK.into_response();
    let headers = response.headers_mut();
    headers.insert(
        "Access-Control-Allow-Methods",
        HeaderValue::from_static("GET, OPTIONS"),
    );
    headers.insert(
        "Access-Control-Allow-Headers",
        HeaderValue::from_static("*"),
    );
    response
}

/// Build the admin (private) HTTP router — supports all operations.
/// UI route is only registered when no signing keys are configured,
/// matching Go's `if signingKey == "" || enableUiAccess` check.
pub fn build_admin_router(state: Arc<VolumeServerState>) -> Router {
    let guard = state.guard.read().unwrap();
    // This helper can only derive the default Go behavior from the guard state:
    // UI stays enabled when the write signing key is empty. The explicit
    // `access.ui` override is handled by `build_admin_router_with_ui(...)`.
    let ui_enabled = guard.signing_key.0.is_empty();
    drop(guard);
    build_admin_router_with_ui(state, ui_enabled)
}

/// Build the admin router with an explicit UI exposure flag.
pub fn build_admin_router_with_ui(state: Arc<VolumeServerState>, ui_enabled: bool) -> Router {
    let mut router = Router::new()
        .route("/status", get(handlers::status_handler))
        .route("/healthz", get(handlers::healthz_handler))
        .route("/favicon.ico", get(handlers::favicon_handler))
        .route(
            "/seaweedfsstatic/*path",
            get(handlers::static_asset_handler),
        )
        .route("/", any(admin_store_handler))
        .route("/:path", any(admin_store_handler))
        .route("/:vid/:fid", any(admin_store_handler))
        .route("/:vid/:fid/:filename", any(admin_store_handler))
        .fallback(admin_store_handler);
    if ui_enabled {
        // Note: /stats/* endpoints are commented out in Go's volume_server.go (L130-134).
        // Only the UI endpoint is registered when UI access is enabled.
        router = router.route("/ui/index.html", get(handlers::ui_handler));
    }
    router
        .layer(middleware::from_fn(common_headers_middleware))
        .with_state(state)
}

/// Build the public (read-only) HTTP router — only GET/HEAD.
pub fn build_public_router(state: Arc<VolumeServerState>) -> Router {
    Router::new()
        .route("/favicon.ico", get(handlers::favicon_handler))
        .route(
            "/seaweedfsstatic/*path",
            get(handlers::static_asset_handler),
        )
        .route("/", any(public_store_handler))
        .route("/:path", any(public_store_handler))
        .route("/:vid/:fid", any(public_store_handler))
        .route("/:vid/:fid/:filename", any(public_store_handler))
        .fallback(public_store_handler)
        .layer(middleware::from_fn(common_headers_middleware))
        .with_state(state)
}
|
||||
seaweed-volume/src/server/write_queue.rs  (Normal file, 330 lines)
@@ -0,0 +1,330 @@
//! Async batched write processing for the volume server.
//!
//! Instead of each upload handler directly calling `write_needle` and syncing,
//! writes are submitted to a queue. A background worker drains the queue in
//! batches (up to 128 entries), groups them by volume ID, processes them
//! together, and syncs once per volume for the entire batch.

use std::sync::Arc;

use tokio::sync::{mpsc, oneshot};
use tracing::debug;

use crate::storage::needle::needle::Needle;
use crate::storage::types::{Size, VolumeId};
use crate::storage::volume::VolumeError;

use super::volume_server::VolumeServerState;

/// Result of a single write operation: (offset, size, is_unchanged).
pub type WriteResult = Result<(u64, Size, bool), VolumeError>;

/// A request to write a needle, submitted to the write queue.
pub struct WriteRequest {
    pub volume_id: VolumeId,
    pub needle: Needle,
    pub response_tx: oneshot::Sender<WriteResult>,
}

/// Maximum number of write requests to batch together.
const MAX_BATCH_SIZE: usize = 128;

/// Maximum bytes to accumulate per batch before breaking (matches Go's 4MB limit).
/// This prevents large writes from accumulating unbounded latency.
const MAX_BATCH_BYTES: usize = 4 * 1024 * 1024;

/// Handle for submitting write requests to the background worker.
#[derive(Clone)]
pub struct WriteQueue {
    tx: mpsc::Sender<WriteRequest>,
}

impl WriteQueue {
    /// Create a new write queue and spawn the background worker.
    ///
    /// `capacity` controls the channel buffer size (backpressure kicks in when full).
    /// The worker holds a reference to `state` for accessing the store.
    pub fn new(state: Arc<VolumeServerState>, capacity: usize) -> Self {
        let (tx, rx) = mpsc::channel(capacity);
        let worker = WriteQueueWorker { rx, state };
        tokio::spawn(worker.run());
        WriteQueue { tx }
    }

    /// Submit a write request and wait for the result.
    ///
    /// Returns `Err` if the worker has shut down or the response channel was dropped.
    pub async fn submit(&self, volume_id: VolumeId, needle: Needle) -> WriteResult {
        let (response_tx, response_rx) = oneshot::channel();
        let request = WriteRequest {
            volume_id,
            needle,
            response_tx,
        };

        // Send to queue; this awaits if the channel is full (backpressure).
        if self.tx.send(request).await.is_err() {
            return Err(VolumeError::Io(std::io::Error::new(
                std::io::ErrorKind::BrokenPipe,
                "write queue worker has shut down",
            )));
        }

        // Wait for the worker to process our request.
        match response_rx.await {
            Ok(result) => result,
            Err(_) => Err(VolumeError::Io(std::io::Error::new(
                std::io::ErrorKind::BrokenPipe,
                "write queue worker dropped response channel",
            ))),
        }
    }
}

/// Background worker that drains write requests and processes them in batches.
struct WriteQueueWorker {
    rx: mpsc::Receiver<WriteRequest>,
    state: Arc<VolumeServerState>,
}

impl WriteQueueWorker {
    async fn run(mut self) {
        debug!("write queue worker started");

        loop {
            // Wait for the first request (blocks until one arrives or channel closes).
            let first = match self.rx.recv().await {
                Some(req) => req,
                None => {
                    debug!("write queue channel closed, worker exiting");
                    return;
                }
            };

            // Drain as many additional requests as available, up to MAX_BATCH_SIZE
            // or MAX_BATCH_BYTES (matches Go: 128 requests or 4MB, whichever comes first).
            let mut batch = Vec::with_capacity(MAX_BATCH_SIZE);
            let mut batch_bytes: usize = first.needle.data.len();
            batch.push(first);

            while batch.len() < MAX_BATCH_SIZE && batch_bytes < MAX_BATCH_BYTES {
                match self.rx.try_recv() {
                    Ok(req) => {
                        batch_bytes += req.needle.data.len();
                        batch.push(req);
                    }
                    Err(_) => break,
                }
            }

            let batch_size = batch.len();
            debug!("processing write batch of {} requests", batch_size);

            // Process the batch in spawn_blocking since write_needle does file I/O.
            let state = self.state.clone();
            let _ = tokio::task::spawn_blocking(move || {
                process_batch(state, batch);
            })
            .await;
        }
    }
}

/// Process a batch of write requests, grouped by volume ID.
///
/// Groups writes by volume to minimize the number of store lock acquisitions,
/// then sends results back via each request's oneshot channel.
fn process_batch(state: Arc<VolumeServerState>, batch: Vec<WriteRequest>) {
    // Group requests by volume ID for efficient processing.
    // We use a Vec of (VolumeId, Vec<(Needle, Sender)>) to preserve order
    // and avoid requiring Hash on VolumeId.
    let mut groups: Vec<(VolumeId, Vec<(Needle, oneshot::Sender<WriteResult>)>)> = Vec::new();

    for req in batch {
        let vid = req.volume_id;
        if let Some(group) = groups.iter_mut().find(|(v, _)| *v == vid) {
            group.1.push((req.needle, req.response_tx));
        } else {
            groups.push((vid, vec![(req.needle, req.response_tx)]));
        }
    }

    // Process each volume group under a single store lock.
    let mut store = state.store.write().unwrap();

    for (vid, entries) in groups {
        for (mut needle, response_tx) in entries {
            let result = store.write_volume_needle(vid, &mut needle);
            // Send result back; ignore error if receiver dropped.
            let _ = response_tx.send(result);
        }
    }
}
#[cfg(test)]
mod tests {
    use super::*;
    use crate::storage::types::VolumeId;

    /// Helper to create a minimal VolumeServerState for testing.
    fn make_test_state() -> Arc<VolumeServerState> {
        use crate::security::{Guard, SigningKey};
        use crate::server::volume_server::RuntimeMetricsConfig;
        use crate::storage::needle_map::NeedleMapKind;
        use crate::storage::store::Store;
        use std::sync::atomic::{AtomicBool, AtomicI64, AtomicU32};
        use std::sync::RwLock;

        let store = Store::new(NeedleMapKind::InMemory);
        let guard = Guard::new(&[], SigningKey(vec![]), 0, SigningKey(vec![]), 0);

        Arc::new(VolumeServerState {
            store: RwLock::new(store),
            guard: RwLock::new(guard),
            is_stopping: RwLock::new(false),
            maintenance: AtomicBool::new(false),
            state_version: AtomicU32::new(0),
            concurrent_upload_limit: 0,
            concurrent_download_limit: 0,
            inflight_upload_data_timeout: std::time::Duration::ZERO,
            inflight_download_data_timeout: std::time::Duration::ZERO,
            inflight_upload_bytes: AtomicI64::new(0),
            inflight_download_bytes: AtomicI64::new(0),
            upload_notify: tokio::sync::Notify::new(),
            download_notify: tokio::sync::Notify::new(),
            data_center: String::new(),
            rack: String::new(),
            file_size_limit_bytes: 0,
            maintenance_byte_per_second: 0,
            is_heartbeating: AtomicBool::new(false),
            has_master: false,
            pre_stop_seconds: 0,
            volume_state_notify: tokio::sync::Notify::new(),
            write_queue: std::sync::OnceLock::new(),
            s3_tier_registry: std::sync::RwLock::new(
                crate::remote_storage::s3_tier::S3TierRegistry::new(),
            ),
            read_mode: crate::config::ReadMode::Local,
            master_url: String::new(),
            master_urls: Vec::new(),
            self_url: String::new(),
            http_client: reqwest::Client::new(),
            outgoing_http_scheme: "http".to_string(),
            outgoing_grpc_tls: None,
            metrics_runtime: std::sync::RwLock::new(RuntimeMetricsConfig::default()),
            metrics_notify: tokio::sync::Notify::new(),
            fix_jpg_orientation: false,
            has_slow_read: true,
            read_buffer_size_bytes: 4 * 1024 * 1024,
            security_file: String::new(),
            cli_white_list: vec![],
            state_file_path: String::new(),
        })
    }

    #[tokio::test]
    async fn test_write_queue_submit_no_volume() {
        // Submit a write to a non-existent volume -- should return VolumeError::NotFound.
        let state = make_test_state();
        let queue = WriteQueue::new(state, MAX_BATCH_SIZE);

        let needle = Needle {
            id: 1.into(),
            cookie: 0x12345678.into(),
            data: vec![1, 2, 3],
            data_size: 3,
            ..Needle::default()
        };

        let result = queue.submit(VolumeId(999), needle).await;
        assert!(result.is_err());
        match result {
            Err(VolumeError::NotFound) => {} // expected
            other => panic!("expected NotFound, got {:?}", other),
        }
    }

    #[tokio::test]
    async fn test_write_queue_concurrent_submissions() {
        // Submit multiple concurrent writes -- all should complete (with errors since no volume).
        let state = make_test_state();
        let queue = WriteQueue::new(state, MAX_BATCH_SIZE);

        let mut handles = Vec::new();
        for i in 0..10u64 {
            let q = queue.clone();
            handles.push(tokio::spawn(async move {
                let needle = Needle {
                    id: i.into(),
                    cookie: 0xABCD.into(),
                    data: vec![i as u8; 10],
                    data_size: 10,
                    ..Needle::default()
                };
                q.submit(VolumeId(1), needle).await
            }));
        }

        for handle in handles {
            let result = handle.await.unwrap();
            // All should fail with NotFound since there's no volume 1
            assert!(matches!(result, Err(VolumeError::NotFound)));
        }
    }

    #[tokio::test]
    async fn test_write_queue_batching() {
        // Verify that many concurrent writes get processed (testing the batching path).
        let state = make_test_state();
        let queue = WriteQueue::new(state, MAX_BATCH_SIZE);

        // Submit MAX_BATCH_SIZE requests concurrently
        let mut handles = Vec::new();
        for i in 0..MAX_BATCH_SIZE as u64 {
            let q = queue.clone();
            handles.push(tokio::spawn(async move {
                let needle = Needle {
                    id: i.into(),
                    cookie: 0x1111.into(),
                    data: vec![0u8; 4],
                    data_size: 4,
                    ..Needle::default()
                };
                q.submit(VolumeId(42), needle).await
            }));
        }

        let mut results = Vec::new();
        for handle in handles {
            results.push(handle.await.unwrap());
        }

        // All should complete (with NotFound errors since no volume exists)
        assert_eq!(results.len(), MAX_BATCH_SIZE);
        for r in results {
            assert!(matches!(r, Err(VolumeError::NotFound)));
        }
    }

    #[tokio::test]
    async fn test_write_queue_dropped_sender() {
        // Dropping one queue handle must not shut down the worker: as long as a
        // clone holds the channel open, submits through it still complete.
        let state = make_test_state();
        let queue = WriteQueue::new(state, 1);

        // Clone then drop the original -- the worker keeps running via its rx handle.
        let queue2 = queue.clone();
        drop(queue);

        // This should still work since the worker is alive.
        let needle = Needle {
            id: 1.into(),
            cookie: 0.into(),
            data: vec![],
            data_size: 0,
            ..Needle::default()
        };
        let result = queue2.submit(VolumeId(1), needle).await;
        assert!(result.is_err()); // NotFound is fine -- the point is it doesn't panic
    }
}
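The drain loop above (block for the first request, then opportunistically pull more until a count or byte cap is hit) can be exercised in isolation. This is a standalone sketch using `std::sync::mpsc` with plain byte payloads instead of the crate's tokio channel and `WriteRequest` type; the names `drain_batch` and `batch_sizes` are illustrative, not part of the crate.

```rust
use std::sync::mpsc;

const MAX_BATCH_SIZE: usize = 128;
const MAX_BATCH_BYTES: usize = 4 * 1024 * 1024;

/// Drain one batch: block for the first item, then opportunistically take
/// more until the count or byte cap is reached or the queue is empty.
fn drain_batch(rx: &mpsc::Receiver<Vec<u8>>) -> Option<Vec<Vec<u8>>> {
    let first = rx.recv().ok()?; // None once all senders are gone and queue is empty
    let mut batch_bytes = first.len();
    let mut batch = vec![first];
    while batch.len() < MAX_BATCH_SIZE && batch_bytes < MAX_BATCH_BYTES {
        match rx.try_recv() {
            Ok(item) => {
                batch_bytes += item.len();
                batch.push(item);
            }
            Err(_) => break, // queue momentarily empty: ship what we have
        }
    }
    Some(batch)
}

/// Queue `n` small writes, then report how each drained batch sizes out.
fn batch_sizes(n: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    for i in 0..n {
        tx.send(vec![(i % 256) as u8; 10]).unwrap();
    }
    drop(tx);
    let mut sizes = Vec::new();
    while let Some(batch) = drain_batch(&rx) {
        sizes.push(batch.len());
    }
    sizes
}

fn main() {
    // 200 queued writes split into a full 128-entry batch plus a 72-entry tail.
    assert_eq!(batch_sizes(200), vec![128, 72]);
    println!("batch sizes for 200 queued writes: {:?}", batch_sizes(200));
}
```

With small payloads the count cap binds first; with multi-megabyte needles the 4MB byte cap would cut batches short instead, which is the point of tracking `batch_bytes` alongside the entry count.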
seaweed-volume/src/storage/disk_location.rs  (Normal file, 951 lines)
@@ -0,0 +1,951 @@
//! DiskLocation: manages volumes on a single disk/directory.
//!
//! Each DiskLocation represents one storage directory containing .dat + .idx files.
//! A Store contains one or more DiskLocations (one per configured directory).
//! Matches Go's storage/disk_location.go.

use std::collections::{HashMap, HashSet};
use std::fs;
use std::io;
use std::sync::atomic::{AtomicBool, AtomicI32, AtomicU64, Ordering};
use std::sync::Arc;

use tracing::{info, warn};

use crate::config::MinFreeSpace;
use crate::storage::erasure_coding::ec_shard::{
    EcVolumeShard, DATA_SHARDS_COUNT, ERASURE_CODING_LARGE_BLOCK_SIZE,
    ERASURE_CODING_SMALL_BLOCK_SIZE,
};
use crate::storage::erasure_coding::ec_volume::EcVolume;
use crate::storage::needle_map::NeedleMapKind;
use crate::storage::super_block::ReplicaPlacement;
use crate::storage::types::*;
use crate::storage::volume::{remove_volume_files, volume_file_name, Volume, VolumeError};

/// A single disk location managing volumes in one directory.
pub struct DiskLocation {
    pub directory: String,
    pub idx_directory: String,
    pub directory_uuid: String,
    pub disk_type: DiskType,
    pub tags: Vec<String>,
    pub max_volume_count: AtomicI32,
    pub original_max_volume_count: i32,
    volumes: HashMap<VolumeId, Volume>,
    ec_volumes: HashMap<VolumeId, EcVolume>,
    pub is_disk_space_low: Arc<AtomicBool>,
    pub available_space: AtomicU64,
    pub min_free_space: MinFreeSpace,
}

impl DiskLocation {
    const UUID_FILE_NAME: &'static str = "vol_dir.uuid";

    pub fn new(
        directory: &str,
        idx_directory: &str,
        max_volume_count: i32,
        disk_type: DiskType,
        min_free_space: MinFreeSpace,
        tags: Vec<String>,
    ) -> io::Result<Self> {
        fs::create_dir_all(directory)?;

        let idx_dir = if idx_directory.is_empty() {
            directory.to_string()
        } else {
            fs::create_dir_all(idx_directory)?;
            idx_directory.to_string()
        };
        let directory_uuid = Self::generate_directory_uuid(directory)?;

        Ok(DiskLocation {
            directory: directory.to_string(),
            idx_directory: idx_dir,
            directory_uuid,
            disk_type,
            tags,
            max_volume_count: AtomicI32::new(max_volume_count),
            original_max_volume_count: max_volume_count,
            volumes: HashMap::new(),
            ec_volumes: HashMap::new(),
            is_disk_space_low: Arc::new(AtomicBool::new(false)),
            available_space: AtomicU64::new(0),
            min_free_space,
        })
    }

    fn generate_directory_uuid(directory: &str) -> io::Result<String> {
        let path = std::path::Path::new(directory).join(Self::UUID_FILE_NAME);
        if path.exists() {
            let existing = fs::read_to_string(&path)?;
            if !existing.trim().is_empty() {
                return Ok(existing);
            }
        }

        let dir_uuid = uuid::Uuid::new_v4().to_string();
        fs::write(path, &dir_uuid)?;
        Ok(dir_uuid)
    }

    // ---- Volume management ----

    /// Load existing volumes from the directory.
    ///
    /// Matches Go's `loadExistingVolume`: checks for incomplete volumes (.note file),
    /// validates EC shards before skipping .dat loading, and cleans up stale
    /// compaction temp files (.cpd/.cpx).
    pub fn load_existing_volumes(&mut self, needle_map_kind: NeedleMapKind) -> io::Result<()> {
        // Ensure directory exists
        fs::create_dir_all(&self.directory)?;
        if self.directory != self.idx_directory {
            fs::create_dir_all(&self.idx_directory)?;
        }

        // Scan for .dat files
        let entries = fs::read_dir(&self.directory)?;
        let mut dat_files: Vec<(String, VolumeId)> = Vec::new();
        let mut seen = HashSet::new();

        for entry in entries {
            let entry = entry?;
            let name = entry.file_name().into_string().unwrap_or_default();
            if let Some((collection, vid)) = parse_volume_filename(&name) {
                if seen.insert((collection.clone(), vid)) {
                    dat_files.push((collection, vid));
                }
            }
        }

        for (collection, vid) in dat_files {
            let volume_name = volume_file_name(&self.directory, &collection, vid);
            let idx_name = volume_file_name(&self.idx_directory, &collection, vid);

            // Check for incomplete volume (.note file means a VolumeCopy was interrupted)
            let note_path = format!("{}.note", volume_name);
            if std::path::Path::new(&note_path).exists() {
                let note = fs::read_to_string(&note_path).unwrap_or_default();
                warn!(
                    volume_id = vid.0,
                    "volume was not completed: {}, removing files", note
                );
                remove_volume_files(&volume_name);
                remove_volume_files(&idx_name);
                continue;
            }

            // If valid EC shards exist (.ecx file present), skip loading .dat
            let ecx_path = format!("{}.ecx", idx_name);
            let ecx_exists = if std::path::Path::new(&ecx_path).exists() {
                true
            } else if self.idx_directory != self.directory {
                // .ecx may have been created before -dir.idx was configured
                let fallback = format!("{}.ecx", volume_name);
                std::path::Path::new(&fallback).exists()
            } else {
                false
            };
            if ecx_exists {
                if self.validate_ec_volume(&collection, vid) {
                    // Valid EC volume — don't load .dat
                    continue;
                } else {
                    warn!(
                        volume_id = vid.0,
                        "EC volume validation failed, removing incomplete EC files"
                    );
                    self.remove_ec_volume_files(&collection, vid);
                    // Fall through to load .dat file
                }
            }

            // Clean up stale compaction temp files
            let cpd_path = format!("{}.cpd", volume_name);
            let cpx_path = format!("{}.cpx", idx_name);
            if std::path::Path::new(&cpd_path).exists() {
                info!(volume_id = vid.0, "removing stale compaction file .cpd");
                let _ = fs::remove_file(&cpd_path);
            }
            if std::path::Path::new(&cpx_path).exists() {
                info!(volume_id = vid.0, "removing stale compaction file .cpx");
                let _ = fs::remove_file(&cpx_path);
            }

            // Skip if already loaded (e.g., from a previous call)
            if self.volumes.contains_key(&vid) {
                continue;
            }

            match Volume::new(
                &self.directory,
                &self.idx_directory,
                &collection,
                vid,
                needle_map_kind,
                None, // replica placement read from superblock
                None, // TTL read from superblock
                0,    // no preallocate on load
                Version::current(),
            ) {
                Ok(mut v) => {
                    v.location_disk_space_low = self.is_disk_space_low.clone();
                    crate::metrics::VOLUME_GAUGE
                        .with_label_values(&[&collection, "volume"])
                        .inc();
                    self.volumes.insert(vid, v);
                }
                Err(e) => {
                    warn!(volume_id = vid.0, error = %e, "failed to load volume");
                }
            }
        }

        Ok(())
    }

    /// Validate EC volume shards: all shards must be same size, and if .dat exists,
    /// need at least DATA_SHARDS_COUNT shards with size matching expected.
    fn validate_ec_volume(&self, collection: &str, vid: VolumeId) -> bool {
        let base = volume_file_name(&self.directory, collection, vid);
        let dat_path = format!("{}.dat", base);

        let mut expected_shard_size: Option<i64> = None;
        let dat_exists = std::path::Path::new(&dat_path).exists();

        if dat_exists {
            if let Ok(meta) = fs::metadata(&dat_path) {
                expected_shard_size = Some(calculate_expected_shard_size(meta.len() as i64));
            } else {
                return false;
            }
        }

        let mut shard_count = 0usize;
        let mut actual_shard_size: Option<i64> = None;
        const MAX_SHARD_COUNT: usize = 32;

        for i in 0..MAX_SHARD_COUNT {
            let shard_path = format!("{}.ec{:02}", base, i);
            match fs::metadata(&shard_path) {
                Ok(meta) if meta.len() > 0 => {
                    let size = meta.len() as i64;
                    if let Some(prev) = actual_shard_size {
                        if size != prev {
                            warn!(
                                volume_id = vid.0,
                                shard = i,
                                size,
                                expected = prev,
                                "EC shard size mismatch"
                            );
                            return false;
                        }
                    } else {
                        actual_shard_size = Some(size);
                    }
                    shard_count += 1;
                }
                Err(e) if e.kind() != io::ErrorKind::NotFound => {
                    warn!(
                        volume_id = vid.0,
                        shard = i,
                        error = %e,
                        "failed to stat EC shard"
                    );
                    return false;
                }
                _ => {} // not found or zero size — skip
            }
        }

        // If .dat exists, validate shard size matches expected
        if dat_exists {
            if let (Some(actual), Some(expected)) = (actual_shard_size, expected_shard_size) {
                if actual != expected {
                    warn!(
                        volume_id = vid.0,
                        actual_shard_size = actual,
                        expected_shard_size = expected,
                        "EC shard size doesn't match .dat file"
                    );
                    return false;
                }
            }
        }

        // Distributed EC (no .dat): any shard count is valid
        if !dat_exists {
            return true;
        }

        // With .dat: need at least DATA_SHARDS_COUNT shards
        if shard_count < DATA_SHARDS_COUNT {
            warn!(
                volume_id = vid.0,
                shard_count,
                required = DATA_SHARDS_COUNT,
                "EC volume has .dat but too few shards"
            );
            return false;
        }

        true
    }

    /// Remove all EC-related files for a volume.
    fn remove_ec_volume_files(&self, collection: &str, vid: VolumeId) {
        let base = volume_file_name(&self.directory, collection, vid);
        let idx_base = volume_file_name(&self.idx_directory, collection, vid);
        const MAX_SHARD_COUNT: usize = 32;

        // Remove index files from idx directory (.ecx, .ecj)
        let _ = fs::remove_file(format!("{}.ecx", idx_base));
        let _ = fs::remove_file(format!("{}.ecj", idx_base));
        // Also try data directory in case .ecx/.ecj were created before -dir.idx was configured
        if self.idx_directory != self.directory {
            let _ = fs::remove_file(format!("{}.ecx", base));
            let _ = fs::remove_file(format!("{}.ecj", base));
        }

        // Remove all EC shard files (.ec00 ~ .ec31)
        for i in 0..MAX_SHARD_COUNT {
            let _ = fs::remove_file(format!("{}.ec{:02}", base, i));
        }
    }

    /// Find a volume by ID.
    pub fn find_volume(&self, vid: VolumeId) -> Option<&Volume> {
        self.volumes.get(&vid)
    }

    /// Find a volume by ID (mutable).
    pub fn find_volume_mut(&mut self, vid: VolumeId) -> Option<&mut Volume> {
        self.volumes.get_mut(&vid)
    }

    /// Add a volume to this location.
    pub fn set_volume(&mut self, vid: VolumeId, volume: Volume) {
        let collection = volume.collection.clone();
        self.volumes.insert(vid, volume);
        crate::metrics::VOLUME_GAUGE
            .with_label_values(&[&collection, "volume"])
            .inc();
    }

    /// Create a new volume in this location.
    pub fn create_volume(
        &mut self,
        vid: VolumeId,
        collection: &str,
        needle_map_kind: NeedleMapKind,
        replica_placement: Option<ReplicaPlacement>,
        ttl: Option<crate::storage::needle::ttl::TTL>,
        preallocate: u64,
        version: Version,
    ) -> Result<(), VolumeError> {
        let mut v = Volume::new(
            &self.directory,
            &self.idx_directory,
            collection,
            vid,
            needle_map_kind,
            replica_placement,
            ttl,
            preallocate,
            version,
        )?;
        v.location_disk_space_low = self.is_disk_space_low.clone();
        crate::metrics::VOLUME_GAUGE
            .with_label_values(&[collection, "volume"])
            .inc();
        self.volumes.insert(vid, v);
        Ok(())
    }

    /// Remove and close a volume.
    pub fn unload_volume(&mut self, vid: VolumeId) -> Option<Volume> {
        if let Some(mut v) = self.volumes.remove(&vid) {
            crate::metrics::VOLUME_GAUGE
                .with_label_values(&[&v.collection, "volume"])
                .dec();
            v.close();
            Some(v)
        } else {
            None
        }
    }

    /// Remove, close, and delete all files for a volume.
    pub fn delete_volume(&mut self, vid: VolumeId, only_empty: bool) -> Result<(), VolumeError> {
        if let Some(mut v) = self.volumes.remove(&vid) {
            crate::metrics::VOLUME_GAUGE
                .with_label_values(&[&v.collection, "volume"])
                .dec();
            v.destroy(only_empty)?;
            Ok(())
        } else {
            Err(VolumeError::NotFound)
        }
    }

    /// Delete all volumes in a collection.
    pub fn delete_collection(&mut self, collection: &str) -> Result<(), VolumeError> {
        let vids: Vec<VolumeId> = self
            .volumes
            .iter()
            .filter(|(_, v)| v.collection == collection && !v.is_compacting())
            .map(|(vid, _)| *vid)
            .collect();

        for vid in vids {
            if let Some(mut v) = self.volumes.remove(&vid) {
                crate::metrics::VOLUME_GAUGE
                    .with_label_values(&[&v.collection, "volume"])
                    .dec();
                if let Err(e) = v.destroy(false) {
                    warn!(volume_id = vid.0, error = %e, "delete collection: failed to destroy volume");
                }
            }
        }

        let ec_vids: Vec<VolumeId> = self
            .ec_volumes
            .iter()
            .filter(|(_, v)| v.collection == collection)
            .map(|(vid, _)| *vid)
            .collect();

        for vid in ec_vids {
            if let Some(mut ec_vol) = self.ec_volumes.remove(&vid) {
                for _ in 0..ec_vol.shard_count() {
                    crate::metrics::VOLUME_GAUGE
                        .with_label_values(&[collection, "ec_shards"])
                        .dec();
                }
                ec_vol.destroy();
            }
        }
        Ok(())
    }

    // ---- Metrics ----

    /// Number of volumes on this disk.
    pub fn volumes_len(&self) -> usize {
        self.volumes.len()
    }

    /// Get all volume IDs, sorted.
    pub fn volume_ids(&self) -> Vec<VolumeId> {
        let mut ids: Vec<VolumeId> = self.volumes.keys().copied().collect();
        ids.sort();
        ids
    }

    /// Iterate over all volumes.
    pub fn iter_volumes(&self) -> impl Iterator<Item = (&VolumeId, &Volume)> {
        self.volumes.iter()
    }

    /// Number of free volume slots.
    /// Matches Go's FindFreeLocation formula:
    /// free = ((MaxVolumeCount - VolumesLen()) * DataShardsCount - EcShardCount()) / DataShardsCount
    pub fn free_volume_count(&self) -> i32 {
        use crate::storage::erasure_coding::ec_shard::DATA_SHARDS_COUNT;
        let max = self.max_volume_count.load(Ordering::Relaxed);
        let free_count = (max as i64 - self.volumes.len() as i64)
            * DATA_SHARDS_COUNT as i64
            - self.ec_shard_count() as i64;
        let effective_free = free_count / DATA_SHARDS_COUNT as i64;
        if effective_free > 0 {
            effective_free as i32
        } else {
            0
        }
    }
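The FindFreeLocation formula above charges each EC shard a fractional slot (1/DataShardsCount of a full volume). A standalone sketch with concrete numbers makes the arithmetic easy to check; the free-standing `free_volume_count` function and its parameters here are illustrative, not the crate's API.

```rust
/// Go's FindFreeLocation formula in isolation:
///   free = ((max - volumes) * data_shards - ec_shards) / data_shards,
/// clamped at zero. Each EC shard consumes 1/data_shards of a slot.
fn free_volume_count(max: i64, volumes: i64, ec_shards: i64, data_shards: i64) -> i32 {
    let free = ((max - volumes) * data_shards - ec_shards) / data_shards;
    if free > 0 {
        free as i32
    } else {
        0
    }
}

fn main() {
    // 8 slots, 5 normal volumes, 20 EC shards, 10 data shards per volume:
    // (3 * 10 - 20) / 10 = 1 free slot remains.
    assert_eq!(free_volume_count(8, 5, 20, 10), 1);
    // Enough EC shards can push the raw count negative; it clamps to 0.
    assert_eq!(free_volume_count(8, 5, 40, 10), 0);
    println!("free slots: {}", free_volume_count(8, 5, 20, 10));
}
```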
    /// Iterate over all volumes.
    pub fn volumes(&self) -> impl Iterator<Item = (&VolumeId, &Volume)> {
        self.volumes.iter()
    }

    /// Iterate over all volumes (mutable).
    pub fn volumes_mut(&mut self) -> impl Iterator<Item = (&VolumeId, &mut Volume)> {
        self.volumes.iter_mut()
    }

    /// Sum of unused space in writable volumes (volumeSizeLimit - actual size per volume).
    /// Used by auto-max-volume-count to estimate how many more volumes can fit.
    pub fn unused_space(&self, volume_size_limit: u64) -> u64 {
        let mut unused: u64 = 0;
        for vol in self.volumes.values() {
            if vol.is_read_only() {
                continue;
            }
            let dat_size = vol.dat_file_size().unwrap_or(0);
            let idx_size = vol.idx_file_size();
            let used = dat_size + idx_size;
            if volume_size_limit > used {
                unused += volume_size_limit - used;
            }
        }
        unused
    }

    /// Check disk space against min_free_space and update is_disk_space_low.
    pub fn check_disk_space(&self) {
        let (total, free) = get_disk_stats(&self.directory);
        if total == 0 {
            return;
        }
        let used = total.saturating_sub(free);
        let is_low = match &self.min_free_space {
            MinFreeSpace::Percent(pct) => {
                let free_pct = (free as f64 / total as f64) * 100.0;
                free_pct < *pct
            }
            MinFreeSpace::Bytes(min_bytes) => free < *min_bytes,
        };
        self.is_disk_space_low.store(is_low, Ordering::Relaxed);
        self.available_space.store(free, Ordering::Relaxed);

        // Update resource gauges
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "all"])
            .set(total as f64);
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "used"])
            .set(used as f64);
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "free"])
            .set(free as f64);
        // "avail" is same as "free" for us (Go subtracts reserved blocks but we use statvfs f_bavail)
        crate::metrics::RESOURCE_GAUGE
            .with_label_values(&[&self.directory, "avail"])
            .set(free as f64);
    }
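The low-space decision in `check_disk_space` supports two configurations: a floor on free space as a percentage of capacity, or an absolute byte floor. The predicate can be tested on its own; this is a minimal sketch with a local `MinFreeSpace` stand-in (the real type lives in `crate::config`), not the crate's implementation.

```rust
/// Stand-in for crate::config::MinFreeSpace: either a percentage of
/// total capacity or an absolute number of bytes that must stay free.
enum MinFreeSpace {
    Percent(f64),
    Bytes(u64),
}

/// The low-space predicate from check_disk_space, extracted:
/// compare free space against the configured floor.
fn is_disk_space_low(total: u64, free: u64, min: &MinFreeSpace) -> bool {
    match min {
        MinFreeSpace::Percent(pct) => (free as f64 / total as f64) * 100.0 < *pct,
        MinFreeSpace::Bytes(min_bytes) => free < *min_bytes,
    }
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    let total = 100 * GIB; // 100 GiB disk
    let free = 8 * GIB; // 8 GiB free

    // 8% free is below a 10% floor -> low.
    assert!(is_disk_space_low(total, free, &MinFreeSpace::Percent(10.0)));
    // 8 GiB free is above a 1 GiB floor -> not low.
    assert!(!is_disk_space_low(total, free, &MinFreeSpace::Bytes(GIB)));
    println!("ok");
}
```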
    // ---- EC volume operations ----

    /// Find an EC volume by ID.
    pub fn find_ec_volume(&self, vid: VolumeId) -> Option<&EcVolume> {
        self.ec_volumes.get(&vid)
    }

    /// Find an EC volume by ID (mutable).
    pub fn find_ec_volume_mut(&mut self, vid: VolumeId) -> Option<&mut EcVolume> {
        self.ec_volumes.get_mut(&vid)
    }

    /// Check if this location has an EC volume.
    pub fn has_ec_volume(&self, vid: VolumeId) -> bool {
        self.ec_volumes.contains_key(&vid)
    }

    /// Remove an EC volume, returning it.
    pub fn remove_ec_volume(&mut self, vid: VolumeId) -> Option<EcVolume> {
        self.ec_volumes.remove(&vid)
    }

    /// Mount EC shards for a volume on this location.
    pub fn mount_ec_shards(
        &mut self,
        vid: VolumeId,
        collection: &str,
        shard_ids: &[u32],
    ) -> Result<(), VolumeError> {
        let dir = self.directory.clone();
        let idx_dir = self.idx_directory.clone();
        let ec_vol = self
            .ec_volumes
            .entry(vid)
            .or_insert_with(|| EcVolume::new(&dir, &idx_dir, collection, vid).unwrap());
        ec_vol.disk_type = self.disk_type.clone();

        for &shard_id in shard_ids {
            let shard = EcVolumeShard::new(&dir, collection, vid, shard_id as u8);
            ec_vol.add_shard(shard).map_err(VolumeError::Io)?;
            crate::metrics::VOLUME_GAUGE
                .with_label_values(&[collection, "ec_shards"])
                .inc();
        }
        Ok(())
    }

    /// Unmount EC shards for a volume on this location.
    pub fn unmount_ec_shards(&mut self, vid: VolumeId, shard_ids: &[u32]) {
        if let Some(ec_vol) = self.ec_volumes.get_mut(&vid) {
            let collection = ec_vol.collection.clone();
            for &shard_id in shard_ids {
                ec_vol.remove_shard(shard_id as u8);
                crate::metrics::VOLUME_GAUGE
                    .with_label_values(&[&collection, "ec_shards"])
                    .dec();
            }
            if ec_vol.shard_count() == 0 {
                let mut vol = self.ec_volumes.remove(&vid).unwrap();
|
||||
vol.close();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Total number of EC shards on this location.
|
||||
pub fn ec_shard_count(&self) -> usize {
|
||||
self.ec_volumes
|
||||
.values()
|
||||
.map(|ecv| ecv.shards.iter().filter(|s| s.is_some()).count())
|
||||
.sum()
|
||||
}
|
||||
|
||||
/// Iterate over all EC volumes.
|
||||
pub fn ec_volumes(&self) -> impl Iterator<Item = (&VolumeId, &EcVolume)> {
|
||||
self.ec_volumes.iter()
|
||||
}
|
||||
|
||||
/// Close all volumes.
|
||||
pub fn close(&mut self) {
|
||||
for (_, v) in self.volumes.iter_mut() {
|
||||
v.close();
|
||||
}
|
||||
self.volumes.clear();
|
||||
for (_, mut ec_vol) in self.ec_volumes.drain() {
|
||||
ec_vol.close();
|
||||
}
|
||||
}
|
||||
}

/// Get total and free disk space for a given path.
/// Returns (total_bytes, free_bytes).
pub fn get_disk_stats(path: &str) -> (u64, u64) {
    #[cfg(unix)]
    {
        use std::ffi::CString;
        let c_path = match CString::new(path) {
            Ok(p) => p,
            Err(_) => return (0, 0),
        };
        unsafe {
            let mut stat: libc::statvfs = std::mem::zeroed();
            if libc::statvfs(c_path.as_ptr(), &mut stat) == 0 {
                let all = stat.f_blocks as u64 * stat.f_frsize as u64;
                let free = stat.f_bavail as u64 * stat.f_frsize as u64;
                return (all, free);
            }
        }
        (0, 0)
    }
    #[cfg(not(unix))]
    {
        let _ = path;
        (0, 0)
    }
}

/// Calculate the expected EC shard size from a .dat file size.
/// Matches Go's `calculateExpectedShardSize`: large batches (1GB block * data_shards)
/// first, then small batches (1MB block * data_shards) for the remainder; each batch
/// contributes one block to the per-shard size.
fn calculate_expected_shard_size(dat_file_size: i64) -> i64 {
    let large_batch_size = ERASURE_CODING_LARGE_BLOCK_SIZE as i64 * DATA_SHARDS_COUNT as i64;
    let num_large_batches = dat_file_size / large_batch_size;
    let mut shard_size = num_large_batches * ERASURE_CODING_LARGE_BLOCK_SIZE as i64;
    let remaining = dat_file_size - (num_large_batches * large_batch_size);

    if remaining > 0 {
        let small_batch_size = ERASURE_CODING_SMALL_BLOCK_SIZE as i64 * DATA_SHARDS_COUNT as i64;
        // Ceiling division
        let num_small_batches = (remaining + small_batch_size - 1) / small_batch_size;
        shard_size += num_small_batches * ERASURE_CODING_SMALL_BLOCK_SIZE as i64;
    }

    shard_size
}
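The batch math is easiest to check at the boundaries. A self-contained sketch with local constants standing in for the erasure-coding ones (assuming the 1GB/1MB block sizes and 10 data shards described in the doc comment):

```rust
// Hypothetical local mirror of calculate_expected_shard_size; LARGE_BLOCK,
// SMALL_BLOCK, and DATA_SHARDS are assumed values, not the crate constants.
const LARGE_BLOCK: i64 = 1 << 30; // 1GB, assumed
const SMALL_BLOCK: i64 = 1 << 20; // 1MB, assumed
const DATA_SHARDS: i64 = 10;

fn expected_shard_size(dat_file_size: i64) -> i64 {
    let large_batch = LARGE_BLOCK * DATA_SHARDS;
    let num_large = dat_file_size / large_batch;
    let mut shard = num_large * LARGE_BLOCK;
    let remaining = dat_file_size - num_large * large_batch;
    if remaining > 0 {
        let small_batch = SMALL_BLOCK * DATA_SHARDS;
        // Ceiling division: any partial small batch still costs a full block.
        shard += (remaining + small_batch - 1) / small_batch * SMALL_BLOCK;
    }
    shard
}

fn main() {
    // Empty file needs no shard space.
    assert_eq!(expected_shard_size(0), 0);
    // Exactly one large batch fills one large block per shard.
    assert_eq!(expected_shard_size(LARGE_BLOCK * DATA_SHARDS), LARGE_BLOCK);
    // A single byte still rounds up to one small block.
    assert_eq!(expected_shard_size(1), SMALL_BLOCK);
    // One byte past a large batch adds one small block.
    assert_eq!(
        expected_shard_size(LARGE_BLOCK * DATA_SHARDS + 1),
        LARGE_BLOCK + SMALL_BLOCK
    );
    println!("ok");
}
```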

/// Parse a volume filename like "collection_42.dat" or "42.dat" into (collection, VolumeId).
fn parse_volume_filename(filename: &str) -> Option<(String, VolumeId)> {
    let stem = filename
        .strip_suffix(".dat")
        .or_else(|| filename.strip_suffix(".vif"))
        .or_else(|| filename.strip_suffix(".idx"))?;
    if let Some(pos) = stem.rfind('_') {
        let collection = &stem[..pos];
        let id_str = &stem[pos + 1..];
        let id: u32 = id_str.parse().ok()?;
        Some((collection.to_string(), VolumeId(id)))
    } else {
        let id: u32 = stem.parse().ok()?;
        Some((String::new(), VolumeId(id)))
    }
}

// ============================================================================
// Tests
// ============================================================================

#[cfg(test)]
mod tests {
    use super::*;
    use tempfile::TempDir;

    #[test]
    fn test_parse_volume_filename() {
        assert_eq!(
            parse_volume_filename("42.dat"),
            Some(("".to_string(), VolumeId(42)))
        );
        assert_eq!(
            parse_volume_filename("pics_7.dat"),
            Some(("pics".to_string(), VolumeId(7)))
        );
        assert_eq!(
            parse_volume_filename("42.vif"),
            Some(("".to_string(), VolumeId(42)))
        );
        assert_eq!(
            parse_volume_filename("pics_7.idx"),
            Some(("pics".to_string(), VolumeId(7)))
        );
        assert_eq!(parse_volume_filename("notadat.idx"), None);
        assert_eq!(parse_volume_filename("bad.dat"), None);
    }

    #[test]
    fn test_disk_location_create_volume() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        loc.create_volume(
            VolumeId(1),
            "",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        assert_eq!(loc.volumes_len(), 1);
        assert!(loc.find_volume(VolumeId(1)).is_some());
        assert!(loc.find_volume(VolumeId(99)).is_none());
        assert_eq!(loc.free_volume_count(), 9);
    }

    #[test]
    fn test_disk_location_load_existing() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Create volumes
        {
            let mut loc = DiskLocation::new(
                dir,
                dir,
                10,
                DiskType::HardDrive,
                MinFreeSpace::Percent(1.0),
                Vec::new(),
            )
            .unwrap();
            loc.create_volume(
                VolumeId(1),
                "",
                NeedleMapKind::InMemory,
                None,
                None,
                0,
                Version::current(),
            )
            .unwrap();
            loc.create_volume(
                VolumeId(2),
                "test",
                NeedleMapKind::InMemory,
                None,
                None,
                0,
                Version::current(),
            )
            .unwrap();
            loc.close();
        }

        // Reload
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();
        loc.load_existing_volumes(NeedleMapKind::InMemory).unwrap();
        assert_eq!(loc.volumes_len(), 2);

        let ids = loc.volume_ids();
        assert!(ids.contains(&VolumeId(1)));
        assert!(ids.contains(&VolumeId(2)));
    }

    #[test]
    fn test_disk_location_delete_volume() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        loc.create_volume(
            VolumeId(1),
            "",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        loc.create_volume(
            VolumeId(2),
            "",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        assert_eq!(loc.volumes_len(), 2);

        loc.delete_volume(VolumeId(1), false).unwrap();
        assert_eq!(loc.volumes_len(), 1);
        assert!(loc.find_volume(VolumeId(1)).is_none());
    }

    #[test]
    fn test_disk_location_delete_collection() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        loc.create_volume(
            VolumeId(1),
            "pics",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        loc.create_volume(
            VolumeId(2),
            "pics",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        loc.create_volume(
            VolumeId(3),
            "docs",
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();
        assert_eq!(loc.volumes_len(), 3);

        loc.delete_collection("pics").unwrap();
        assert_eq!(loc.volumes_len(), 1);
        assert!(loc.find_volume(VolumeId(3)).is_some());
    }

    #[test]
    fn test_disk_location_delete_collection_removes_ec_volumes() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        let mut loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();

        let shard_path = format!("{}/pics_7.ec00", dir);
        std::fs::write(&shard_path, b"ec-shard").unwrap();

        loc.mount_ec_shards(VolumeId(7), "pics", &[0]).unwrap();
        assert!(loc.has_ec_volume(VolumeId(7)));
        assert!(std::path::Path::new(&shard_path).exists());
        assert!(std::path::Path::new(&format!("{}/pics_7.ecj", dir)).exists());

        loc.delete_collection("pics").unwrap();

        assert!(!loc.has_ec_volume(VolumeId(7)));
        assert!(!std::path::Path::new(&shard_path).exists());
        assert!(!std::path::Path::new(&format!("{}/pics_7.ecj", dir)).exists());
    }

    #[test]
    fn test_disk_location_persists_directory_uuid_and_tags() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        let loc = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            vec!["fast".to_string(), "ssd".to_string()],
        )
        .unwrap();
        let directory_uuid = loc.directory_uuid.clone();
        assert_eq!(loc.tags, vec!["fast".to_string(), "ssd".to_string()]);
        drop(loc);

        let reloaded = DiskLocation::new(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .unwrap();
        assert_eq!(reloaded.directory_uuid, directory_uuid);
    }
}

seaweed-volume/src/storage/erasure_coding/ec_decoder.rs (new file, 261 lines)
@@ -0,0 +1,261 @@
//! EC decoding: reconstruct a .dat file from EC shards.
//!
//! Rebuilds the original .dat + .idx files from data shards (.ec00-.ec09)
//! and the sorted index (.ecx) + deletion journal (.ecj).

use std::fs::File;
use std::io::{self, Read, Write};

use crate::storage::erasure_coding::ec_shard::*;
use crate::storage::idx;
use crate::storage::needle::needle::get_actual_size;
use crate::storage::super_block::SUPER_BLOCK_SIZE;
use crate::storage::types::*;
use crate::storage::volume::volume_file_name;

/// Calculate the .dat file size from the max offset entry in .ecx.
/// Reads the volume version from the first EC shard (.ec00) superblock,
/// then scans .ecx entries to find the largest (offset + needle_actual_size).
pub fn find_dat_file_size(dir: &str, collection: &str, volume_id: VolumeId) -> io::Result<i64> {
    let base = volume_file_name(dir, collection, volume_id);

    // Read the volume version from the .ec00 superblock
    let ec00_path = format!("{}.ec00", base);
    let mut ec00 = File::open(&ec00_path)?;
    let mut sb_buf = [0u8; SUPER_BLOCK_SIZE];
    ec00.read_exact(&mut sb_buf)?;
    let version = Version(sb_buf[0]);

    // Start with at least the superblock size
    let mut dat_size: i64 = SUPER_BLOCK_SIZE as i64;

    // Scan .ecx entries
    let ecx_path = format!("{}.ecx", base);
    let ecx_data = std::fs::read(&ecx_path)?;
    let entry_count = ecx_data.len() / NEEDLE_MAP_ENTRY_SIZE;

    for i in 0..entry_count {
        let start = i * NEEDLE_MAP_ENTRY_SIZE;
        let (_, offset, size) =
            idx_entry_from_bytes(&ecx_data[start..start + NEEDLE_MAP_ENTRY_SIZE]);
        if size.is_deleted() {
            continue;
        }
        let entry_stop = offset.to_actual_offset() + get_actual_size(size, version);
        if entry_stop > dat_size {
            dat_size = entry_stop;
        }
    }

    Ok(dat_size)
}
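The scan reduces to "superblock size, or the furthest live needle extent, whichever is larger". A toy model of that reduction, with hypothetical `(offset, actual_size, is_deleted)` tuples standing in for decoded .ecx entries:

```rust
// Toy model of the .ecx scan in find_dat_file_size. Entry tuples are
// hypothetical (offset, actual_size, is_deleted) values, not the real
// NEEDLE_MAP_ENTRY_SIZE records.
fn dat_size_from_entries(entries: &[(i64, i64, bool)], super_block_size: i64) -> i64 {
    let mut dat_size = super_block_size;
    for &(offset, actual_size, deleted) in entries {
        if deleted {
            continue; // tombstones contribute no .dat extent
        }
        dat_size = dat_size.max(offset + actual_size);
    }
    dat_size
}

fn main() {
    // An empty index yields just the superblock.
    assert_eq!(dat_size_from_entries(&[], 8), 8);
    // The live entry ending furthest into the file wins; deleted ones are skipped.
    let entries = [(8, 100, false), (108, 50, false), (500, 999, true)];
    assert_eq!(dat_size_from_entries(&entries, 8), 158);
    println!("ok");
}
```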

/// Reconstruct a .dat file from EC data shards.
///
/// Reads from .ec00-.ec09 and writes a new .dat file.
pub fn write_dat_file_from_shards(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    dat_file_size: i64,
    data_shards: usize,
) -> io::Result<()> {
    let base = volume_file_name(dir, collection, volume_id);
    let dat_path = format!("{}.dat", base);

    // Open the data shards
    let mut shards: Vec<EcVolumeShard> = (0..data_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    for shard in &mut shards {
        shard.open()?;
    }

    let mut dat_file = File::create(&dat_path)?;
    let mut remaining = dat_file_size;
    let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE;
    let small_block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let large_row_size = (large_block_size * data_shards) as i64;

    let mut shard_offset: u64 = 0;

    // Read large blocks
    while remaining >= large_row_size {
        for i in 0..data_shards {
            let mut buf = vec![0u8; large_block_size];
            shards[i].read_at(&mut buf, shard_offset)?;
            let to_write = large_block_size.min(remaining as usize);
            dat_file.write_all(&buf[..to_write])?;
            remaining -= to_write as i64;
            if remaining <= 0 {
                break;
            }
        }
        shard_offset += large_block_size as u64;
    }

    // Read small blocks
    while remaining > 0 {
        for i in 0..data_shards {
            let mut buf = vec![0u8; small_block_size];
            shards[i].read_at(&mut buf, shard_offset)?;
            let to_write = small_block_size.min(remaining as usize);
            dat_file.write_all(&buf[..to_write])?;
            remaining -= to_write as i64;
            if remaining <= 0 {
                break;
            }
        }
        shard_offset += small_block_size as u64;
    }

    for shard in &mut shards {
        shard.close();
    }

    dat_file.sync_all()?;
    Ok(())
}
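The layout this reconstruction assumes is row-major striping: block 0 of each shard in order, then block 1 of each shard, and so on. A simplified round-trip sketch with a single block size (the real code uses a large pass then a small pass; `stripe`/`unstripe` here are hypothetical helpers):

```rust
// Toy demonstration of row-major striping: the logical stream is cut into
// fixed-size blocks dealt round-robin across shards, and reconstruction
// reads them back row by row. Single block size for simplicity.
fn stripe(data: &[u8], shards: usize, block: usize) -> Vec<Vec<u8>> {
    let mut out = vec![Vec::new(); shards];
    for (i, chunk) in data.chunks(block).enumerate() {
        let mut padded = chunk.to_vec();
        padded.resize(block, 0); // zero-pad the tail block
        out[i % shards].extend_from_slice(&padded);
    }
    out
}

fn unstripe(shards: &[Vec<u8>], block: usize, total: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(total);
    let mut offset = 0;
    while out.len() < total {
        for shard in shards {
            // Stop mid-row once the original length is reached (drops padding).
            let take = block.min(total - out.len());
            out.extend_from_slice(&shard[offset..offset + take]);
            if out.len() == total {
                break;
            }
        }
        offset += block;
    }
    out
}

fn main() {
    let data: Vec<u8> = (0..=25).collect();
    let shards = stripe(&data, 3, 4);
    assert_eq!(unstripe(&shards, 4, data.len()), data);
    println!("ok");
}
```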

/// Write the .idx file from the .ecx index + .ecj deletion journal.
///
/// Copies sorted .ecx entries to .idx, then appends tombstones for
/// deleted needles from .ecj.
pub fn write_idx_file_from_ec_index(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
) -> io::Result<()> {
    let base = volume_file_name(dir, collection, volume_id);
    let ecx_path = format!("{}.ecx", base);
    let ecj_path = format!("{}.ecj", base);
    let idx_path = format!("{}.idx", base);

    // Copy .ecx to .idx
    std::fs::copy(&ecx_path, &idx_path)?;

    // Append deletions from .ecj as tombstones
    if std::path::Path::new(&ecj_path).exists() {
        let ecj_data = std::fs::read(&ecj_path)?;
        if !ecj_data.is_empty() {
            // append(true) implies write access
            let mut idx_file = std::fs::OpenOptions::new()
                .append(true)
                .open(&idx_path)?;

            let count = ecj_data.len() / NEEDLE_ID_SIZE;
            for i in 0..count {
                let start = i * NEEDLE_ID_SIZE;
                let needle_id = NeedleId::from_bytes(&ecj_data[start..start + NEEDLE_ID_SIZE]);
                idx::write_index_entry(
                    &mut idx_file,
                    needle_id,
                    Offset::default(),
                    TOMBSTONE_FILE_SIZE,
                )?;
            }
        }
    }

    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::storage::erasure_coding::ec_encoder;
    use crate::storage::needle::needle::Needle;
    use crate::storage::needle_map::NeedleMapKind;
    use crate::storage::volume::Volume;
    use tempfile::TempDir;

    #[test]
    fn test_ec_full_round_trip() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Create a volume with data
        let mut v = Volume::new(
            dir,
            dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        let test_data: Vec<(NeedleId, Vec<u8>)> = (1..=3)
            .map(|i| {
                let data = format!("EC round trip data for needle {}", i);
                (NeedleId(i), data.into_bytes())
            })
            .collect();

        for (id, data) in &test_data {
            let mut n = Needle {
                id: *id,
                cookie: Cookie(id.0 as u32),
                data: data.clone(),
                data_size: data.len() as u32,
                ..Needle::default()
            };
            v.write_needle(&mut n, true).unwrap();
        }
        v.sync_to_disk().unwrap();
        let original_dat_size = v.dat_file_size().unwrap();
        v.close();

        // Read the original .dat for comparison
        let original_dat = std::fs::read(format!("{}/1.dat", dir)).unwrap();

        // Encode to EC
        let data_shards = 10;
        let parity_shards = 4;
        ec_encoder::write_ec_files(dir, dir, "", VolumeId(1), data_shards, parity_shards).unwrap();

        // Delete the original .dat and .idx
        std::fs::remove_file(format!("{}/1.dat", dir)).unwrap();
        std::fs::remove_file(format!("{}/1.idx", dir)).unwrap();

        // Reconstruct from EC shards
        write_dat_file_from_shards(dir, "", VolumeId(1), original_dat_size as i64, data_shards)
            .unwrap();
        write_idx_file_from_ec_index(dir, "", VolumeId(1)).unwrap();

        // Verify the reconstructed .dat matches the original
        let reconstructed_dat = std::fs::read(format!("{}/1.dat", dir)).unwrap();
        assert_eq!(
            original_dat[..original_dat_size as usize],
            reconstructed_dat[..original_dat_size as usize],
            "reconstructed .dat should match original"
        );

        // Verify we can load and read from the reconstructed volume
        let v2 = Volume::new(
            dir,
            dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        for (id, expected_data) in &test_data {
            let mut n = Needle {
                id: *id,
                ..Needle::default()
            };
            v2.read_needle(&mut n).unwrap();
            assert_eq!(&n.data, expected_data, "needle {} data should match", id);
        }
    }
}

seaweed-volume/src/storage/erasure_coding/ec_encoder.rs (new file, 824 lines)
@@ -0,0 +1,824 @@
//! EC encoding: convert a .dat file into 10 data + 4 parity shards.
//!
//! Uses Reed-Solomon erasure coding. The .dat file is split into blocks
//! (1GB large, 1MB small) and encoded across 14 shard files.

use std::fs::File;
use std::io;
#[cfg(not(unix))]
use std::io::{Seek, SeekFrom};

use reed_solomon_erasure::galois_8::ReedSolomon;

use crate::storage::erasure_coding::ec_shard::*;
use crate::storage::idx;
use crate::storage::types::*;
use crate::storage::volume::volume_file_name;

/// Encode a .dat file into EC shard files.
///
/// Creates .ec00-.ec13 files in the same directory.
/// Also creates a sorted .ecx index from the .idx file.
pub fn write_ec_files(
    dir: &str,
    idx_dir: &str,
    collection: &str,
    volume_id: VolumeId,
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    let base = volume_file_name(dir, collection, volume_id);
    let dat_path = format!("{}.dat", base);
    let idx_base = volume_file_name(idx_dir, collection, volume_id);
    let idx_path = format!("{}.idx", idx_base);

    // Create a sorted .ecx from the .idx
    write_sorted_ecx_from_idx(&idx_path, &format!("{}.ecx", base))?;

    // Encode the .dat into shards
    let dat_file = File::open(&dat_path)?;
    let dat_size = dat_file.metadata()?.len() as i64;

    let rs = ReedSolomon::new(data_shards, parity_shards)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, format!("reed-solomon init: {:?}", e)))?;

    // Create the shard files
    let total_shards = data_shards + parity_shards;
    let mut shards: Vec<EcVolumeShard> = (0..total_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    for shard in &mut shards {
        shard.create()?;
    }

    // Encode in large blocks, then small blocks
    encode_dat_file(
        &dat_file,
        dat_size,
        &rs,
        &mut shards,
        data_shards,
        parity_shards,
    )?;

    // Close all shards
    for shard in &mut shards {
        shard.close();
    }

    Ok(())
}

/// Rebuild missing EC shard files from existing shards using Reed-Solomon reconstruction.
///
/// This does not require the `.dat` file, only the existing `.ecXX` shard files.
pub fn rebuild_ec_files(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    missing_shard_ids: &[u32],
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    if missing_shard_ids.is_empty() {
        return Ok(());
    }

    let rs = ReedSolomon::new(data_shards, parity_shards)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, format!("reed-solomon init: {:?}", e)))?;

    let total_shards = data_shards + parity_shards;
    let mut shards: Vec<EcVolumeShard> = (0..total_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    // Determine the exact shard size from the existing shards
    let mut shard_size = 0;
    for (i, shard) in shards.iter_mut().enumerate() {
        if !missing_shard_ids.contains(&(i as u32)) {
            if shard.open().is_ok() {
                let size = shard.file_size();
                if size > shard_size {
                    shard_size = size;
                }
            } else {
                return Err(io::Error::new(
                    io::ErrorKind::NotFound,
                    format!("missing non-rebuild shard {}", i),
                ));
            }
        }
    }

    if shard_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "all existing shards are empty, or no existing shard is available to determine the size",
        ));
    }

    // Create the missing shards for writing
    for i in missing_shard_ids {
        if let Some(shard) = shards.get_mut(*i as usize) {
            shard.create()?;
        }
    }

    let block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let mut remaining = shard_size;
    let mut offset: u64 = 0;

    // Process all data in blocks
    while remaining > 0 {
        let to_process = remaining.min(block_size as i64) as usize;

        // Allocate buffers for all shards; Option<Vec<u8>> is required by rs.reconstruct()
        let mut buffers: Vec<Option<Vec<u8>>> = vec![None; total_shards];

        // Read the available shards
        for (i, shard) in shards.iter().enumerate() {
            if !missing_shard_ids.contains(&(i as u32)) {
                let mut buf = vec![0u8; to_process];
                shard.read_at(&mut buf, offset)?;
                buffers[i] = Some(buf);
            }
        }

        // Reconstruct the missing shards
        rs.reconstruct(&mut buffers).map_err(|e| {
            io::Error::new(
                io::ErrorKind::Other,
                format!("reed-solomon reconstruct: {:?}", e),
            )
        })?;

        // Write the recovered data into the missing shards
        for i in missing_shard_ids {
            let idx = *i as usize;
            if let Some(buf) = buffers[idx].take() {
                shards[idx].write_all(&buf)?;
            }
        }

        offset += to_process as u64;
        remaining -= to_process as i64;
    }

    // Close all shards
    for shard in &mut shards {
        shard.close();
    }

    Ok(())
}
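The block-wise "read survivors, reconstruct, write back the gaps" loop does not depend on the code itself being Reed-Solomon. A toy analogue with single XOR parity (which can recover exactly one missing shard, unlike RS with its `parity_shards` tolerance; `rebuild_missing` is a hypothetical name):

```rust
// Toy analogue of the rebuild loop: with single XOR parity, any one missing
// shard (data or parity) is the XOR of all the others, block by block.
fn rebuild_missing(shards: &mut [Option<Vec<u8>>], block: usize) -> Option<()> {
    let missing = shards.iter().position(|s| s.is_none())?;
    let mut rebuilt = vec![0u8; block];
    for (i, shard) in shards.iter().enumerate() {
        if i == missing {
            continue;
        }
        // XOR every surviving shard into the rebuilt buffer.
        for (b, byte) in shard.as_ref()?.iter().enumerate() {
            rebuilt[b] ^= byte;
        }
    }
    shards[missing] = Some(rebuilt);
    Some(())
}

fn main() {
    let d0 = vec![1u8, 2, 3, 4];
    let d1 = vec![5u8, 6, 7, 8];
    // Parity is the XOR of the data shards.
    let parity: Vec<u8> = d0.iter().zip(&d1).map(|(a, b)| a ^ b).collect();
    // Lose d1, then rebuild it from d0 and the parity shard.
    let mut shards = vec![Some(d0), None, Some(parity)];
    rebuild_missing(&mut shards, 4).unwrap();
    assert_eq!(shards[1].as_ref().unwrap(), &vec![5u8, 6, 7, 8]);
    println!("ok");
}
```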

/// Verify EC shards by recomputing parity from the data shards and identifying corrupted shards.
pub fn verify_ec_shards(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<(Vec<u32>, Vec<String>)> {
    let rs = ReedSolomon::new(data_shards, parity_shards)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, format!("reed-solomon init: {:?}", e)))?;

    let total_shards = data_shards + parity_shards;
    let mut shards: Vec<EcVolumeShard> = (0..total_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    let mut shard_size = 0;
    let mut broken_shards = std::collections::HashSet::new();
    let mut details = Vec::new();

    for (i, shard) in shards.iter_mut().enumerate() {
        if shard.open().is_ok() {
            let size = shard.file_size();
            if size > shard_size {
                shard_size = size;
            }
        } else {
            broken_shards.insert(i as u32);
            details.push(format!("failed to open or missing shard {}", i));
        }
    }

    if shard_size == 0 || broken_shards.len() >= parity_shards {
        // Can't verify if we don't know the size or too many shards are missing
        return Ok((broken_shards.into_iter().collect(), details));
    }

    let block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let mut remaining = shard_size;
    let mut offset: u64 = 0;

    while remaining > 0 {
        let to_process = remaining.min(block_size as i64) as usize;
        let mut buffers = vec![vec![0u8; to_process]; total_shards];

        let mut read_failed = false;
        for i in 0..total_shards {
            if !broken_shards.contains(&(i as u32)) {
                if let Err(e) = shards[i].read_at(&mut buffers[i], offset) {
                    broken_shards.insert(i as u32);
                    details.push(format!("read error shard {}: {}", i, e));
                    read_failed = true;
                }
            } else {
                read_failed = true;
            }
        }

        // Only verify if all shards were readable
        if !read_failed {
            // rs.verify needs &[&[u8]], so collect slice views of the buffers
            let slice_ptrs: Vec<&[u8]> = buffers.iter().map(|v| v.as_slice()).collect();
            if let Ok(is_valid) = rs.verify(&slice_ptrs) {
                if !is_valid {
                    // Reed-Solomon verification failed. To pinpoint the corrupted
                    // shards, re-encode parity from the first `data_shards` buffers
                    // and compare each recomputed shard to the one read from disk.
                    let mut verify_buffers = buffers.clone();
                    // Clear the parity parts before re-encoding
                    for i in data_shards..total_shards {
                        verify_buffers[i].fill(0);
                    }
                    if rs.encode(&mut verify_buffers).is_ok() {
                        for i in 0..total_shards {
                            if buffers[i] != verify_buffers[i] {
                                broken_shards.insert(i as u32);
                                details.push(format!(
                                    "parity mismatch on shard {} at offset {}",
                                    i, offset
                                ));
                            }
                        }
                    }
                }
            }
        }

        offset += to_process as u64;
        remaining -= to_process as i64;
    }

    // Close all shards
    for shard in &mut shards {
        shard.close();
    }

    let mut broken_vec: Vec<u32> = broken_shards.into_iter().collect();
    broken_vec.sort_unstable();

    Ok((broken_vec, details))
}

/// Write a sorted .ecx index from the .idx file.
fn write_sorted_ecx_from_idx(idx_path: &str, ecx_path: &str) -> io::Result<()> {
    if !std::path::Path::new(idx_path).exists() {
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            "idx file not found",
        ));
    }

    // Read all idx entries
    let mut idx_file = File::open(idx_path)?;
    let mut entries: Vec<(NeedleId, Offset, Size)> = Vec::new();

    idx::walk_index_file(&mut idx_file, 0, |key, offset, size| {
        entries.push((key, offset, size));
        Ok(())
    })?;

    // Sort by NeedleId, then by actual offset so later entries come last
    entries.sort_by_key(|&(key, offset, _)| (key, offset.to_actual_offset()));

    // Remove duplicates (keep the last/latest entry for each key).
    // dedup_by_key keeps the first in each run, so we reverse first,
    // dedup, then reverse back.
    entries.reverse();
    entries.dedup_by_key(|entry| entry.0);
    entries.reverse();

    // Write the sorted entries to .ecx
    let mut ecx_file = File::create(ecx_path)?;
    for &(key, offset, size) in &entries {
        idx::write_index_entry(&mut ecx_file, key, offset, size)?;
    }

    Ok(())
}
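The reverse/dedup/reverse trick is worth seeing in isolation: `Vec::dedup_by_key` keeps the *first* element of each consecutive run, so reversing first makes it keep what was originally the *last* entry per key. A standalone sketch over hypothetical `(key, offset)` pairs:

```rust
// Standalone sketch of the keep-latest dedup used in write_sorted_ecx_from_idx.
fn dedup_keep_last(mut entries: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    // Sort by key, then offset, so the latest write for a key sorts last.
    entries.sort_by_key(|&(key, offset)| (key, offset));
    // dedup_by_key keeps the first of each run; reverse so "first" means "latest".
    entries.reverse();
    entries.dedup_by_key(|&mut (key, _)| key);
    entries.reverse();
    entries
}

fn main() {
    // Key 7 was written twice; the later offset (200) must win.
    let entries = vec![(7, 100), (3, 50), (7, 200)];
    assert_eq!(dedup_keep_last(entries), vec![(3, 50), (7, 200)]);
    println!("ok");
}
```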
/// Rebuild the .ecx index file by walking needles in the EC data shards.
///
/// This is the equivalent of Go's `RebuildEcxFile`. It reads the logical .dat
/// content from the EC data shards, walks through needle headers to extract
/// (needle_id, offset, size) entries, deduplicates them, and writes a sorted
/// .ecx index file.
pub fn rebuild_ecx_file(
    dir: &str,
    collection: &str,
    volume_id: VolumeId,
    data_shards: usize,
) -> io::Result<()> {
    use crate::storage::needle::needle::get_actual_size;
    use crate::storage::super_block::SUPER_BLOCK_SIZE;

    let base = volume_file_name(dir, collection, volume_id);
    let ecx_path = format!("{}.ecx", base);

    // Open the data shards to read the logical .dat content
    let mut shards: Vec<EcVolumeShard> = (0..data_shards as u8)
        .map(|i| EcVolumeShard::new(dir, collection, volume_id, i))
        .collect();

    // Iterate by index: opening through `shards[i]` avoids holding a mutable
    // borrow of `shards` while the cleanup loop below needs another one.
    for i in 0..shards.len() {
        if shards[i].open().is_err() {
            // If a data shard is missing, we can't rebuild the .ecx
            for s in &mut shards {
                s.close();
            }
            return Err(io::Error::new(
                io::ErrorKind::NotFound,
                "cannot open data shard for ecx rebuild",
            ));
        }
    }

    // Determine the total logical data size from the shard sizes
    let shard_size = shards.iter().map(|s| s.file_size()).max().unwrap_or(0);
    let total_data_size = shard_size * data_shards as i64;

    // Read the version from the superblock (first byte of logical data)
    let mut sb_buf = [0u8; SUPER_BLOCK_SIZE];
    read_from_data_shards(&shards, &mut sb_buf, 0, data_shards)?;
    let version = Version(sb_buf[0]);

    // Walk needles starting after the superblock
    let mut offset = SUPER_BLOCK_SIZE as i64;
    let header_size = NEEDLE_HEADER_SIZE;
    let mut entries: Vec<(NeedleId, Offset, Size)> = Vec::new();

    while offset + header_size as i64 <= total_data_size {
        // Read the needle header (cookie + needle_id + size = 16 bytes)
        let mut header_buf = [0u8; NEEDLE_HEADER_SIZE];
        if read_from_data_shards(&shards, &mut header_buf, offset as u64, data_shards).is_err() {
            break;
        }

        let cookie = Cookie::from_bytes(&header_buf[..COOKIE_SIZE]);
        let needle_id = NeedleId::from_bytes(&header_buf[COOKIE_SIZE..COOKIE_SIZE + NEEDLE_ID_SIZE]);
        let size = Size::from_bytes(&header_buf[COOKIE_SIZE + NEEDLE_ID_SIZE..header_size]);

        // Stop if we hit a zero cookie+id (end of data)
        if cookie.0 == 0 && needle_id.0 == 0 {
            break;
        }

        // Validate that the size is reasonable
        if size.0 < 0 && !size.is_deleted() {
            break;
        }

        let actual_size = get_actual_size(size, version);
        if actual_size <= 0 || offset + actual_size > total_data_size {
            break;
        }

        entries.push((needle_id, Offset::from_actual_offset(offset), size));

        // Advance to the next needle (aligned to NEEDLE_PADDING_SIZE)
        offset += actual_size;
        let padding_rem = offset % NEEDLE_PADDING_SIZE as i64;
        if padding_rem != 0 {
            offset += NEEDLE_PADDING_SIZE as i64 - padding_rem;
        }
    }

    for shard in &mut shards {
        shard.close();
    }

    // Sort by NeedleId, then by offset (later entries override earlier ones)
    entries.sort_by_key(|&(key, offset, _)| (key, offset.to_actual_offset()));

    // Deduplicate: keep the latest entry per needle_id
    entries.reverse();
    entries.dedup_by_key(|entry| entry.0);
    entries.reverse();

    // Write the sorted .ecx
    let mut ecx_file = File::create(&ecx_path)?;
    for &(key, offset, size) in &entries {
        idx::write_index_entry(&mut ecx_file, key, offset, size)?;
    }
    ecx_file.sync_all()?;

    Ok(())
}
/// Read bytes from the EC data shards at a logical offset in the .dat file.
fn read_from_data_shards(
    shards: &[EcVolumeShard],
    buf: &mut [u8],
    logical_offset: u64,
    data_shards: usize,
) -> io::Result<()> {
    let small_block = ERASURE_CODING_SMALL_BLOCK_SIZE as u64;
    let data_shards_u64 = data_shards as u64;

    let mut bytes_read = 0u64;
    let mut remaining = buf.len() as u64;
    let mut current_offset = logical_offset;

    while remaining > 0 {
        // Determine which shard and which shard-offset this logical offset maps to.
        // The data is interleaved: large blocks first, then small blocks.
        // For simplicity, use the small block size for all calculations since
        // large blocks are multiples of small blocks.
        let row_size = small_block * data_shards_u64;
        let row_index = current_offset / row_size;
        let row_offset = current_offset % row_size;
        let shard_index = (row_offset / small_block) as usize;
        let shard_offset = row_index * small_block + (row_offset % small_block);

        if shard_index >= data_shards {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "shard index out of range",
            ));
        }

        // How many bytes can we read from this position in this shard block?
        let bytes_left_in_block = small_block - (row_offset % small_block);
        let to_read = remaining.min(bytes_left_in_block) as usize;

        let dest = &mut buf[bytes_read as usize..bytes_read as usize + to_read];
        shards[shard_index].read_at(dest, shard_offset)?;

        bytes_read += to_read as u64;
        remaining -= to_read as u64;
        current_offset += to_read as u64;
    }

    Ok(())
}
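The striping arithmetic in `read_from_data_shards` can be sketched in isolation. This toy version uses a hypothetical 4-byte small block and 3 data shards rather than the real 1MB/10 values:

```rust
// Map a logical .dat offset to (shard_index, offset_within_shard), mirroring
// the row/column arithmetic above. Block size and shard count are toy values.
fn map_offset(logical: u64, small_block: u64, data_shards: u64) -> (u64, u64) {
    let row_size = small_block * data_shards; // bytes per striped row
    let row_index = logical / row_size;       // which row of blocks
    let row_offset = logical % row_size;      // position within the row
    let shard_index = row_offset / small_block;
    let shard_offset = row_index * small_block + (row_offset % small_block);
    (shard_index, shard_offset)
}

fn main() {
    // Logical bytes stripe across shards 0,1,2 in 4-byte blocks.
    assert_eq!(map_offset(0, 4, 3), (0, 0));
    assert_eq!(map_offset(5, 4, 3), (1, 1));  // second block of the row, byte 1
    assert_eq!(map_offset(13, 4, 3), (0, 5)); // second row, back on shard 0
    println!("ok");
}
```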
/// Encode the .dat file data into shard files.
///
/// Uses a two-phase approach matching Go's ec_encoder.go:
/// 1. Process as many large blocks (1GB) as possible
/// 2. Process the remaining data with small blocks (1MB)
fn encode_dat_file(
    dat_file: &File,
    dat_size: i64,
    rs: &ReedSolomon,
    shards: &mut [EcVolumeShard],
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    let mut remaining = dat_size;
    let mut offset: u64 = 0;

    // Phase 1: process large blocks (1GB each) while enough data remains
    let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE;
    let large_row_size = large_block_size * data_shards;

    while remaining >= large_row_size as i64 {
        encode_one_batch(
            dat_file,
            offset,
            large_block_size,
            rs,
            shards,
            data_shards,
            parity_shards,
        )?;
        offset += large_row_size as u64;
        remaining -= large_row_size as i64;
    }

    // Phase 2: process the remaining data with small blocks (1MB each)
    let small_block_size = ERASURE_CODING_SMALL_BLOCK_SIZE;
    let small_row_size = small_block_size * data_shards;

    while remaining > 0 {
        let to_process = remaining.min(small_row_size as i64);
        encode_one_batch(
            dat_file,
            offset,
            small_block_size,
            rs,
            shards,
            data_shards,
            parity_shards,
        )?;
        offset += to_process as u64;
        remaining -= to_process;
    }

    Ok(())
}
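The two-phase loop above reduces to a simple batching plan; a sketch with shrunk, illustrative row sizes (large row = 8 bytes, small row = 2) rather than the real 1GB/1MB blocks:

```rust
// Consume whole large rows first, then finish with small rows, as in
// encode_dat_file above. Row sizes are toy values for readability.
fn plan_batches(mut remaining: i64) -> Vec<i64> {
    let (large_row, small_row) = (8i64, 2i64);
    let mut batches = Vec::new();
    while remaining >= large_row {
        batches.push(large_row);
        remaining -= large_row;
    }
    while remaining > 0 {
        let b = remaining.min(small_row);
        batches.push(b);
        remaining -= b;
    }
    batches
}

fn main() {
    // 19 bytes -> two large rows (8+8), one full small row (2), one partial (1).
    assert_eq!(plan_batches(19), vec![8, 8, 2, 1]);
    println!("ok");
}
```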
/// Encode one batch (row) of data.
fn encode_one_batch(
    dat_file: &File,
    offset: u64,
    block_size: usize,
    rs: &ReedSolomon,
    shards: &mut [EcVolumeShard],
    data_shards: usize,
    parity_shards: usize,
) -> io::Result<()> {
    let total_shards = data_shards + parity_shards;
    // Each batch allocates block_size * total_shards bytes.
    // With large blocks (1 GiB) this is 14 GiB -- guard against OOM.
    let total_alloc = block_size.checked_mul(total_shards).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::InvalidInput,
            "block_size * shard count overflows usize",
        )
    })?;
    // Large-block encoding uses 1 GiB * 14 shards = 14 GiB; allow up to 16 GiB.
    const MAX_BATCH_ALLOC: usize = 16 * 1024 * 1024 * 1024; // 16 GiB safety limit
    if total_alloc > MAX_BATCH_ALLOC {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!(
                "batch allocation too large ({} bytes, limit {} bytes); block_size={} shards={}",
                total_alloc, MAX_BATCH_ALLOC, block_size, total_shards,
            ),
        ));
    }

    // Allocate buffers for all shards
    let mut buffers: Vec<Vec<u8>> = (0..total_shards).map(|_| vec![0u8; block_size]).collect();

    // Read the data shards from the .dat file. Short reads at EOF leave the
    // remainder of the buffer zero-filled, which is the intended EC padding.
    for i in 0..data_shards {
        let read_offset = offset + (i * block_size) as u64;

        #[cfg(unix)]
        {
            use std::os::unix::fs::FileExt;
            dat_file.read_at(&mut buffers[i], read_offset)?;
        }

        #[cfg(not(unix))]
        {
            use std::io::{Read, Seek, SeekFrom};
            let mut f = dat_file.try_clone()?;
            f.seek(SeekFrom::Start(read_offset))?;
            f.read(&mut buffers[i])?;
        }
    }

    // Encode the parity shards
    rs.encode(&mut buffers).map_err(|e| {
        io::Error::new(
            io::ErrorKind::Other,
            format!("reed-solomon encode: {:?}", e),
        )
    })?;

    // Write all shard buffers to their files
    for (i, buf) in buffers.iter().enumerate() {
        shards[i].write_all(buf)?;
    }

    Ok(())
}
#[cfg(test)]
mod tests {
    use super::*;
    use crate::storage::needle::needle::Needle;
    use crate::storage::needle_map::NeedleMapKind;
    use crate::storage::volume::Volume;
    use tempfile::TempDir;

    #[test]
    fn test_ec_encode_decode_round_trip() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Create a volume with some data
        let mut v = Volume::new(
            dir,
            dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        for i in 1..=5 {
            let data = format!("test data for needle {}", i);
            let mut n = Needle {
                id: NeedleId(i),
                cookie: Cookie(i as u32),
                data: data.as_bytes().to_vec(),
                data_size: data.len() as u32,
                ..Needle::default()
            };
            v.write_needle(&mut n, true).unwrap();
        }
        v.sync_to_disk().unwrap();
        v.close();

        // Encode to EC shards
        let data_shards = 10;
        let parity_shards = 4;
        let total_shards = data_shards + parity_shards;
        write_ec_files(dir, dir, "", VolumeId(1), data_shards, parity_shards).unwrap();

        // Verify the shard files exist
        for i in 0..total_shards {
            let path = format!("{}/{}.ec{:02}", dir, 1, i);
            assert!(
                std::path::Path::new(&path).exists(),
                "shard file {} should exist",
                path
            );
        }

        // Verify .ecx exists
        let ecx_path = format!("{}/1.ecx", dir);
        assert!(std::path::Path::new(&ecx_path).exists());
    }

    #[test]
    fn test_reed_solomon_basic() {
        let data_shards = 10;
        let parity_shards = 4;
        let total_shards = data_shards + parity_shards;
        let rs = ReedSolomon::new(data_shards, parity_shards).unwrap();
        let block_size = 1024;
        let mut shards: Vec<Vec<u8>> = (0..total_shards)
            .map(|i| {
                if i < data_shards {
                    vec![(i as u8).wrapping_mul(7); block_size]
                } else {
                    vec![0u8; block_size]
                }
            })
            .collect();

        // Encode
        rs.encode(&mut shards).unwrap();

        // Verify that at least some parity bytes are non-zero
        let parity_nonzero: bool = shards[data_shards..]
            .iter()
            .any(|s| s.iter().any(|&b| b != 0));
        assert!(parity_nonzero);

        // Simulate losing 4 shards and reconstructing
        let original_0 = shards[0].clone();
        let original_1 = shards[1].clone();

        let mut shard_opts: Vec<Option<Vec<u8>>> = shards.into_iter().map(Some).collect();
        shard_opts[0] = None;
        shard_opts[1] = None;
        shard_opts[2] = None;
        shard_opts[3] = None;

        rs.reconstruct(&mut shard_opts).unwrap();

        assert_eq!(shard_opts[0].as_ref().unwrap(), &original_0);
        assert_eq!(shard_opts[1].as_ref().unwrap(), &original_1);
    }

    /// EC encode must read .idx from a separate index directory when configured.
    #[test]
    fn test_ec_encode_with_separate_idx_dir() {
        let dat_tmp = TempDir::new().unwrap();
        let idx_tmp = TempDir::new().unwrap();
        let dat_dir = dat_tmp.path().to_str().unwrap();
        let idx_dir = idx_tmp.path().to_str().unwrap();

        // Create a volume with separate data and index directories
        let mut v = Volume::new(
            dat_dir,
            idx_dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        for i in 1..=5 {
            let data = format!("needle {} payload", i);
            let mut n = Needle {
                id: NeedleId(i),
                cookie: Cookie(i as u32),
                data: data.as_bytes().to_vec(),
                data_size: data.len() as u32,
                ..Needle::default()
            };
            v.write_needle(&mut n, true).unwrap();
        }
        v.sync_to_disk().unwrap();
        v.close();

        // Verify .dat is in the data dir and .idx is in the idx dir
        assert!(std::path::Path::new(&format!("{}/1.dat", dat_dir)).exists());
        assert!(!std::path::Path::new(&format!("{}/1.idx", dat_dir)).exists());
        assert!(std::path::Path::new(&format!("{}/1.idx", idx_dir)).exists());
        assert!(!std::path::Path::new(&format!("{}/1.dat", idx_dir)).exists());

        // EC encode with the separate idx dir
        let data_shards = 10;
        let parity_shards = 4;
        let total_shards = data_shards + parity_shards;
        write_ec_files(
            dat_dir,
            idx_dir,
            "",
            VolumeId(1),
            data_shards,
            parity_shards,
        )
        .unwrap();

        // Verify all 14 shard files are in the data dir
        for i in 0..total_shards {
            let path = format!("{}/1.ec{:02}", dat_dir, i);
            assert!(
                std::path::Path::new(&path).exists(),
                "shard {} should exist in data dir",
                path
            );
        }

        // Verify .ecx is in the data dir (not the idx dir)
        assert!(std::path::Path::new(&format!("{}/1.ecx", dat_dir)).exists());
        assert!(!std::path::Path::new(&format!("{}/1.ecx", idx_dir)).exists());

        // Verify no shard files leaked into the idx dir
        for i in 0..total_shards {
            let path = format!("{}/1.ec{:02}", idx_dir, i);
            assert!(
                !std::path::Path::new(&path).exists(),
                "shard {} should NOT exist in idx dir",
                path
            );
        }
    }

    /// EC encode should fail gracefully when .idx is only in the data dir
    /// but we pass a wrong idx_dir. This guards against regressions where
    /// write_ec_files ignores the idx_dir parameter.
    #[test]
    fn test_ec_encode_fails_with_wrong_idx_dir() {
        let dat_tmp = TempDir::new().unwrap();
        let idx_tmp = TempDir::new().unwrap();
        let wrong_tmp = TempDir::new().unwrap();
        let dat_dir = dat_tmp.path().to_str().unwrap();
        let idx_dir = idx_tmp.path().to_str().unwrap();
        let wrong_dir = wrong_tmp.path().to_str().unwrap();

        let mut v = Volume::new(
            dat_dir,
            idx_dir,
            "",
            VolumeId(1),
            NeedleMapKind::InMemory,
            None,
            None,
            0,
            Version::current(),
        )
        .unwrap();

        let mut n = Needle {
            id: NeedleId(1),
            cookie: Cookie(1),
            data: b"hello".to_vec(),
            data_size: 5,
            ..Needle::default()
        };
        v.write_needle(&mut n, true).unwrap();
        v.sync_to_disk().unwrap();
        v.close();

        // Should fail: .idx is in idx_dir, not wrong_dir
        let result = write_ec_files(dat_dir, wrong_dir, "", VolumeId(1), 10, 4);
        assert!(
            result.is_err(),
            "should fail when idx_dir doesn't contain .idx"
        );
    }
}
seaweed-volume/src/storage/erasure_coding/ec_locate.rs (new file, 223 lines)
@@ -0,0 +1,223 @@
//! EC data location: maps needle offset/size to shard intervals.
//!
//! Determines which shard(s) contain data for a given needle and at what
//! offsets within those shards. Handles both large (1GB) and small (1MB)
//! block sections.

use crate::storage::erasure_coding::ec_shard::*;
use crate::storage::types::*;

/// An interval to read from EC shards.
#[derive(Debug, Clone)]
pub struct Interval {
    pub block_index: usize,
    pub inner_block_offset: i64,
    pub size: i64,
    pub is_large_block: bool,
    pub large_block_rows_count: usize,
}

impl Interval {
    pub fn to_shard_id_and_offset(&self, data_shards: u32) -> (ShardId, i64) {
        let data_shards_usize = data_shards as usize;
        let shard_id = (self.block_index % data_shards_usize) as ShardId;
        let row_index = self.block_index / data_shards_usize;

        let block_size = if self.is_large_block {
            ERASURE_CODING_LARGE_BLOCK_SIZE as i64
        } else {
            ERASURE_CODING_SMALL_BLOCK_SIZE as i64
        };

        let mut offset = row_index as i64 * block_size + self.inner_block_offset;
        if !self.is_large_block {
            // Small blocks come after the large blocks in the shard file
            offset += self.large_block_rows_count as i64 * ERASURE_CODING_LARGE_BLOCK_SIZE as i64;
        }

        (shard_id, offset)
    }
}
/// Locate the EC shard intervals needed to read data at the given offset and size.
///
/// `shard_size` is the size of a single shard file.
pub fn locate_data(offset: i64, size: Size, shard_size: i64, data_shards: u32) -> Vec<Interval> {
    let mut intervals = Vec::new();
    let data_size = size.0 as i64;

    if data_size <= 0 || shard_size <= 0 {
        return intervals;
    }

    let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE as i64;
    let small_block_size = ERASURE_CODING_SMALL_BLOCK_SIZE as i64;
    let large_row_size = large_block_size * data_shards as i64;
    let small_row_size = small_block_size * data_shards as i64;

    // Number of large block rows (shard_size > 0 is guaranteed by the guard above)
    let n_large_block_rows = ((shard_size - 1) / large_block_size) as usize;
    let large_section_size = n_large_block_rows as i64 * large_row_size;

    let mut remaining_offset = offset;
    let mut remaining_size = data_size;

    // In the large block section?
    if remaining_offset < large_section_size {
        let available_in_large = large_section_size - remaining_offset;
        let to_read = remaining_size.min(available_in_large);

        add_intervals(
            &mut intervals,
            remaining_offset,
            to_read,
            large_block_size,
            large_row_size,
            true,
            n_large_block_rows,
        );

        remaining_offset += to_read;
        remaining_size -= to_read;
    }

    // In the small block section?
    if remaining_size > 0 {
        let small_offset = remaining_offset - large_section_size;
        add_intervals(
            &mut intervals,
            small_offset,
            remaining_size,
            small_block_size,
            small_row_size,
            false,
            n_large_block_rows,
        );
    }

    intervals
}
fn add_intervals(
    intervals: &mut Vec<Interval>,
    offset: i64,
    size: i64,
    block_size: i64,
    _row_size: i64,
    is_large_block: bool,
    large_block_rows_count: usize,
) {
    let mut pos = offset;
    let end = offset + size;

    while pos < end {
        let block_index = (pos / block_size) as usize;
        let inner_offset = pos % block_size;
        let remaining_in_block = block_size - inner_offset;
        let interval_size = remaining_in_block.min(end - pos);

        intervals.push(Interval {
            block_index,
            inner_block_offset: inner_offset,
            size: interval_size,
            is_large_block,
            large_block_rows_count,
        });

        pos += interval_size;
    }
}
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_interval_to_shard_id() {
        let data_shards = 10;
        let large_block_size = ERASURE_CODING_LARGE_BLOCK_SIZE as i64;

        // Block index 0 -> shard 0
        let interval = Interval {
            block_index: 0,
            inner_block_offset: 100,
            size: 50,
            is_large_block: true,
            large_block_rows_count: 1,
        };
        let (shard_id, offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 0);
        assert_eq!(offset, 100);

        // Block index 5 -> shard 5
        let interval = Interval {
            block_index: 5,
            inner_block_offset: 0,
            size: 1024,
            is_large_block: true,
            large_block_rows_count: 1,
        };
        let (shard_id, _offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 5);

        // Block index 12 (data_shards=10) -> row_index 1, shard_id 2
        let interval = Interval {
            block_index: 12,
            inner_block_offset: 200,
            size: 50,
            is_large_block: true,
            large_block_rows_count: 5,
        };
        let (shard_id, offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 2); // 12 % 10 = 2
        assert_eq!(offset, large_block_size + 200); // row 1 offset + inner_block_offset

        // Block index 10 -> shard 0 (second row)
        let interval = Interval {
            block_index: 10,
            inner_block_offset: 0,
            size: 100,
            is_large_block: true,
            large_block_rows_count: 2,
        };
        let (shard_id, offset) = interval.to_shard_id_and_offset(data_shards);
        assert_eq!(shard_id, 0);
        assert_eq!(offset, ERASURE_CODING_LARGE_BLOCK_SIZE as i64); // row 1 offset
    }

    #[test]
    fn test_locate_data_small_file() {
        // Small file: 100 bytes at offset 50, shard size = 1MB
        let intervals = locate_data(50, Size(100), 1024 * 1024, 10);
        assert!(!intervals.is_empty());

        // Should be a single small-block interval (no large block rows for a 1MB shard)
        assert_eq!(intervals.len(), 1);
        assert!(!intervals[0].is_large_block);
    }

    #[test]
    fn test_locate_data_empty() {
        let intervals = locate_data(0, Size(0), 1024 * 1024, 10);
        assert!(intervals.is_empty());
    }

    #[test]
    fn test_small_block_after_large() {
        let interval = Interval {
            block_index: 0,
            inner_block_offset: 0,
            size: 100,
            is_large_block: false,
            large_block_rows_count: 2,
        };
        let (_shard_id, offset) = interval.to_shard_id_and_offset(10);
        // Should land after 2 large block rows
        assert_eq!(offset, 2 * ERASURE_CODING_LARGE_BLOCK_SIZE as i64);
    }
}
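The block-splitting loop in `add_intervals` can be exercised standalone; this sketch mirrors its arithmetic with a toy 10-byte block size and plain `(block_index, inner_offset, len)` tuples instead of `Interval`:

```rust
// Split a (offset, size) read into per-block intervals, as add_intervals does.
// block_size is a toy value (10 bytes) chosen purely for readability.
fn split(offset: i64, size: i64, block_size: i64) -> Vec<(usize, i64, i64)> {
    let mut out = Vec::new();
    let mut pos = offset;
    let end = offset + size;
    while pos < end {
        let block_index = (pos / block_size) as usize;
        let inner = pos % block_size;
        let len = (block_size - inner).min(end - pos); // clamp to block and to the read
        out.push((block_index, inner, len));
        pos += len;
    }
    out
}

fn main() {
    // A 25-byte read starting at offset 7 crosses four 10-byte blocks.
    assert_eq!(
        split(7, 25, 10),
        vec![(0, 7, 3), (1, 0, 10), (2, 0, 10), (3, 0, 2)]
    );
    println!("ok");
}
```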
seaweed-volume/src/storage/erasure_coding/ec_shard.rs (new file, 225 lines)
@@ -0,0 +1,225 @@
//! EcVolumeShard: a single shard file (.ec00-.ec13) of an erasure-coded volume.

use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};

use crate::storage::types::*;

pub const DATA_SHARDS_COUNT: usize = 10;
pub const PARITY_SHARDS_COUNT: usize = 4;
pub const TOTAL_SHARDS_COUNT: usize = DATA_SHARDS_COUNT + PARITY_SHARDS_COUNT;
pub const MAX_SHARD_COUNT: usize = 32;
pub const MIN_TOTAL_DISKS: usize = TOTAL_SHARDS_COUNT / PARITY_SHARDS_COUNT + 1;
pub const ERASURE_CODING_LARGE_BLOCK_SIZE: usize = 1024 * 1024 * 1024; // 1GB
pub const ERASURE_CODING_SMALL_BLOCK_SIZE: usize = 1024 * 1024; // 1MB

pub type ShardId = u8;

/// A single erasure-coded shard file.
pub struct EcVolumeShard {
    pub volume_id: VolumeId,
    pub shard_id: ShardId,
    pub collection: String,
    pub dir: String,
    pub disk_type: DiskType,
    ecd_file: Option<File>,
    ecd_file_size: i64,
}

impl EcVolumeShard {
    /// Create a new shard reference (does not open the file).
    pub fn new(dir: &str, collection: &str, volume_id: VolumeId, shard_id: ShardId) -> Self {
        EcVolumeShard {
            volume_id,
            shard_id,
            collection: collection.to_string(),
            dir: dir.to_string(),
            disk_type: DiskType::default(),
            ecd_file: None,
            ecd_file_size: 0,
        }
    }

    /// Shard file name, e.g. "dir/collection_42.ec03"
    pub fn file_name(&self) -> String {
        let base =
            crate::storage::volume::volume_file_name(&self.dir, &self.collection, self.volume_id);
        format!("{}.ec{:02}", base, self.shard_id)
    }

    /// Open the shard file for reading.
    pub fn open(&mut self) -> io::Result<()> {
        let path = self.file_name();
        let file = File::open(&path)?;
        self.ecd_file_size = file.metadata()?.len() as i64;
        self.ecd_file = Some(file);
        Ok(())
    }

    /// Create the shard file for writing.
    pub fn create(&mut self) -> io::Result<()> {
        let path = self.file_name();
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(&path)?;
        self.ecd_file = Some(file);
        self.ecd_file_size = 0;
        Ok(())
    }

    /// Read data at a specific offset.
    pub fn read_at(&self, buf: &mut [u8], offset: u64) -> io::Result<usize> {
        let file = self
            .ecd_file
            .as_ref()
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "shard file not open"))?;

        #[cfg(unix)]
        {
            use std::os::unix::fs::FileExt;
            file.read_at(buf, offset)
        }

        #[cfg(not(unix))]
        {
            use std::io::{Read, Seek, SeekFrom};
            // File::read_at is unix-only; fall back to seek + read.
            // We need a mutable handle for seek/read, so clone it.
            let mut f = file.try_clone()?;
            f.seek(SeekFrom::Start(offset))?;
            f.read(buf)
        }
    }

    /// Write data to the shard file (appends).
    pub fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
        let file = self
            .ecd_file
            .as_mut()
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "shard file not open"))?;
        file.write_all(data)?;
        self.ecd_file_size += data.len() as i64;
        Ok(())
    }

    pub fn file_size(&self) -> i64 {
        self.ecd_file_size
    }

    /// Close the shard file.
    pub fn close(&mut self) {
        if let Some(ref file) = self.ecd_file {
            let _ = file.sync_all();
        }
        self.ecd_file = None;
    }

    /// Delete the shard file from disk.
    pub fn destroy(&mut self) {
        self.close();
        let _ = fs::remove_file(self.file_name());
    }
}
/// ShardBits: bitmap tracking which shards are present.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ShardBits(pub u32);

impl ShardBits {
    pub fn add_shard_id(&mut self, id: ShardId) {
        assert!((id as usize) < 32, "shard id {} out of bounds (max 31)", id);
        self.0 |= 1 << id;
    }

    pub fn remove_shard_id(&mut self, id: ShardId) {
        assert!((id as usize) < 32, "shard id {} out of bounds (max 31)", id);
        self.0 &= !(1 << id);
    }

    pub fn has_shard_id(&self, id: ShardId) -> bool {
        if (id as usize) >= 32 {
            return false;
        }
        self.0 & (1 << id) != 0
    }

    pub fn shard_id_count(&self) -> usize {
        self.0.count_ones() as usize
    }

    /// List of the present shard IDs, in ascending order.
    pub fn shard_ids(&self) -> Vec<ShardId> {
        let mut ids = Vec::with_capacity(self.shard_id_count());
        for i in 0..32 {
            if self.has_shard_id(i) {
                ids.push(i);
            }
        }
        ids
    }

    pub fn minus(&self, other: ShardBits) -> ShardBits {
        ShardBits(self.0 & !other.0)
    }
}
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_shard_bits() {
        let mut bits = ShardBits::default();
        assert_eq!(bits.shard_id_count(), 0);

        bits.add_shard_id(0);
        bits.add_shard_id(3);
        bits.add_shard_id(13);
        assert_eq!(bits.shard_id_count(), 3);
        assert!(bits.has_shard_id(0));
        assert!(bits.has_shard_id(3));
        assert!(!bits.has_shard_id(1));

        bits.remove_shard_id(3);
        assert!(!bits.has_shard_id(3));
        assert_eq!(bits.shard_id_count(), 2);
    }

    #[test]
    fn test_shard_bits_ids() {
        let mut bits = ShardBits::default();
        bits.add_shard_id(1);
        bits.add_shard_id(5);
        bits.add_shard_id(9);
        assert_eq!(bits.shard_ids(), vec![1, 5, 9]);
    }

    #[test]
    fn test_shard_bits_minus() {
        let mut a = ShardBits::default();
        a.add_shard_id(0);
        a.add_shard_id(1);
        a.add_shard_id(2);

        let mut b = ShardBits::default();
        b.add_shard_id(1);

        let c = a.minus(b);
        assert_eq!(c.shard_ids(), vec![0, 2]);
    }

    #[test]
    fn test_shard_file_name() {
        let shard = EcVolumeShard::new("/data", "pics", VolumeId(42), 3);
        assert_eq!(shard.file_name(), "/data/pics_42.ec03");
    }

    #[test]
    fn test_shard_file_name_no_collection() {
        let shard = EcVolumeShard::new("/data", "", VolumeId(7), 13);
        assert_eq!(shard.file_name(), "/data/7.ec13");
    }
}
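The `ShardBits` operations are plain u32 bit twiddling; a minimal standalone sketch (illustrative values only):

```rust
// Add/remove/test/count shard IDs on a raw u32 bitmap, mirroring the
// ShardBits methods above.
fn main() {
    let mut bits: u32 = 0;
    bits |= 1 << 0;  // add shard 0
    bits |= 1 << 13; // add shard 13
    assert_eq!(bits.count_ones(), 2);
    assert!(bits & (1 << 13) != 0); // shard 13 present
    bits &= !(1 << 13); // remove shard 13
    assert_eq!(bits.count_ones(), 1);
    assert!(bits & (1 << 13) == 0);
    println!("ok");
}
```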
seaweed-volume/src/storage/erasure_coding/ec_volume.rs (new file, 944 lines)
@@ -0,0 +1,944 @@
|
||||
//! EcVolume: an erasure-coded volume with up to 14 shards.
|
||||
//!
|
||||
//! Each EcVolume has a sorted index (.ecx) and a deletion journal (.ecj).
|
||||
//! Shards (.ec00-.ec13) may be distributed across multiple servers.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::fs::{self, File, OpenOptions};
|
||||
use std::io::{self, Write};
|
||||
use std::time::{SystemTime, UNIX_EPOCH};
|
||||
|
||||
use crate::pb::master_pb;
|
||||
use crate::storage::erasure_coding::ec_locate;
|
||||
use crate::storage::erasure_coding::ec_shard::*;
|
||||
use crate::storage::needle::needle::{get_actual_size, Needle};
|
||||
use crate::storage::types::*;
|
||||
|
||||
/// An erasure-coded volume managing its local shards and index.
|
||||
pub struct EcVolume {
|
||||
pub volume_id: VolumeId,
|
||||
pub collection: String,
|
||||
pub dir: String,
|
||||
pub dir_idx: String,
|
||||
pub version: Version,
|
||||
pub shards: Vec<Option<EcVolumeShard>>, // indexed by ShardId (0..14)
|
||||
pub dat_file_size: i64,
|
||||
pub data_shards: u32,
|
||||
pub parity_shards: u32,
|
||||
ecx_file: Option<File>,
|
||||
ecx_file_size: i64,
|
||||
ecj_file: Option<File>,
|
||||
pub disk_type: DiskType,
|
||||
/// Directory where .ecx/.ecj were actually found (may differ from dir_idx after fallback).
|
||||
ecx_actual_dir: String,
|
||||
/// Maps shard ID -> list of server addresses where that shard exists.
|
||||
/// Used for distributed EC reads across the cluster.
|
||||
pub shard_locations: HashMap<ShardId, Vec<String>>,
|
||||
/// EC volume expiration time (unix epoch seconds), set during EC encode from TTL.
|
||||
pub expire_at_sec: u64,
|
||||
}
|
||||

/// Read the EC shard configuration (data/parity counts) from the volume's
/// .vif file, falling back to the compile-time defaults when the file is
/// missing, unparsable, or holds an invalid shard split.
pub fn read_ec_shard_config(dir: &str, collection: &str, volume_id: VolumeId) -> (u32, u32) {
    let mut data_shards = crate::storage::erasure_coding::ec_shard::DATA_SHARDS_COUNT as u32;
    let mut parity_shards = crate::storage::erasure_coding::ec_shard::PARITY_SHARDS_COUNT as u32;
    let base = crate::storage::volume::volume_file_name(dir, collection, volume_id);
    let vif_path = format!("{}.vif", base);
    if let Ok(vif_content) = std::fs::read_to_string(&vif_path) {
        if let Ok(vif_info) =
            serde_json::from_str::<crate::storage::volume::VifVolumeInfo>(&vif_content)
        {
            if let Some(ec) = vif_info.ec_shard_config {
                if ec.data_shards > 0
                    && ec.parity_shards > 0
                    && (ec.data_shards + ec.parity_shards) <= TOTAL_SHARDS_COUNT as u32
                {
                    data_shards = ec.data_shards;
                    parity_shards = ec.parity_shards;
                }
            }
        }
    }
    (data_shards, parity_shards)
}

impl EcVolume {
    /// Create a new EcVolume. Loads the .ecx index and .ecj journal if present.
    pub fn new(
        dir: &str,
        dir_idx: &str,
        collection: &str,
        volume_id: VolumeId,
    ) -> io::Result<Self> {
        let (data_shards, parity_shards) = read_ec_shard_config(dir, collection, volume_id);

        let total_shards = (data_shards + parity_shards) as usize;
        let mut shards = Vec::with_capacity(total_shards);
        for _ in 0..total_shards {
            shards.push(None);
        }

        // Read expire_at_sec and version from .vif if present (matches Go's MaybeLoadVolumeInfo)
        let (expire_at_sec, vif_version) = {
            let base = crate::storage::volume::volume_file_name(dir, collection, volume_id);
            let vif_path = format!("{}.vif", base);
            if let Ok(vif_content) = std::fs::read_to_string(&vif_path) {
                if let Ok(vif_info) =
                    serde_json::from_str::<crate::storage::volume::VifVolumeInfo>(&vif_content)
                {
                    let ver = if vif_info.version > 0 {
                        Version(vif_info.version as u8)
                    } else {
                        Version::current()
                    };
                    (vif_info.expire_at_sec, ver)
                } else {
                    (0, Version::current())
                }
            } else {
                (0, Version::current())
            }
        };

        let mut vol = EcVolume {
            volume_id,
            collection: collection.to_string(),
            dir: dir.to_string(),
            dir_idx: dir_idx.to_string(),
            version: vif_version,
            shards,
            dat_file_size: 0,
            data_shards,
            parity_shards,
            ecx_file: None,
            ecx_file_size: 0,
            ecj_file: None,
            disk_type: DiskType::default(),
            ecx_actual_dir: dir_idx.to_string(),
            shard_locations: HashMap::new(),
            expire_at_sec,
        };

        // Open the .ecx file (sorted index) in read/write mode for in-place deletion marking.
        // Matches Go, which opens the ecx for writing via MarkNeedleDeleted.
        let ecx_path = vol.ecx_file_name();
        if std::path::Path::new(&ecx_path).exists() {
            let file = OpenOptions::new().read(true).write(true).open(&ecx_path)?;
            vol.ecx_file_size = file.metadata()?.len() as i64;
            vol.ecx_file = Some(file);
        } else if dir_idx != dir {
            // Fall back to the data directory if .ecx was created before -dir.idx was configured
            let data_base = crate::storage::volume::volume_file_name(dir, collection, volume_id);
            let fallback_ecx = format!("{}.ecx", data_base);
            if std::path::Path::new(&fallback_ecx).exists() {
                tracing::info!(
                    volume_id = volume_id.0,
                    "ecx file not found in idx dir, falling back to data dir"
                );
                let file = OpenOptions::new().read(true).write(true).open(&fallback_ecx)?;
                vol.ecx_file_size = file.metadata()?.len() as i64;
                vol.ecx_file = Some(file);
                vol.ecx_actual_dir = dir.to_string();
            }
        }

        // Replay the .ecj journal into .ecx on startup (matches Go's RebuildEcxFile).
        vol.rebuild_ecx_from_journal()?;

        // Open the .ecj file (deletion journal); use ecx_actual_dir for consistency
        let ecj_base =
            crate::storage::volume::volume_file_name(&vol.ecx_actual_dir, collection, volume_id);
        let ecj_path = format!("{}.ecj", ecj_base);
        let ecj_file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .append(true)
            .open(&ecj_path)?;
        vol.ecj_file = Some(ecj_file);

        Ok(vol)
    }

    // ---- File names ----

    #[allow(dead_code)]
    fn base_name(&self) -> String {
        crate::storage::volume::volume_file_name(&self.dir, &self.collection, self.volume_id)
    }

    fn idx_base_name(&self) -> String {
        crate::storage::volume::volume_file_name(&self.dir_idx, &self.collection, self.volume_id)
    }

    pub fn ecx_file_name(&self) -> String {
        format!("{}.ecx", self.idx_base_name())
    }

    pub fn ecj_file_name(&self) -> String {
        format!("{}.ecj", self.idx_base_name())
    }

    /// Sync the EC volume's journal and index files to disk (matching Go's ecv.Sync()).
    /// Go flushes both .ecj and .ecx to ensure in-place deletion marks are persisted.
    pub fn sync_to_disk(&self) -> io::Result<()> {
        if let Some(ref ecj_file) = self.ecj_file {
            ecj_file.sync_all()?;
        }
        if let Some(ref ecx_file) = self.ecx_file {
            ecx_file.sync_all()?;
        }
        Ok(())
    }

    // ---- Shard management ----

    /// Add a shard to this volume.
    pub fn add_shard(&mut self, mut shard: EcVolumeShard) -> io::Result<()> {
        let id = shard.shard_id as usize;
        let total_shards = (self.data_shards + self.parity_shards) as usize;
        if id >= total_shards {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                format!("invalid shard id: {} (max {})", id, total_shards - 1),
            ));
        }
        shard.open()?;
        self.shards[id] = Some(shard);
        Ok(())
    }

    /// Remove and close a shard.
    pub fn remove_shard(&mut self, shard_id: ShardId) {
        if let Some(ref mut shard) = self.shards[shard_id as usize] {
            shard.close();
        }
        self.shards[shard_id as usize] = None;
    }

    /// Get a ShardBits bitmap of locally available shards.
    pub fn shard_bits(&self) -> ShardBits {
        let mut bits = ShardBits::default();
        for (i, shard) in self.shards.iter().enumerate() {
            if shard.is_some() {
                bits.add_shard_id(i as ShardId);
            }
        }
        bits
    }

    /// Count of locally available shards.
    pub fn shard_count(&self) -> usize {
        self.shards.iter().filter(|s| s.is_some()).count()
    }

    /// True if this volume has a TTL expiry (expire_at_sec > 0) and that time has passed.
    pub fn is_time_to_destroy(&self) -> bool {
        self.expire_at_sec > 0
            && SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap_or_default()
                .as_secs()
                > self.expire_at_sec
    }

    /// Build the heartbeat message describing locally held shards.
    /// Returns an empty Vec when no shards are present locally.
    pub fn to_volume_ec_shard_information_messages(
        &self,
        disk_id: u32,
    ) -> Vec<master_pb::VolumeEcShardInformationMessage> {
        let mut ec_index_bits: u32 = 0;
        let mut shard_sizes = Vec::new();
        for shard in self.shards.iter().flatten() {
            ec_index_bits |= 1u32 << shard.shard_id;
            shard_sizes.push(shard.file_size());
        }

        if ec_index_bits == 0 {
            return Vec::new();
        }

        vec![master_pb::VolumeEcShardInformationMessage {
            id: self.volume_id.0,
            collection: self.collection.clone(),
            ec_index_bits,
            shard_sizes,
            disk_type: self.disk_type.to_string(),
            expire_at_sec: self.expire_at_sec,
            disk_id,
            ..Default::default()
        }]
    }

    // ---- Shard locations (distributed tracking) ----

    /// Set the list of server addresses for a given shard ID.
    pub fn set_shard_locations(&mut self, shard_id: ShardId, locations: Vec<String>) {
        self.shard_locations.insert(shard_id, locations);
    }

    /// Get the list of server addresses for a given shard ID.
    pub fn get_shard_locations(&self, shard_id: ShardId) -> &[String] {
        self.shard_locations
            .get(&shard_id)
            .map(|v| v.as_slice())
            .unwrap_or(&[])
    }

    // ---- Index operations ----

    /// Find a needle's offset and size in the sorted .ecx index via binary search.
    pub fn find_needle_from_ecx(&self, needle_id: NeedleId) -> io::Result<Option<(Offset, Size)>> {
        let ecx_file = self
            .ecx_file
            .as_ref()
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "ecx file not open"))?;

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        if entry_count == 0 {
            return Ok(None);
        }

        // Binary search
        let mut lo: usize = 0;
        let mut hi: usize = entry_count;
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let file_offset = (mid * NEEDLE_MAP_ENTRY_SIZE) as u64;

            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                ecx_file.read_exact_at(&mut entry_buf, file_offset)?;
            }

            let (key, offset, size) = idx_entry_from_bytes(&entry_buf);
            if key == needle_id {
                return Ok(Some((offset, size)));
            } else if key < needle_id {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }

        Ok(None)
    }

    /// Locate the EC shard intervals needed to read a needle.
    pub fn locate_needle(
        &self,
        needle_id: NeedleId,
    ) -> io::Result<Option<(Offset, Size, Vec<ec_locate::Interval>)>> {
        let (offset, size) = match self.find_needle_from_ecx(needle_id)? {
            Some((o, s)) => (o, s),
            None => return Ok(None),
        };

        if size.is_deleted() || offset.is_zero() {
            return Ok(None);
        }

        // Match Go's LocateEcShardNeedleInterval: shardSize = shard.ecdFileSize - 1
        // Shards are usually padded to ErasureCodingSmallBlockSize, so subtract 1
        // to avoid off-by-one in the large block row count calculation.
        // If datFileSize is known, use datFileSize / DataShards instead.
        let shard_size = if self.dat_file_size > 0 {
            self.dat_file_size / self.data_shards as i64
        } else {
            self.shard_file_size() - 1
        };
        // Pass the actual on-disk size (header+body+checksum+timestamp+padding)
        // to locate_data, matching Go: types.Size(needle.GetActualSize(size, version))
        let actual = get_actual_size(size, self.version);
        let intervals = ec_locate::locate_data(
            offset.to_actual_offset(),
            Size(actual as i32),
            shard_size,
            self.data_shards,
        );

        Ok(Some((offset, size, intervals)))
    }

    /// Read a full needle from locally available EC shards.
    ///
    /// Locates the needle in the .ecx index, determines which shard intervals
    /// contain its data, reads from local shards, and parses the result into
    /// a fully populated Needle (including last_modified, checksum, ttl).
    ///
    /// Returns `Ok(None)` if the needle is not found or is deleted.
    /// Returns an error if a required shard is not available locally.
    pub fn read_ec_shard_needle(&self, needle_id: NeedleId) -> io::Result<Option<Needle>> {
        let (offset, size, intervals) = match self.locate_needle(needle_id)? {
            Some(v) => v,
            None => return Ok(None),
        };

        if intervals.is_empty() {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "no intervals for needle",
            ));
        }

        // Compute the total bytes we need to read (full needle on disk)
        let actual_size = get_actual_size(size, self.version) as usize;
        let mut bytes = Vec::with_capacity(actual_size);

        for interval in &intervals {
            let (shard_id, shard_offset) = interval.to_shard_id_and_offset(self.data_shards);
            let shard = self
                .shards
                .get(shard_id as usize)
                .and_then(|s| s.as_ref())
                .ok_or_else(|| {
                    io::Error::new(
                        io::ErrorKind::NotFound,
                        format!("ec shard {} not available locally", shard_id),
                    )
                })?;

            let mut buf = vec![0u8; interval.size as usize];
            shard.read_at(&mut buf, shard_offset as u64)?;
            bytes.extend_from_slice(&buf);
        }

        // Truncate to exact actual_size (intervals may span more than needed)
        bytes.truncate(actual_size);

        if bytes.len() < actual_size {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                format!(
                    "read {} bytes but need {} for needle {}",
                    bytes.len(),
                    actual_size,
                    needle_id
                ),
            ));
        }

        let mut n = Needle::default();
        n.read_bytes(&bytes, offset.to_actual_offset(), size, self.version)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, format!("{}", e)))?;

        Ok(Some(n))
    }

    /// Get the size of a single shard (all shards are the same size).
    fn shard_file_size(&self) -> i64 {
        self.shards
            .iter()
            .flatten()
            .map(|s| s.file_size())
            .next()
            .unwrap_or(0)
    }

    /// Walk the .ecx index and return (file_count, file_deleted_count, total_size).
    /// total_size sums size.Raw() for all entries (including deleted), matching Go's WalkIndex.
    pub fn walk_ecx_stats(&self) -> io::Result<(u64, u64, u64)> {
        let ecx_file = match self.ecx_file.as_ref() {
            Some(f) => f,
            None => return Ok((0, 0, 0)),
        };

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        let mut files: u64 = 0;
        let mut files_deleted: u64 = 0;
        let mut total_size: u64 = 0;
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        for i in 0..entry_count {
            let file_offset = (i * NEEDLE_MAP_ENTRY_SIZE) as u64;
            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                ecx_file.read_exact_at(&mut entry_buf, file_offset)?;
            }
            let (_key, _offset, size) = idx_entry_from_bytes(&entry_buf);
            // Match Go's Size.Raw(): tombstone (-1) returns 0, other negatives return abs
            if !size.is_tombstone() {
                total_size += size.0.unsigned_abs() as u64;
            }
            if size.is_deleted() {
                files_deleted += 1;
            } else {
                files += 1;
            }
        }

        Ok((files, files_deleted, total_size))
    }
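The accounting rule in `walk_ecx_stats` above is easy to get wrong, so here is a standalone sketch of it: a negative size marks a deleted needle, and the tombstone value (-1 in this sketch) contributes 0 bytes while other negative sizes still count their absolute value toward total size. The plain `i32` values and the `tally` helper are simplifications for illustration, not the crate's actual `Size` type.

```rust
// Sketch of the .ecx size accounting (mirroring Go's Size.Raw() semantics).
// Assumption for this sketch: sizes are bare i32s, tombstone == -1,
// and any negative size means "deleted".
const TOMBSTONE: i32 = -1;

/// Tally (live_files, deleted_files, total_bytes) over raw entry sizes.
fn tally(sizes: &[i32]) -> (u64, u64, u64) {
    let (mut files, mut deleted, mut total) = (0u64, 0u64, 0u64);
    for &s in sizes {
        // Tombstones contribute no bytes; other negatives count |s|.
        if s != TOMBSTONE {
            total += s.unsigned_abs() as u64;
        }
        if s < 0 {
            deleted += 1;
        } else {
            files += 1;
        }
    }
    (files, deleted, total)
}

fn main() {
    // live 100, live 200, deleted needle of 200 bytes, tombstone
    assert_eq!(tally(&[100, 200, -200, TOMBSTONE]), (2, 2, 500));
    println!("ok");
}
```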

    /// ScrubIndex verifies index integrity of an EC volume.
    /// Matches Go's `(ev *EcVolume) ScrubIndex()` -> `idx.CheckIndexFile()`.
    /// Returns (entry_count, errors).
    pub fn scrub_index(&self) -> (u64, Vec<String>) {
        let ecx_file = match self.ecx_file.as_ref() {
            Some(f) => f,
            None => {
                return (
                    0,
                    vec![format!(
                        "no ECX file associated with EC volume {}",
                        self.volume_id.0
                    )],
                )
            }
        };

        if self.ecx_file_size == 0 {
            return (
                0,
                vec![format!(
                    "zero-size ECX file for EC volume {}",
                    self.volume_id.0
                )],
            );
        }

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        let mut entries: Vec<(usize, NeedleId, i64, Size)> = Vec::with_capacity(entry_count);
        let mut errs: Vec<String> = Vec::new();
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        // Walk all entries
        for i in 0..entry_count {
            let file_offset = (i * NEEDLE_MAP_ENTRY_SIZE) as u64;
            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                if let Err(e) = ecx_file.read_exact_at(&mut entry_buf, file_offset) {
                    errs.push(format!("read ecx entry {}: {}", i, e));
                    continue;
                }
            }
            let (key, offset, size) = idx_entry_from_bytes(&entry_buf);
            entries.push((i, key, offset.to_actual_offset(), size));
        }

        // Sort by offset, then size
        entries.sort_by(|a, b| a.2.cmp(&b.2).then(a.3 .0.cmp(&b.3 .0)));

        // Check for overlapping needles
        for i in 1..entries.len() {
            let (idx, id, offset, size) = entries[i];
            let (_, last_id, last_offset, last_size) = entries[i - 1];

            let actual_size =
                crate::storage::needle::needle::get_actual_size(size, self.version);
            let end = if actual_size != 0 {
                offset + actual_size - 1
            } else {
                offset
            };

            let last_actual_size =
                crate::storage::needle::needle::get_actual_size(last_size, self.version);
            let last_end = if last_actual_size != 0 {
                last_offset + last_actual_size - 1
            } else {
                last_offset
            };

            if offset <= last_end {
                errs.push(format!(
                    "needle {} (#{}) at [{}-{}] overlaps needle {} at [{}-{}]",
                    id.0,
                    idx + 1,
                    offset,
                    end,
                    last_id.0,
                    last_offset,
                    last_end
                ));
            }
        }

        // Verify file size matches entry count
        let expected_size = entry_count as i64 * NEEDLE_MAP_ENTRY_SIZE as i64;
        if expected_size != self.ecx_file_size {
            errs.push(format!(
                "expected an index file of size {}, got {}",
                expected_size, self.ecx_file_size
            ));
        }

        (entries.len() as u64, errs)
    }

    // ---- Deletion ----

    /// Mark a needle as deleted in the .ecx file in-place.
    /// Matches Go's MarkNeedleDeleted: binary search the .ecx, then overwrite
    /// the size field with TOMBSTONE_FILE_SIZE.
    fn mark_needle_deleted_in_ecx(&self, needle_id: NeedleId) -> io::Result<bool> {
        let ecx_file = match self.ecx_file.as_ref() {
            Some(f) => f,
            None => return Ok(false),
        };

        let entry_count = self.ecx_file_size as usize / NEEDLE_MAP_ENTRY_SIZE;
        if entry_count == 0 {
            return Ok(false);
        }

        // Binary search for the needle
        let mut lo: usize = 0;
        let mut hi: usize = entry_count;
        let mut entry_buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];

        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let file_offset = (mid * NEEDLE_MAP_ENTRY_SIZE) as u64;

            #[cfg(unix)]
            {
                use std::os::unix::fs::FileExt;
                ecx_file.read_exact_at(&mut entry_buf, file_offset)?;
            }

            let (key, _offset, _size) = idx_entry_from_bytes(&entry_buf);
            if key == needle_id {
                // Found: overwrite the size field with TOMBSTONE_FILE_SIZE
                let size_offset = file_offset + NEEDLE_ID_SIZE as u64 + OFFSET_SIZE as u64;
                let mut size_buf = [0u8; SIZE_SIZE];
                TOMBSTONE_FILE_SIZE.to_bytes(&mut size_buf);
                #[cfg(unix)]
                {
                    use std::os::unix::fs::FileExt;
                    ecx_file.write_all_at(&size_buf, size_offset)?;
                }
                return Ok(true);
            } else if key < needle_id {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }

        Ok(false) // not found
    }

    /// Replay .ecj journal entries into .ecx on startup.
    /// Matches Go's RebuildEcxFile: for each needle ID in .ecj, marks it
    /// deleted in .ecx, then removes the .ecj file.
    fn rebuild_ecx_from_journal(&mut self) -> io::Result<()> {
        let ecj_path = self.ecj_file_name();
        if !std::path::Path::new(&ecj_path).exists() {
            return Ok(());
        }

        let data = fs::read(&ecj_path)?;
        if data.is_empty() {
            return Ok(());
        }

        let count = data.len() / NEEDLE_ID_SIZE;
        for i in 0..count {
            let start = i * NEEDLE_ID_SIZE;
            if start + NEEDLE_ID_SIZE > data.len() {
                break;
            }
            let needle_id = NeedleId::from_bytes(&data[start..start + NEEDLE_ID_SIZE]);
            // Errors for individual entries are non-fatal (needle may not exist in .ecx)
            let _ = self.mark_needle_deleted_in_ecx(needle_id);
        }

        // Remove the .ecj file after replay (matches Go)
        let _ = fs::remove_file(&ecj_path);

        // Re-create .ecj for future deletions
        let ecj_file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .append(true)
            .open(&ecj_path)?;
        self.ecj_file = Some(ecj_file);

        Ok(())
    }

    // ---- Deletion journal ----

    /// Append a deleted needle ID to the .ecj journal and mark it in .ecx.
    /// Matches Go's DeleteNeedleFromEcx: marks in .ecx first, then journals.
    pub fn journal_delete(&mut self, needle_id: NeedleId) -> io::Result<()> {
        // Mark deleted in .ecx in-place (matches Go's MarkNeedleDeleted)
        let _ = self.mark_needle_deleted_in_ecx(needle_id);
        let ecj_file = self
            .ecj_file
            .as_mut()
            .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "ecj file not open"))?;

        let mut buf = [0u8; NEEDLE_ID_SIZE];
        needle_id.to_bytes(&mut buf);
        ecj_file.write_all(&buf)?;
        ecj_file.sync_all()?;
        Ok(())
    }

    /// Append a deleted needle ID to the .ecj journal, validating the cookie first.
    /// Matches Go's DeleteEcShardNeedle, which validates the cookie before journaling.
    /// A cookie of 0 means skip the cookie check (e.g., orphan cleanup).
    pub fn journal_delete_with_cookie(
        &mut self,
        needle_id: NeedleId,
        cookie: crate::storage::types::Cookie,
    ) -> io::Result<()> {
        // cookie == 0 indicates SkipCookieCheck was requested
        if cookie.0 != 0 {
            // Try to read the needle's cookie from the EC shards to validate.
            // Look up the needle in the ecx index to find its offset, then read the header from the shard.
            if let Ok(Some((offset, size))) = self.find_needle_from_ecx(needle_id) {
                if !size.is_deleted() && !offset.is_zero() {
                    let actual_offset = offset.to_actual_offset() as u64;
                    // Determine which shard contains this offset and read the cookie
                    let shard_size = self
                        .shards
                        .iter()
                        .filter_map(|s| s.as_ref())
                        .map(|s| s.file_size())
                        .next()
                        .unwrap_or(0) as u64;
                    if shard_size > 0 {
                        let shard_id = (actual_offset / shard_size) as usize;
                        let shard_offset = actual_offset % shard_size;
                        if let Some(Some(shard)) = self.shards.get(shard_id) {
                            let mut header_buf = [0u8; 4]; // cookie is first 4 bytes of needle
                            if shard.read_at(&mut header_buf, shard_offset).is_ok() {
                                let needle_cookie =
                                    crate::storage::types::Cookie(u32::from_be_bytes(header_buf));
                                if needle_cookie != cookie {
                                    return Err(io::Error::new(
                                        io::ErrorKind::InvalidData,
                                        format!("unexpected cookie {:x}", cookie.0),
                                    ));
                                }
                            }
                        }
                    }
                }
            }
        }
        self.journal_delete(needle_id)
    }

    /// Read all deleted needle IDs from the .ecj journal.
    pub fn read_deleted_needles(&self) -> io::Result<Vec<NeedleId>> {
        let ecj_path = self.ecj_file_name();
        if !std::path::Path::new(&ecj_path).exists() {
            return Ok(Vec::new());
        }

        let data = fs::read(&ecj_path)?;
        let count = data.len() / NEEDLE_ID_SIZE;
        let mut needles = Vec::with_capacity(count);
        for i in 0..count {
            let start = i * NEEDLE_ID_SIZE;
            let id = NeedleId::from_bytes(&data[start..start + NEEDLE_ID_SIZE]);
            needles.push(id);
        }
        Ok(needles)
    }

    // ---- Lifecycle ----

    /// Close all shards and flush the index; files stay on disk.
    pub fn close(&mut self) {
        for shard in &mut self.shards {
            if let Some(s) = shard {
                s.close();
            }
            *shard = None;
        }
        // Sync .ecx before closing to flush in-place deletion marks (matches Go's ev.ecxFile.Sync())
        if let Some(ref ecx_file) = self.ecx_file {
            let _ = ecx_file.sync_all();
        }
        self.ecx_file = None;
        self.ecj_file = None;
    }

    /// Destroy all local shard files plus the .ecx, .ecj, and .vif metadata.
    pub fn destroy(&mut self) {
        for shard in &mut self.shards {
            if let Some(s) = shard {
                s.destroy();
            }
            *shard = None;
        }
        // Remove .ecx/.ecj/.vif from ecx_actual_dir (where they were found).
        // Go's Destroy() removes .ecx, .ecj, and .vif files.
        let actual_base = crate::storage::volume::volume_file_name(
            &self.ecx_actual_dir,
            &self.collection,
            self.volume_id,
        );
        let _ = fs::remove_file(format!("{}.ecx", actual_base));
        let _ = fs::remove_file(format!("{}.ecj", actual_base));
        let _ = fs::remove_file(format!("{}.vif", actual_base));
        // Also try the configured idx dir and data dir in case files exist in either
        if self.ecx_actual_dir != self.dir_idx {
            let _ = fs::remove_file(self.ecx_file_name());
            let _ = fs::remove_file(self.ecj_file_name());
            let idx_base = crate::storage::volume::volume_file_name(
                &self.dir_idx,
                &self.collection,
                self.volume_id,
            );
            let _ = fs::remove_file(format!("{}.vif", idx_base));
        }
        if self.ecx_actual_dir != self.dir && self.dir_idx != self.dir {
            let data_base = crate::storage::volume::volume_file_name(
                &self.dir,
                &self.collection,
                self.volume_id,
            );
            let _ = fs::remove_file(format!("{}.ecx", data_base));
            let _ = fs::remove_file(format!("{}.ecj", data_base));
            let _ = fs::remove_file(format!("{}.vif", data_base));
        }
        self.ecx_file = None;
        self.ecj_file = None;
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use tempfile::TempDir;

    fn write_ecx_file(
        dir: &str,
        collection: &str,
        vid: VolumeId,
        entries: &[(NeedleId, Offset, Size)],
    ) {
        let base = crate::storage::volume::volume_file_name(dir, collection, vid);
        let ecx_path = format!("{}.ecx", base);
        let mut file = File::create(&ecx_path).unwrap();

        // Write sorted entries
        for &(key, offset, size) in entries {
            let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
            idx_entry_to_bytes(&mut buf, key, offset, size);
            file.write_all(&buf).unwrap();
        }
    }

    #[test]
    fn test_ec_volume_find_needle() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Write sorted ecx entries
        let entries = vec![
            (NeedleId(1), Offset::from_actual_offset(8), Size(100)),
            (NeedleId(5), Offset::from_actual_offset(200), Size(200)),
            (NeedleId(10), Offset::from_actual_offset(500), Size(300)),
        ];
        write_ecx_file(dir, "", VolumeId(1), &entries);

        let vol = EcVolume::new(dir, dir, "", VolumeId(1)).unwrap();

        // Found
        let result = vol.find_needle_from_ecx(NeedleId(5)).unwrap();
        assert!(result.is_some());
        let (offset, size) = result.unwrap();
        assert_eq!(offset.to_actual_offset(), 200);
        assert_eq!(size, Size(200));

        // Not found
        let result = vol.find_needle_from_ecx(NeedleId(7)).unwrap();
        assert!(result.is_none());
    }

    #[test]
    fn test_ec_volume_journal() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();

        // Need ecx file for EcVolume::new to succeed
        write_ecx_file(dir, "", VolumeId(1), &[]);

        let mut vol = EcVolume::new(dir, dir, "", VolumeId(1)).unwrap();

        vol.journal_delete(NeedleId(10)).unwrap();
        vol.journal_delete(NeedleId(20)).unwrap();

        let deleted = vol.read_deleted_needles().unwrap();
        assert_eq!(deleted, vec![NeedleId(10), NeedleId(20)]);
    }

    #[test]
    fn test_ec_volume_shard_bits() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        write_ecx_file(dir, "", VolumeId(1), &[]);

        let mut vol = EcVolume::new(dir, dir, "", VolumeId(1)).unwrap();
        assert_eq!(vol.shard_count(), 0);

        // Create a shard file so we can add it
        let mut shard = EcVolumeShard::new(dir, "", VolumeId(1), 3);
        shard.create().unwrap();
        shard.write_all(&[0u8; 100]).unwrap();
        shard.close();

        vol.add_shard(EcVolumeShard::new(dir, "", VolumeId(1), 3))
            .unwrap();
        assert_eq!(vol.shard_count(), 1);
        assert!(vol.shard_bits().has_shard_id(3));
    }

    #[test]
    fn test_ec_volume_uses_collection_prefixed_vif_config() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        write_ecx_file(dir, "pics", VolumeId(1), &[]);

        let vif = crate::storage::volume::VifVolumeInfo {
            ec_shard_config: Some(crate::storage::volume::VifEcShardConfig {
                data_shards: 6,
                parity_shards: 3,
            }),
            ..Default::default()
        };
        let base = crate::storage::volume::volume_file_name(dir, "pics", VolumeId(1));
        std::fs::write(
            format!("{}.vif", base),
            serde_json::to_string_pretty(&vif).unwrap(),
        )
        .unwrap();

        let vol = EcVolume::new(dir, dir, "pics", VolumeId(1)).unwrap();
        assert_eq!(vol.data_shards, 6);
        assert_eq!(vol.parity_shards, 3);
    }

    #[test]
    fn test_ec_volume_invalid_vif_config_falls_back_to_defaults() {
        let tmp = TempDir::new().unwrap();
        let dir = tmp.path().to_str().unwrap();
        write_ecx_file(dir, "pics", VolumeId(1), &[]);

        let vif = crate::storage::volume::VifVolumeInfo {
            ec_shard_config: Some(crate::storage::volume::VifEcShardConfig {
                data_shards: 10,
                parity_shards: 10,
            }),
            ..Default::default()
        };
        let base = crate::storage::volume::volume_file_name(dir, "pics", VolumeId(1));
        std::fs::write(
            format!("{}.vif", base),
            serde_json::to_string_pretty(&vif).unwrap(),
        )
        .unwrap();

        let vol = EcVolume::new(dir, dir, "pics", VolumeId(1)).unwrap();
        assert_eq!(vol.data_shards, DATA_SHARDS_COUNT as u32);
        assert_eq!(vol.parity_shards, PARITY_SHARDS_COUNT as u32);
    }
}
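The `find_needle_from_ecx` lookup above can be illustrated in isolation: the .ecx file is a sorted array of fixed 17-byte entries (key 8 + offset 5 + size 4, big-endian), so lookup is a plain binary search over fixed-size records. This sketch operates on an in-memory byte slice with simplified scalar types (`u64` key/offset, `i32` size); `find_entry` and `entry_key` are illustrative helpers, not the crate's API.

```rust
// Minimal sketch of binary search over sorted 17-byte index entries,
// mirroring the .ecx lookup. Entry layout: key(8) + offset(5) + size(4),
// all big-endian. Types are simplified for illustration.
const ENTRY_SIZE: usize = 17;

fn entry_key(e: &[u8]) -> u64 {
    let mut k = [0u8; 8];
    k.copy_from_slice(&e[..8]);
    u64::from_be_bytes(k)
}

/// Binary-search a sorted index blob for `needle_id`; returns (offset, size).
fn find_entry(index: &[u8], needle_id: u64) -> Option<(u64, i32)> {
    let (mut lo, mut hi) = (0usize, index.len() / ENTRY_SIZE);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let e = &index[mid * ENTRY_SIZE..(mid + 1) * ENTRY_SIZE];
        let key = entry_key(e);
        if key == needle_id {
            // Decode the 5-byte big-endian offset and 4-byte size.
            let mut off = 0u64;
            for &b in &e[8..13] {
                off = (off << 8) | b as u64;
            }
            let mut s = [0u8; 4];
            s.copy_from_slice(&e[13..17]);
            return Some((off, i32::from_be_bytes(s)));
        } else if key < needle_id {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    None
}

fn main() {
    // Build a sorted index with keys 1, 5, 10.
    let mut index = Vec::new();
    for (key, off, size) in [(1u64, 8u64, 100i32), (5, 200, 200), (10, 500, 300)] {
        let mut e = [0u8; ENTRY_SIZE];
        e[..8].copy_from_slice(&key.to_be_bytes());
        e[8..13].copy_from_slice(&off.to_be_bytes()[3..8]); // low 5 bytes
        e[13..17].copy_from_slice(&size.to_be_bytes());
        index.extend_from_slice(&e);
    }
    assert_eq!(find_entry(&index, 5), Some((200, 200)));
    assert_eq!(find_entry(&index, 7), None);
    println!("ok");
}
```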
16
seaweed-volume/src/storage/erasure_coding/mod.rs
Normal file
@@ -0,0 +1,16 @@
//! Erasure coding module for volume data protection.
//!
//! Encodes a volume's .dat file into 10 data + 4 parity shards using
//! Reed-Solomon erasure coding. Can reconstruct from any 10 of 14 shards.

pub mod ec_decoder;
pub mod ec_encoder;
pub mod ec_locate;
pub mod ec_shard;
pub mod ec_volume;

pub use ec_shard::{
    EcVolumeShard, ShardId, DATA_SHARDS_COUNT, MAX_SHARD_COUNT, MIN_TOTAL_DISKS,
    PARITY_SHARDS_COUNT, TOTAL_SHARDS_COUNT,
};
pub use ec_volume::EcVolume;
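The striping behind `ec_locate::locate_data` can be pictured with a simplified model: data blocks are laid out row-major across the data shards, so a logical .dat offset maps to a (shard, offset-within-shard) pair by block arithmetic. The real SeaweedFS scheme uses a large-block/small-block split; this sketch assumes a single fixed block size, and `to_shard` is an illustrative helper, not the actual `locate_data` signature.

```rust
// Simplified sketch of mapping a logical volume offset onto EC shards,
// assuming row-major striping with one fixed BLOCK_SIZE (the real code
// also handles a smaller tail block size).
const BLOCK_SIZE: u64 = 1024;

/// Map a logical volume offset to (shard_index, offset_within_shard).
fn to_shard(offset: u64, data_shards: u64) -> (u64, u64) {
    let row = offset / (BLOCK_SIZE * data_shards); // which stripe row
    let col = (offset / BLOCK_SIZE) % data_shards; // which shard in the row
    let within = offset % BLOCK_SIZE;              // offset inside the block
    (col, row * BLOCK_SIZE + within)
}

fn main() {
    // Offset 0 lands at the start of shard 0.
    assert_eq!(to_shard(0, 10), (0, 0));
    // One full block later we move to shard 1, same shard offset.
    assert_eq!(to_shard(1024, 10), (1, 0));
    // After a full stripe (10 blocks) we wrap to shard 0, next block row.
    assert_eq!(to_shard(10 * 1024 + 5, 10), (0, 1024 + 5));
    println!("ok");
}
```

A read that crosses a block boundary therefore decomposes into multiple intervals on different shards, which is why `read_ec_shard_needle` concatenates per-interval reads.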
116
seaweed-volume/src/storage/idx/mod.rs
Normal file
@@ -0,0 +1,116 @@
|
//! Index file (.idx) format: sequential 17-byte entries.
//!
//! Each entry: NeedleId(8) + Offset(5) + Size(4) = 17 bytes.

use crate::storage::types::*;
use std::io::{self, Read, Seek, SeekFrom};

const ROWS_TO_READ: usize = 1024;

/// Walk all entries in an .idx file, calling `f` for each.
/// Mirrors Go's `WalkIndexFile()`.
pub fn walk_index_file<R, F>(reader: &mut R, start_from: u64, mut f: F) -> io::Result<()>
where
    R: Read + Seek,
    F: FnMut(NeedleId, Offset, Size) -> io::Result<()>,
{
    let reader_offset = start_from * NEEDLE_MAP_ENTRY_SIZE as u64;
    reader.seek(SeekFrom::Start(reader_offset))?;

    let mut buf = vec![0u8; NEEDLE_MAP_ENTRY_SIZE * ROWS_TO_READ];

    loop {
        let count = match reader.read(&mut buf) {
            Ok(0) => return Ok(()),
            Ok(n) => n,
            Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(()),
            Err(e) => return Err(e),
        };

        let mut i = 0;
        while i + NEEDLE_MAP_ENTRY_SIZE <= count {
            let (key, offset, size) = idx_entry_from_bytes(&buf[i..i + NEEDLE_MAP_ENTRY_SIZE]);
            f(key, offset, size)?;
            i += NEEDLE_MAP_ENTRY_SIZE;
        }
    }
}

/// Write a single index entry to a writer.
pub fn write_index_entry<W: io::Write>(
    writer: &mut W,
    key: NeedleId,
    offset: Offset,
    size: Size,
) -> io::Result<()> {
    let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
    idx_entry_to_bytes(&mut buf, key, offset, size);
    writer.write_all(&buf)
}

#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;

    #[test]
    fn test_walk_index_file() {
        // Create a small index with 3 entries
        let mut data = Vec::new();
        let entries = vec![
            (NeedleId(1), Offset::from_actual_offset(0), Size(100)),
            (NeedleId(2), Offset::from_actual_offset(128), Size(200)),
            (NeedleId(3), Offset::from_actual_offset(384), Size(300)),
        ];
        for (key, offset, size) in &entries {
            let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
            idx_entry_to_bytes(&mut buf, *key, *offset, *size);
            data.extend_from_slice(&buf);
        }

        let mut cursor = Cursor::new(data);
        let mut collected = Vec::new();
        walk_index_file(&mut cursor, 0, |key, offset, size| {
            collected.push((key, offset.to_actual_offset(), size));
            Ok(())
        })
        .unwrap();

        assert_eq!(collected.len(), 3);
        assert_eq!(collected[0].0, NeedleId(1));
        assert_eq!(collected[0].1, 0);
        assert_eq!(collected[0].2, Size(100));
        assert_eq!(collected[1].0, NeedleId(2));
        assert_eq!(collected[2].0, NeedleId(3));
    }

    #[test]
    fn test_walk_empty() {
        let mut cursor = Cursor::new(Vec::new());
        let mut count = 0;
        walk_index_file(&mut cursor, 0, |_, _, _| {
            count += 1;
            Ok(())
        })
        .unwrap();
        assert_eq!(count, 0);
    }

    #[test]
    fn test_write_index_entry() {
        let mut buf = Vec::new();
        write_index_entry(
            &mut buf,
            NeedleId(42),
            Offset::from_actual_offset(8 * 10),
            Size(512),
        )
        .unwrap();
        assert_eq!(buf.len(), NEEDLE_MAP_ENTRY_SIZE);

        let (key, offset, size) = idx_entry_from_bytes(&buf);
        assert_eq!(key, NeedleId(42));
        assert_eq!(offset.to_actual_offset(), 80);
        assert_eq!(size, Size(512));
    }
}
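As a standalone illustration of the 17-byte entry layout named in the module doc, here is a stdlib-only sketch that packs and unpacks NeedleId(8) + Offset(5) + Size(4), all big-endian. The field widths come from the doc comment above; the crate's own `idx_entry_to_bytes`/`idx_entry_from_bytes` and its `Offset` type (which may store offsets in padded units, per `from_actual_offset`) remain the authoritative encoding.

```rust
// Sketch of the 17-byte .idx entry layout: NeedleId(8) + Offset(5) + Size(4),
// all big-endian. Helper names here are illustrative, not from the crate.

const ENTRY_SIZE: usize = 17;

fn pack_entry(key: u64, offset: u64, size: u32) -> [u8; ENTRY_SIZE] {
    let mut buf = [0u8; ENTRY_SIZE];
    buf[0..8].copy_from_slice(&key.to_be_bytes());
    // Offset occupies 5 bytes: the low 5 bytes of a big-endian u64.
    buf[8..13].copy_from_slice(&offset.to_be_bytes()[3..8]);
    buf[13..17].copy_from_slice(&size.to_be_bytes());
    buf
}

fn unpack_entry(buf: &[u8; ENTRY_SIZE]) -> (u64, u64, u32) {
    let key = u64::from_be_bytes(buf[0..8].try_into().unwrap());
    let mut off = [0u8; 8];
    off[3..8].copy_from_slice(&buf[8..13]);
    let offset = u64::from_be_bytes(off);
    let size = u32::from_be_bytes(buf[13..17].try_into().unwrap());
    (key, offset, size)
}

fn main() {
    // Round-trip one entry, including an offset wider than 32 bits.
    let entry = pack_entry(42, 0x1_0000_0000, 512);
    assert_eq!(unpack_entry(&entry), (42, 0x1_0000_0000, 512));
    // An .idx file with N entries is exactly N * 17 bytes; entry i starts at i * 17,
    // which is why walk_index_file seeks to start_from * NEEDLE_MAP_ENTRY_SIZE.
    assert_eq!(3 * ENTRY_SIZE, 51);
}
```

This is also why `ROWS_TO_READ * NEEDLE_MAP_ENTRY_SIZE` is a natural buffer size: every full buffer holds a whole number of entries.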

seaweed-volume/src/storage/mod.rs (new file)
@@ -0,0 +1,9 @@
pub mod disk_location;
pub mod erasure_coding;
pub mod idx;
pub mod needle;
pub mod needle_map;
pub mod store;
pub mod super_block;
pub mod types;
pub mod volume;

seaweed-volume/src/storage/needle/crc.rs (new file)
@@ -0,0 +1,73 @@
//! CRC32-Castagnoli checksum for needle data integrity.
//!
//! Matches Go's `crc32.MakeTable(crc32.Castagnoli)` exactly.
//! The CRC is stored as raw u32 (not the `.Value()` legacy transform).

/// CRC32-Castagnoli checksum wrapper.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct CRC(pub u32);

impl CRC {
    /// Compute CRC from a byte slice (starting from 0).
    pub fn new(data: &[u8]) -> Self {
        CRC(0).update(data)
    }

    /// Update the CRC with additional bytes.
    pub fn update(self, data: &[u8]) -> Self {
        CRC(crc32c::crc32c_append(self.0, data))
    }

    /// Legacy `.Value()` function — deprecated in Go but needed for backward compat check.
    /// Formula: (crc >> 15 | crc << 17) + 0xa282ead8
    pub fn legacy_value(&self) -> u32 {
        (self.0 >> 15 | self.0 << 17).wrapping_add(0xa282ead8)
    }
}

impl From<u32> for CRC {
    fn from(v: u32) -> Self {
        CRC(v)
    }
}

impl From<CRC> for u32 {
    fn from(c: CRC) -> Self {
        c.0
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_crc_empty() {
        let crc = CRC::new(&[]);
        assert_eq!(crc.0, 0);
    }

    #[test]
    fn test_crc_known_value() {
        // CRC32-C of "hello" — verify it produces a non-zero deterministic value
        let crc = CRC::new(b"hello");
        assert_ne!(crc.0, 0);
        // Same input produces same output
        assert_eq!(crc, CRC::new(b"hello"));
    }

    #[test]
    fn test_crc_incremental() {
        let crc1 = CRC::new(b"hello world");
        let crc2 = CRC::new(b"hello").update(b" world");
        assert_eq!(crc1, crc2);
    }

    #[test]
    fn test_crc_legacy_value() {
        let crc = CRC(0x12345678);
        let v = crc.legacy_value();
        let expected = (0x12345678u32 >> 15 | 0x12345678u32 << 17).wrapping_add(0xa282ead8);
        assert_eq!(v, expected);
    }
}
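The deprecated Go `.Value()` transform mirrored by `legacy_value` is plain integer arithmetic, so it can be demonstrated without the `crc32c` crate. A small observation: `crc >> 15 | crc << 17` on a `u32` is exactly a 17-bit left rotation.

```rust
// The legacy Go crc32.Value() transform, standalone: rotate the raw CRC32-C
// left by 17 bits, then add the constant 0xa282ead8 with wrapping.

fn legacy_value(crc: u32) -> u32 {
    (crc >> 15 | crc << 17).wrapping_add(0xa282ead8)
}

fn main() {
    // (crc >> 15 | crc << 17) is equivalent to u32::rotate_left(17).
    assert_eq!(
        0x12345678u32 >> 15 | 0x12345678u32 << 17,
        0x12345678u32.rotate_left(17)
    );
    // With crc = 0 only the additive constant remains.
    assert_eq!(legacy_value(0), 0xa282ead8);
}
```

This is why `read_tail` can accept either the raw CRC or the legacy-transformed one: the transform is deterministic, so comparing against both covers data written by older Go versions.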

seaweed-volume/src/storage/needle/mod.rs (new file)
@@ -0,0 +1,7 @@
pub mod crc;
pub mod needle;
pub mod ttl;

pub use crc::CRC;
pub use needle::Needle;
pub use ttl::TTL;

seaweed-volume/src/storage/needle/needle.rs (new file)
@@ -0,0 +1,944 @@
//! Needle: the individual file object stored in a volume.
//!
//! Binary format (Version 2/3):
//!   Header (16 bytes): Cookie(4) + NeedleId(8) + Size(4)
//!   Body (Size bytes):
//!     DataSize(4) + Data(DataSize) + Flags(1)
//!     [if HasName]: NameSize(1) + Name(NameSize)
//!     [if HasMime]: MimeSize(1) + Mime(MimeSize)
//!     [if HasLastMod]: LastModified(5)
//!     [if HasTtl]: TTL(2)
//!     [if HasPairs]: PairsSize(2) + Pairs(PairsSize)
//!   Tail:
//!     Checksum(4) + [if V3: AppendAtNs(8)] + Padding(1-8, to 8-byte alignment)

use super::crc::CRC;
use super::ttl::TTL;
use crate::storage::types::*;

// Flag bits (matching Go constants)
pub const FLAG_IS_COMPRESSED: u8 = 0x01;
pub const FLAG_HAS_NAME: u8 = 0x02;
pub const FLAG_HAS_MIME: u8 = 0x04;
pub const FLAG_HAS_LAST_MODIFIED_DATE: u8 = 0x08;
pub const FLAG_HAS_TTL: u8 = 0x10;
pub const FLAG_HAS_PAIRS: u8 = 0x20;
pub const FLAG_IS_CHUNK_MANIFEST: u8 = 0x80;

pub const LAST_MODIFIED_BYTES_LENGTH: usize = 5;
pub const TTL_BYTES_LENGTH: usize = 2;

#[derive(Debug, Clone, Default)]
pub struct Needle {
    pub cookie: Cookie,
    pub id: NeedleId,
    pub size: Size, // sum of body content fields

    // Version 2+ fields
    pub data_size: u32,
    pub data: Vec<u8>,
    pub flags: u8,
    pub name_size: u8,
    pub name: Vec<u8>, // max 255 bytes
    pub mime_size: u8,
    pub mime: Vec<u8>, // max 255 bytes
    pub pairs_size: u16,
    pub pairs: Vec<u8>, // max 64KB, JSON
    pub last_modified: u64, // stored as 5 bytes on disk
    pub ttl: Option<TTL>,

    // Tail fields
    pub checksum: CRC,
    pub append_at_ns: u64, // Version 3 only
    pub padding: Vec<u8>,
}

impl Needle {
    // ---- Flag accessors (matching Go) ----

    pub fn is_compressed(&self) -> bool {
        self.flags & FLAG_IS_COMPRESSED != 0
    }
    pub fn set_is_compressed(&mut self) {
        self.flags |= FLAG_IS_COMPRESSED;
    }

    pub fn has_name(&self) -> bool {
        self.flags & FLAG_HAS_NAME != 0
    }
    pub fn set_has_name(&mut self) {
        self.flags |= FLAG_HAS_NAME;
    }

    pub fn has_mime(&self) -> bool {
        self.flags & FLAG_HAS_MIME != 0
    }
    pub fn set_has_mime(&mut self) {
        self.flags |= FLAG_HAS_MIME;
    }

    pub fn has_last_modified_date(&self) -> bool {
        self.flags & FLAG_HAS_LAST_MODIFIED_DATE != 0
    }
    pub fn set_has_last_modified_date(&mut self) {
        self.flags |= FLAG_HAS_LAST_MODIFIED_DATE;
    }

    pub fn has_ttl(&self) -> bool {
        self.flags & FLAG_HAS_TTL != 0
    }
    pub fn set_has_ttl(&mut self) {
        self.flags |= FLAG_HAS_TTL;
    }

    pub fn has_pairs(&self) -> bool {
        self.flags & FLAG_HAS_PAIRS != 0
    }
    pub fn set_has_pairs(&mut self) {
        self.flags |= FLAG_HAS_PAIRS;
    }

    pub fn is_chunk_manifest(&self) -> bool {
        self.flags & FLAG_IS_CHUNK_MANIFEST != 0
    }
    pub fn set_is_chunk_manifest(&mut self) {
        self.flags |= FLAG_IS_CHUNK_MANIFEST;
    }

    // ---- Header parsing ----

    /// Parse the 16-byte needle header.
    pub fn parse_header(bytes: &[u8]) -> (Cookie, NeedleId, Size) {
        assert!(bytes.len() >= NEEDLE_HEADER_SIZE);
        let cookie = Cookie::from_bytes(&bytes[0..COOKIE_SIZE]);
        let id = NeedleId::from_bytes(&bytes[COOKIE_SIZE..COOKIE_SIZE + NEEDLE_ID_SIZE]);
        let size = Size::from_bytes(&bytes[COOKIE_SIZE + NEEDLE_ID_SIZE..NEEDLE_HEADER_SIZE]);
        (cookie, id, size)
    }

    /// Parse needle header into self.
    pub fn read_header(&mut self, bytes: &[u8]) {
        let (cookie, id, size) = Self::parse_header(bytes);
        self.cookie = cookie;
        self.id = id;
        self.size = size;
    }

    // ---- Body reading (Version 2/3) ----

    /// Read version 2/3 body metadata only — skips copying the data payload.
    /// Sets `data_size` and all metadata fields but leaves `data` empty.
    pub fn read_body_v2_meta_only(&mut self, bytes: &[u8]) -> Result<(), NeedleError> {
        let len_bytes = bytes.len();
        let mut index = 0;

        // DataSize (4 bytes)
        if index + 4 > len_bytes {
            return Err(NeedleError::IndexOutOfRange(1));
        }
        self.data_size = u32::from_be_bytes([
            bytes[index],
            bytes[index + 1],
            bytes[index + 2],
            bytes[index + 3],
        ]);
        index += 4;

        // Skip data bytes (do NOT copy them)
        if index + self.data_size as usize > len_bytes {
            return Err(NeedleError::IndexOutOfRange(1));
        }
        index += self.data_size as usize;

        // Read non-data metadata
        self.read_body_v2_non_data(&bytes[index..])?;
        Ok(())
    }

    /// Read full needle from bytes but skip copying the data payload.
    /// Sets all metadata fields, checksum, etc. but leaves `data` empty.
    pub fn read_bytes_meta_only(
        &mut self,
        bytes: &[u8],
        offset: i64,
        expected_size: Size,
        version: Version,
    ) -> Result<(), NeedleError> {
        self.read_header(bytes);

        if self.size != expected_size {
            return Err(NeedleError::SizeMismatch {
                offset,
                id: self.id,
                found: self.size,
                expected: expected_size,
            });
        }

        let body_start = NEEDLE_HEADER_SIZE;
        let body_end = body_start + self.size.0 as usize;

        if version == VERSION_1 {
            // V1 has no metadata — data is the entire body
            self.data_size = self.size.0 as u32;
        } else if self.size.0 == 0 {
            // Tombstones have no DataSize/body section; metadata starts at the tail.
            self.data_size = 0;
        } else {
            self.read_body_v2_meta_only(&bytes[body_start..body_end])?;
        }

        // Read tail but skip CRC validation (no data to check against)
        self.read_tail_meta_only(&bytes[body_end..], version)?;
        Ok(())
    }

    /// Paged meta-only parse: accepts the 20-byte header+DataSize prefix and the
    /// meta tail bytes (everything after the data payload). This avoids reading
    /// the data payload from disk at all, matching Go's `ReadNeedleMeta`.
    pub fn read_paged_meta(
        &mut self,
        header_bytes: &[u8], // first 20 bytes: NEEDLE_HEADER_SIZE + DATA_SIZE_SIZE
        meta_bytes: &[u8],   // tail: non-data body metadata + checksum + timestamp + padding
        offset: i64,
        expected_size: Size,
        version: Version,
    ) -> Result<(), NeedleError> {
        // Parse the 16-byte header
        self.read_header(header_bytes);

        if self.size != expected_size {
            return Err(NeedleError::SizeMismatch {
                offset,
                id: self.id,
                found: self.size,
                expected: expected_size,
            });
        }

        if version == VERSION_1 {
            self.data_size = self.size.0 as u32;
        } else if self.size.0 == 0 {
            // Tombstone
            self.data_size = 0;
        } else {
            // Extract DataSize from bytes 16..20
            self.data_size = u32::from_be_bytes([
                header_bytes[NEEDLE_HEADER_SIZE],
                header_bytes[NEEDLE_HEADER_SIZE + 1],
                header_bytes[NEEDLE_HEADER_SIZE + 2],
                header_bytes[NEEDLE_HEADER_SIZE + 3],
            ]);

            // meta_bytes starts with the non-data body metadata (flags, name, mime, etc.)
            // followed by the tail (checksum + timestamp + padding).
            // readNeedleDataVersion2NonData returns the index where it stopped.
            let index = self.read_body_v2_non_data(meta_bytes)?;
            self.read_tail_meta_only(&meta_bytes[index..], version)?;
            return Ok(());
        }

        // For VERSION_1 or tombstones, meta_bytes IS the tail
        self.read_tail_meta_only(meta_bytes, version)?;
        Ok(())
    }

    /// Read tail without CRC validation (used when data was not read).
    fn read_tail_meta_only(
        &mut self,
        tail_bytes: &[u8],
        version: Version,
    ) -> Result<(), NeedleError> {
        if tail_bytes.len() < NEEDLE_CHECKSUM_SIZE {
            return Err(NeedleError::TailTooShort);
        }

        self.checksum = CRC(u32::from_be_bytes([
            tail_bytes[0],
            tail_bytes[1],
            tail_bytes[2],
            tail_bytes[3],
        ]));

        if version == VERSION_3 {
            let ts_offset = NEEDLE_CHECKSUM_SIZE;
            if tail_bytes.len() < ts_offset + TIMESTAMP_SIZE {
                return Err(NeedleError::TailTooShort);
            }
            self.append_at_ns = u64::from_be_bytes([
                tail_bytes[ts_offset],
                tail_bytes[ts_offset + 1],
                tail_bytes[ts_offset + 2],
                tail_bytes[ts_offset + 3],
                tail_bytes[ts_offset + 4],
                tail_bytes[ts_offset + 5],
                tail_bytes[ts_offset + 6],
                tail_bytes[ts_offset + 7],
            ]);
        }

        Ok(())
    }

    /// Read the version 2/3 body data from bytes (size bytes starting after header).
    /// Returns IndexOutOfRange errors for truncated data (matching Go's readNeedleDataVersion2).
    pub fn read_body_v2(&mut self, bytes: &[u8]) -> Result<(), NeedleError> {
        let len_bytes = bytes.len();
        let mut index = 0;

        // DataSize (4 bytes)
        if index + 4 > len_bytes {
            return Ok(()); // tolerate EOF
        }
        self.data_size = u32::from_be_bytes([
            bytes[index],
            bytes[index + 1],
            bytes[index + 2],
            bytes[index + 3],
        ]);
        index += 4;

        // Data
        if index + self.data_size as usize > len_bytes {
            return Err(NeedleError::IndexOutOfRange(1));
        }
        self.data = bytes[index..index + self.data_size as usize].to_vec();
        index += self.data_size as usize;

        // Read non-data metadata
        self.read_body_v2_non_data(&bytes[index..])?;
        Ok(())
    }

    /// Read version 2/3 metadata fields (everything after Data).
    /// Returns IndexOutOfRange errors for truncated data (matching Go's readNeedleDataVersion2).
    fn read_body_v2_non_data(&mut self, bytes: &[u8]) -> Result<usize, NeedleError> {
        let len_bytes = bytes.len();
        let mut index = 0;

        // Flags (1 byte)
        if index < len_bytes {
            self.flags = bytes[index];
            index += 1;
        } else {
            return Ok(index);
        }

        // Name
        if index < len_bytes && self.has_name() {
            self.name_size = bytes[index];
            index += 1;
            if index + self.name_size as usize > len_bytes {
                return Err(NeedleError::IndexOutOfRange(2));
            }
            self.name = bytes[index..index + self.name_size as usize].to_vec();
            index += self.name_size as usize;
        }

        // Mime
        if index < len_bytes && self.has_mime() {
            self.mime_size = bytes[index];
            index += 1;
            if index + self.mime_size as usize > len_bytes {
                return Err(NeedleError::IndexOutOfRange(3));
            }
            self.mime = bytes[index..index + self.mime_size as usize].to_vec();
            index += self.mime_size as usize;
        }

        // LastModified (5 bytes)
        if index < len_bytes && self.has_last_modified_date() {
            if index + LAST_MODIFIED_BYTES_LENGTH > len_bytes {
                return Err(NeedleError::IndexOutOfRange(4));
            }
            self.last_modified = bytes_to_u64_5(&bytes[index..index + LAST_MODIFIED_BYTES_LENGTH]);
            index += LAST_MODIFIED_BYTES_LENGTH;
        }

        // TTL (2 bytes)
        if index < len_bytes && self.has_ttl() {
            if index + TTL_BYTES_LENGTH > len_bytes {
                return Err(NeedleError::IndexOutOfRange(5));
            }
            self.ttl = Some(TTL::from_bytes(&bytes[index..index + TTL_BYTES_LENGTH]));
            index += TTL_BYTES_LENGTH;
        }

        // Pairs
        if index < len_bytes && self.has_pairs() {
            if index + 2 > len_bytes {
                return Err(NeedleError::IndexOutOfRange(6));
            }
            self.pairs_size = u16::from_be_bytes([bytes[index], bytes[index + 1]]);
            index += 2;
            if index + self.pairs_size as usize > len_bytes {
                return Err(NeedleError::IndexOutOfRange(7));
            }
            self.pairs = bytes[index..index + self.pairs_size as usize].to_vec();
            index += self.pairs_size as usize;
        }

        Ok(index)
    }

    // ---- Tail reading ----

    /// Read the needle tail (checksum + optional timestamp + padding).
    pub fn read_tail(&mut self, tail_bytes: &[u8], version: Version) -> Result<(), NeedleError> {
        if tail_bytes.len() < NEEDLE_CHECKSUM_SIZE {
            return Err(NeedleError::TailTooShort);
        }

        let expected_checksum = CRC(u32::from_be_bytes([
            tail_bytes[0],
            tail_bytes[1],
            tail_bytes[2],
            tail_bytes[3],
        ]));

        if !self.data.is_empty() {
            let data_checksum = CRC::new(&self.data);
            // Go double-checks: n.Checksum != crc && uint32(n.Checksum) != crc.Value()
            // The crc.Value() path is a deprecated legacy transform for backward compat
            // with seaweed versions prior to commit 056c480eb.
            if expected_checksum != data_checksum
                && expected_checksum.0 != data_checksum.legacy_value()
            {
                return Err(NeedleError::CrcMismatch {
                    needle_id: self.id,
                    got: data_checksum.0,
                    want: expected_checksum.0,
                });
            }
            self.checksum = data_checksum;
        } else {
            self.checksum = expected_checksum;
        }

        if version == VERSION_3 {
            let ts_offset = NEEDLE_CHECKSUM_SIZE;
            if tail_bytes.len() < ts_offset + TIMESTAMP_SIZE {
                return Err(NeedleError::TailTooShort);
            }
            self.append_at_ns = u64::from_be_bytes([
                tail_bytes[ts_offset],
                tail_bytes[ts_offset + 1],
                tail_bytes[ts_offset + 2],
                tail_bytes[ts_offset + 3],
                tail_bytes[ts_offset + 4],
                tail_bytes[ts_offset + 5],
                tail_bytes[ts_offset + 6],
                tail_bytes[ts_offset + 7],
            ]);
        }

        Ok(())
    }

    // ---- Full read from bytes ----

    /// Read a complete needle from its raw bytes (header + body + tail).
    pub fn read_bytes(
        &mut self,
        bytes: &[u8],
        offset: i64,
        expected_size: Size,
        version: Version,
    ) -> Result<(), NeedleError> {
        self.read_header(bytes);

        if self.size != expected_size {
            return Err(NeedleError::SizeMismatch {
                offset,
                id: self.id,
                found: self.size,
                expected: expected_size,
            });
        }

        let body_start = NEEDLE_HEADER_SIZE;
        let body_end = body_start + self.size.0 as usize;

        if version == VERSION_1 {
            self.data = bytes[body_start..body_end].to_vec();
        } else {
            self.read_body_v2(&bytes[body_start..body_end])?;
        }

        self.read_tail(&bytes[body_end..], version)?;
        Ok(())
    }

    // ---- Write (serialize) ----

    /// Serialize the needle to bytes for writing to a .dat file (Version 2/3).
    pub fn write_bytes(&mut self, version: Version) -> Vec<u8> {
        let mut buf = Vec::with_capacity(256);

        // Compute sizes (matching Go writeNeedleCommon)
        if self.name.len() >= 255 {
            self.name_size = 255;
        } else {
            self.name_size = self.name.len() as u8;
        }
        self.data_size = self.data.len() as u32;
        self.mime_size = self.mime.len() as u8;

        // Compute n.Size (body size, excluding header)
        if self.data_size > 0 {
            let mut s: i32 = 4 + self.data_size as i32 + 1; // DataSize + Data + Flags
            if self.has_name() {
                s += 1 + self.name_size as i32;
            }
            if self.has_mime() {
                s += 1 + self.mime_size as i32;
            }
            if self.has_last_modified_date() {
                s += LAST_MODIFIED_BYTES_LENGTH as i32;
            }
            if self.has_ttl() {
                s += TTL_BYTES_LENGTH as i32;
            }
            if self.has_pairs() {
                s += 2 + self.pairs_size as i32;
            }
            self.size = Size(s);
        } else {
            self.size = Size(0);
        }

        // Header: Cookie(4) + NeedleId(8) + Size(4) = 16 bytes
        let mut header = [0u8; NEEDLE_HEADER_SIZE];
        self.cookie.to_bytes(&mut header[0..COOKIE_SIZE]);
        self.id
            .to_bytes(&mut header[COOKIE_SIZE..COOKIE_SIZE + NEEDLE_ID_SIZE]);
        self.size
            .to_bytes(&mut header[COOKIE_SIZE + NEEDLE_ID_SIZE..NEEDLE_HEADER_SIZE]);
        buf.extend_from_slice(&header);

        // Body
        if self.data_size > 0 {
            buf.extend_from_slice(&self.data_size.to_be_bytes());
            buf.extend_from_slice(&self.data);
            buf.push(self.flags);
            if self.has_name() {
                buf.push(self.name_size);
                buf.extend_from_slice(&self.name[..self.name_size as usize]);
            }
            if self.has_mime() {
                buf.push(self.mime_size);
                buf.extend_from_slice(&self.mime);
            }
            if self.has_last_modified_date() {
                // Write 5 bytes of last_modified (lower 5 bytes of u64 big-endian)
                let lm_bytes = self.last_modified.to_be_bytes();
                buf.extend_from_slice(&lm_bytes[8 - LAST_MODIFIED_BYTES_LENGTH..8]);
            }
            if self.has_ttl() {
                if let Some(ref ttl) = self.ttl {
                    let mut ttl_buf = [0u8; 2];
                    ttl.to_bytes(&mut ttl_buf);
                    buf.extend_from_slice(&ttl_buf);
                } else {
                    buf.extend_from_slice(&[0u8; 2]);
                }
            }
            if self.has_pairs() {
                buf.extend_from_slice(&self.pairs_size.to_be_bytes());
                buf.extend_from_slice(&self.pairs);
            }
        }

        // Compute checksum
        self.checksum = CRC::new(&self.data);

        // Tail: Checksum + [V3: AppendAtNs] + Padding
        buf.extend_from_slice(&self.checksum.0.to_be_bytes());
        if version == VERSION_3 {
            buf.extend_from_slice(&self.append_at_ns.to_be_bytes());
        }

        // Padding to 8-byte alignment
        let padding = padding_length(self.size, version).0 as usize;
        buf.extend(std::iter::repeat(0u8).take(padding));

        buf
    }

    /// Total disk size of this needle including header, body, checksum, timestamp, and padding.
    pub fn disk_size(&self, version: Version) -> i64 {
        get_actual_size(self.size, version)
    }

    /// Compute ETag string from checksum (matching Go).
    pub fn etag(&self) -> String {
        etag_from_checksum(self.checksum.0)
    }
}

// ============================================================================
// Helper functions (matching Go)
// ============================================================================

/// Compute padding to align needle to NEEDLE_PADDING_SIZE (8 bytes).
pub fn padding_length(needle_size: Size, version: Version) -> Size {
    if version == VERSION_3 {
        Size(
            NEEDLE_PADDING_SIZE as i32
                - ((NEEDLE_HEADER_SIZE as i32
                    + needle_size.0
                    + NEEDLE_CHECKSUM_SIZE as i32
                    + TIMESTAMP_SIZE as i32)
                    % NEEDLE_PADDING_SIZE as i32),
        )
    } else {
        Size(
            NEEDLE_PADDING_SIZE as i32
                - ((NEEDLE_HEADER_SIZE as i32 + needle_size.0 + NEEDLE_CHECKSUM_SIZE as i32)
                    % NEEDLE_PADDING_SIZE as i32),
        )
    }
}

/// Body length = Size + Checksum + [Timestamp] + Padding.
pub fn needle_body_length(needle_size: Size, version: Version) -> i64 {
    if version == VERSION_3 {
        needle_size.0 as i64
            + NEEDLE_CHECKSUM_SIZE as i64
            + TIMESTAMP_SIZE as i64
            + padding_length(needle_size, version).0 as i64
    } else {
        needle_size.0 as i64
            + NEEDLE_CHECKSUM_SIZE as i64
            + padding_length(needle_size, version).0 as i64
    }
}

/// Total actual size on disk: Header + Body.
pub fn get_actual_size(size: Size, version: Version) -> i64 {
    NEEDLE_HEADER_SIZE as i64 + needle_body_length(size, version)
}

/// Read 5 bytes as a u64 (big-endian, zero-padded high bytes).
fn bytes_to_u64_5(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 5);
    // The 5 bytes are the LOWER 5 bytes of a big-endian u64.
    // In Go: util.BytesToUint64(bytes[index : index+5]) reads into a uint64;
    // Go's BytesToUint64 copies into the LAST 5 bytes of an 8-byte array (big-endian).
    let mut buf = [0u8; 8];
    buf[3..8].copy_from_slice(&bytes[..5]);
    u64::from_be_bytes(buf)
}

/// ETag formatted as Go: hex of big-endian u32 bytes.
pub fn etag_from_checksum(checksum: u32) -> String {
    let bits = checksum.to_be_bytes();
    format!(
        "{:02x}{:02x}{:02x}{:02x}",
        bits[0], bits[1], bits[2], bits[3]
    )
}
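A worked example of the padding arithmetic above for Version 3, with the constants (Header=16, Checksum=4, Timestamp=8, padding unit=8) inlined as plain integers rather than imported from the crate. One subtlety the formula encodes: it returns a value in 1..=8, never 0, so an already-aligned needle still receives a full 8 bytes of padding.

```rust
// Version-3 padding and total-on-disk size, with the crate constants inlined.
// Mirrors padding_length()/get_actual_size() above for the VERSION_3 branch.

const HEADER: i64 = 16;
const CHECKSUM: i64 = 4;
const TIMESTAMP: i64 = 8;
const PAD: i64 = 8;

fn padding_v3(needle_size: i64) -> i64 {
    PAD - ((HEADER + needle_size + CHECKSUM + TIMESTAMP) % PAD)
}

fn actual_size_v3(needle_size: i64) -> i64 {
    HEADER + needle_size + CHECKSUM + TIMESTAMP + padding_v3(needle_size)
}

fn main() {
    // size = 4: 16 + 4 + 4 + 8 = 32, already 8-aligned, yet padding is 8 (never 0).
    assert_eq!(padding_v3(4), 8);
    // size = 5: 33 % 8 = 1, so padding is 7 and the total rounds up to 40.
    assert_eq!(padding_v3(5), 7);
    assert_eq!(actual_size_v3(5), 40);
    // Every total is a multiple of the 8-byte padding unit.
    for size in 0..64 {
        assert_eq!(actual_size_v3(size) % PAD, 0);
    }
}
```

The same quirk exists in the Go formula this mirrors, so on-disk layouts stay byte-identical between the two implementations.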

// ============================================================================
// FileId
// ============================================================================

/// FileId = VolumeId + NeedleId + Cookie.
/// String format: "<volume_id>,<needle_id_hex><cookie_hex>"
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileId {
    pub volume_id: VolumeId,
    pub key: NeedleId,
    pub cookie: Cookie,
}

impl FileId {
    pub fn new(volume_id: VolumeId, key: NeedleId, cookie: Cookie) -> Self {
        FileId {
            volume_id,
            key,
            cookie,
        }
    }

    /// Parse "volume_id,needle_id_cookie" or "volume_id/needle_id_cookie".
    pub fn parse(s: &str) -> Result<Self, String> {
        let (vid_str, rest) = if let Some(pos) = s.find(',') {
            (&s[..pos], &s[pos + 1..])
        } else if let Some(pos) = s.find('/') {
            (&s[..pos], &s[pos + 1..])
        } else {
            return Err(format!("invalid file id: {}", s));
        };

        let volume_id =
            VolumeId::parse(vid_str).map_err(|e| format!("invalid volume id: {}", e))?;
        let (key, cookie) = parse_needle_id_cookie(rest)?;
        Ok(FileId {
            volume_id,
            key,
            cookie,
        })
    }

    /// Format the needle_id + cookie part as a hex string (stripping leading zeros).
    pub fn needle_id_cookie_string(&self) -> String {
        format_needle_id_cookie(self.key, self.cookie)
    }
}

impl std::fmt::Display for FileId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{},{}", self.volume_id, self.needle_id_cookie_string())
    }
}

/// Format NeedleId + Cookie as hex, stripping leading zero bytes from NeedleId only.
/// Matches Go: strips leading zero bytes up to NeedleIdSize (8), so the cookie is always present.
fn format_needle_id_cookie(key: NeedleId, cookie: Cookie) -> String {
    // Encode 12 bytes: 8 for NeedleId + 4 for Cookie
    let mut bytes = [0u8; 12];
    key.to_bytes(&mut bytes[0..8]);
    cookie.to_bytes(&mut bytes[8..12]);

    // Strip leading zero bytes, but only within the NeedleId portion (first 8 bytes)
    let mut nonzero_index = 0;
    while nonzero_index < NEEDLE_ID_SIZE && bytes[nonzero_index] == 0 {
        nonzero_index += 1;
    }
    hex::encode(&bytes[nonzero_index..])
}

/// Parse "needle_id_cookie_hex" or "needle_id_cookie_hex_delta" into (NeedleId, Cookie).
/// Matches Go's ParsePath + ParseNeedleIdCookie: supports an optional `_delta` suffix
/// where delta is a decimal number added to the NeedleId (used for sub-file addressing).
/// Rejects strings that are too short or too long.
pub fn parse_needle_id_cookie(s: &str) -> Result<(NeedleId, Cookie), String> {
    // Go ParsePath: check for a "_" suffix containing a decimal delta
    let (hex_part, delta) = if let Some(underscore_pos) = s.rfind('_') {
        if underscore_pos > 0 {
            let delta_str = &s[underscore_pos + 1..];
            let d: u64 = delta_str
                .parse()
                .map_err(|e| format!("Parse delta error: {}", e))?;
            (&s[..underscore_pos], Some(d))
        } else {
            (s, None)
        }
    } else {
        (s, None)
    };

    // Go: len(key_hash_string) <= CookieSize*2 => error (must be > 8 hex chars)
    if hex_part.len() <= COOKIE_SIZE * 2 {
        return Err("KeyHash is too short.".to_string());
    }
    // Go: len(key_hash_string) > (NeedleIdSize+CookieSize)*2 => error (must be <= 24 hex chars)
    if hex_part.len() > (NEEDLE_ID_SIZE + COOKIE_SIZE) * 2 {
        return Err("KeyHash is too long.".to_string());
    }

    // Split: the last CookieSize*2 hex chars are the cookie, the rest is the needle id
    let split = hex_part.len() - COOKIE_SIZE * 2;
    let needle_id_hex = &hex_part[..split];
    let cookie_hex = &hex_part[split..];

    let needle_id_bytes =
        hex::decode(needle_id_hex).map_err(|e| format!("Parse needleId error: {}", e))?;
    let cookie_bytes = hex::decode(cookie_hex).map_err(|e| format!("Parse cookie error: {}", e))?;

    // Pad the needle id to 8 bytes
    let mut nid_buf = [0u8; 8];
    if needle_id_bytes.len() > 8 {
        return Err("KeyHash is too long.".to_string());
    }
    let start = 8 - needle_id_bytes.len();
    nid_buf[start..].copy_from_slice(&needle_id_bytes);

    let mut key = NeedleId::from_bytes(&nid_buf[0..8]);
    let cookie = Cookie::from_bytes(&cookie_bytes[0..4]);

    // Apply delta if present (Go: n.Id += Uint64ToNeedleId(d))
    if let Some(d) = delta {
        key = NeedleId(key.0.wrapping_add(d));
    }

    Ok((key, cookie))
}
||||
|
||||
// ============================================================================
|
||||
// Errors
|
||||
// ============================================================================
|
||||
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum NeedleError {
|
||||
#[error("size mismatch at offset {offset}: found id={id} size={found:?}, expected size={expected:?}")]
|
||||
SizeMismatch {
|
||||
offset: i64,
|
||||
id: NeedleId,
|
||||
found: Size,
|
||||
expected: Size,
|
||||
},
|
||||
|
||||
#[error("CRC mismatch for needle {needle_id}: got {got:08x}, want {want:08x}")]
|
||||
CrcMismatch {
|
||||
needle_id: NeedleId,
|
||||
got: u32,
|
||||
want: u32,
|
||||
},
|
||||
|
||||
#[error("index out of range ({0})")]
|
||||
IndexOutOfRange(u32),
|
||||
|
||||
#[error("needle tail too short")]
|
||||
TailTooShort,
|
||||
|
||||
#[error("unsupported version: {0}")]
|
||||
UnsupportedVersion(u8),
|
||||
|
||||
#[error("IO error: {0}")]
|
||||
Io(#[from] std::io::Error),
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Tests
|
||||
// ============================================================================
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_parse_header() {
|
||||
let mut buf = [0u8; NEEDLE_HEADER_SIZE];
|
||||
let cookie = Cookie(0xdeadbeef);
|
||||
let id = NeedleId(0x123456789abcdef0);
|
||||
let size = Size(1024);
|
||||
cookie.to_bytes(&mut buf[0..4]);
|
||||
id.to_bytes(&mut buf[4..12]);
|
||||
size.to_bytes(&mut buf[12..16]);
|
||||
|
||||
let (c, i, s) = Needle::parse_header(&buf);
|
||||
assert_eq!(c, cookie);
|
||||
assert_eq!(i, id);
|
||||
assert_eq!(s, size);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_needle_write_read_round_trip_v3() {
|
||||
let mut n = Needle::default();
|
||||
n.cookie = Cookie(42);
|
||||
n.id = NeedleId(100);
|
||||
n.data = b"hello world".to_vec();
|
||||
n.flags = 0;
|
||||
n.set_has_name();
|
||||
n.name = b"test.txt".to_vec();
|
||||
n.name_size = 8;
|
||||
n.set_has_mime();
|
||||
n.mime = b"text/plain".to_vec();
|
||||
n.mime_size = 10;
|
||||
n.set_has_last_modified_date();
|
||||
n.last_modified = 1234567890;
|
||||
n.set_has_ttl();
|
||||
n.ttl = Some(TTL {
|
||||
count: 5,
|
||||
unit: super::super::ttl::TTL_UNIT_DAY,
|
||||
});
|
||||
n.append_at_ns = 999999999;
|
||||
|
||||
let bytes = n.write_bytes(VERSION_3);
|
||||
|
||||
// Verify total size matches expected
|
||||
let expected_size = get_actual_size(n.size, VERSION_3);
|
||||
assert_eq!(bytes.len() as i64, expected_size);
|
||||
|
||||
// Read it back
|
||||
let mut n2 = Needle::default();
|
||||
n2.read_bytes(&bytes, 0, n.size, VERSION_3).unwrap();
|
||||
|
||||
assert_eq!(n2.cookie, n.cookie);
|
||||
assert_eq!(n2.id, n.id);
|
||||
assert_eq!(n2.data, n.data);
|
||||
assert_eq!(n2.name, n.name);
|
||||
assert_eq!(n2.mime, n.mime);
|
||||
assert_eq!(n2.last_modified, n.last_modified);
|
||||
assert_eq!(n2.ttl, n.ttl);
|
||||
assert_eq!(n2.checksum, n.checksum);
|
||||
assert_eq!(n2.append_at_ns, n.append_at_ns);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_needle_write_read_round_trip_v2() {
|
||||
let mut n = Needle::default();
|
||||
n.cookie = Cookie(77);
|
||||
n.id = NeedleId(200);
|
||||
n.data = b"data v2".to_vec();
|
||||
n.flags = 0;
|
||||
|
||||
let bytes = n.write_bytes(VERSION_2);
|
||||
let expected_size = get_actual_size(n.size, VERSION_2);
|
||||
assert_eq!(bytes.len() as i64, expected_size);
|
||||
|
||||
let mut n2 = Needle::default();
|
||||
n2.read_bytes(&bytes, 0, n.size, VERSION_2).unwrap();
|
||||
|
||||
assert_eq!(n2.data, n.data);
|
||||
assert_eq!(n2.checksum, n.checksum);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_read_bytes_meta_only_handles_tombstone_v3() {
|
||||
let mut tombstone = Needle::default();
|
||||
tombstone.cookie = Cookie(0x1234abcd);
|
||||
tombstone.id = NeedleId(300);
|
||||
tombstone.append_at_ns = 999_999;
|
||||
|
||||
let bytes = tombstone.write_bytes(VERSION_3);
|
||||
|
||||
let mut meta = Needle::default();
|
||||
meta.read_bytes_meta_only(&bytes, 0, Size(0), VERSION_3)
|
||||
.unwrap();
|
||||
|
||||
assert_eq!(meta.cookie, tombstone.cookie);
|
||||
assert_eq!(meta.id, tombstone.id);
|
||||
assert_eq!(meta.size, Size(0));
|
||||
assert_eq!(meta.data_size, 0);
|
||||
assert_eq!(meta.append_at_ns, tombstone.append_at_ns);
|
||||
assert_eq!(meta.checksum, tombstone.checksum);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_padding_alignment() {
|
||||
// All actual sizes should be multiples of 8
|
||||
for size_val in 0..50 {
|
||||
let s = Size(size_val);
|
||||
let actual_v2 = get_actual_size(s, VERSION_2);
|
||||
let actual_v3 = get_actual_size(s, VERSION_3);
|
||||
assert_eq!(actual_v2 % 8, 0, "V2 size {} not aligned", size_val);
|
||||
assert_eq!(actual_v3 % 8, 0, "V3 size {} not aligned", size_val);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_file_id_parse() {
|
||||
let fid = FileId::parse("3,01637037d6").unwrap();
|
||||
assert_eq!(fid.volume_id, VolumeId(3));
|
||||
// The hex "01637037d6" is 5 bytes = 0x0163..., padded to 12 bytes
|
||||
assert!(!fid.key.is_empty() || !fid.cookie.0 == 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_file_id_round_trip() {
|
||||
let fid = FileId::new(VolumeId(5), NeedleId(0x123456), Cookie(0xabcd));
|
||||
let s = fid.to_string();
|
||||
let fid2 = FileId::parse(&s).unwrap();
|
||||
assert_eq!(fid, fid2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_needle_id_cookie_format() {
|
||||
let s = format_needle_id_cookie(NeedleId(1), Cookie(0x12345678));
|
||||
let (key, cookie) = parse_needle_id_cookie(&s).unwrap();
|
||||
assert_eq!(key, NeedleId(1));
|
||||
assert_eq!(cookie, Cookie(0x12345678));
|
||||
}
|
||||
}
|
||||
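For readers skimming the diff, the split rule in `parse_needle_id_cookie` (last 8 hex chars are the cookie, the remainder is the zero-padded needle id, optional `_delta` suffix added to the id) can be exercised standalone. A minimal illustrative sketch, where plain `u64`/`u32` stand in for the crate's `NeedleId`/`Cookie` newtypes:

```rust
// Illustrative sketch only: plain integers stand in for NeedleId/Cookie.
fn split_key_cookie(s: &str) -> Result<(u64, u32), String> {
    // Optional "_<delta>" suffix: a decimal value added to the id.
    let (hex, delta) = match s.rfind('_') {
        Some(p) if p > 0 => (
            &s[..p],
            s[p + 1..].parse::<u64>().map_err(|e| e.to_string())?,
        ),
        _ => (s, 0),
    };
    // Must be longer than the 8-char cookie and at most 24 chars (8-byte id + 4-byte cookie).
    if hex.len() <= 8 || hex.len() > 24 {
        return Err("bad length".to_string());
    }
    let (id_hex, cookie_hex) = hex.split_at(hex.len() - 8);
    let id = u64::from_str_radix(id_hex, 16).map_err(|e| e.to_string())?;
    let cookie = u32::from_str_radix(cookie_hex, 16).map_err(|e| e.to_string())?;
    Ok((id.wrapping_add(delta), cookie))
}

fn main() {
    // "01637037d6": id = 0x01, cookie = 0x637037d6
    assert_eq!(split_key_cookie("01637037d6").unwrap(), (0x01, 0x637037d6));
    // "_2" delta shifts the id: 1 + 2 = 3
    assert_eq!(split_key_cookie("01637037d6_2").unwrap(), (0x03, 0x637037d6));
    assert!(split_key_cookie("1234").is_err());
    println!("ok");
}
```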
seaweed-volume/src/storage/needle/ttl.rs (new file, 302 lines)
//! Time-to-live encoding for needles.
//!
//! TTL is stored as 2 bytes: Count(1) + Unit(1).
//! Supported units: minute(m), hour(h), day(d), week(w), month(M), year(y).

use std::fmt;

/// TTL unit constants (matching Go).
pub const TTL_UNIT_EMPTY: u8 = 0;
pub const TTL_UNIT_MINUTE: u8 = 1;
pub const TTL_UNIT_HOUR: u8 = 2;
pub const TTL_UNIT_DAY: u8 = 3;
pub const TTL_UNIT_WEEK: u8 = 4;
pub const TTL_UNIT_MONTH: u8 = 5;
pub const TTL_UNIT_YEAR: u8 = 6;

pub const TTL_BYTES_LENGTH: usize = 2;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct TTL {
    pub count: u8,
    pub unit: u8,
}

impl TTL {
    pub const EMPTY: TTL = TTL { count: 0, unit: 0 };

    pub fn is_empty(&self) -> bool {
        self.count == 0 && self.unit == 0
    }

    /// Load from 2 bytes.
    pub fn from_bytes(input: &[u8]) -> Self {
        if input.len() < 2 {
            return TTL::EMPTY;
        }
        TTL {
            count: input[0],
            unit: input[1],
        }
    }

    /// Serialize to 2 bytes.
    pub fn to_bytes(&self, output: &mut [u8]) {
        assert!(output.len() >= 2);
        output[0] = self.count;
        output[1] = self.unit;
    }

    /// Encode as u32: (count << 8) | unit.
    pub fn to_u32(&self) -> u32 {
        if self.count == 0 {
            return 0;
        }
        ((self.count as u32) << 8) + (self.unit as u32)
    }

    /// Decode from u32.
    pub fn from_u32(v: u32) -> Self {
        if v == 0 {
            return TTL::EMPTY;
        }
        TTL {
            count: (v >> 8) as u8,
            unit: (v & 0xFF) as u8,
        }
    }

    /// Convert to total seconds.
    pub fn to_seconds(&self) -> u64 {
        unit_to_seconds(self.count as u64, self.unit)
    }

    /// Parse from string like "3m", "4h", "5d", "6w", "7M", "8y".
    /// If the string is all digits (no unit suffix), defaults to minutes.
    /// Matches Go's ReadTTL, which calls fitTtlCount to normalize:
    /// e.g. "120m" -> 2h, "7d" -> 1w, "24h" -> 1d.
    pub fn read(s: &str) -> Result<Self, String> {
        let s = s.trim();
        if s.is_empty() {
            return Ok(TTL::EMPTY);
        }
        let last_byte = s.as_bytes()[s.len() - 1];
        let (num_str, unit_byte) = if last_byte.is_ascii_digit() {
            // All digits — default to minutes (matching Go)
            (s, b'm')
        } else {
            (&s[..s.len() - 1], last_byte)
        };
        let count: u32 = num_str
            .parse()
            .map_err(|e| format!("invalid TTL count: {}", e))?;
        let unit = match unit_byte {
            b'm' => TTL_UNIT_MINUTE,
            b'h' => TTL_UNIT_HOUR,
            b'd' => TTL_UNIT_DAY,
            b'w' => TTL_UNIT_WEEK,
            b'M' => TTL_UNIT_MONTH,
            b'y' => TTL_UNIT_YEAR,
            _ => return Err(format!("unknown TTL unit: {}", unit_byte as char)),
        };
        // Match Go's ReadTTL: normalize via fitTtlCount
        Ok(fit_ttl_count(count, unit))
    }

    /// Minutes representation.
    pub fn minutes(&self) -> u32 {
        (self.to_seconds() / 60) as u32
    }
}

fn unit_to_seconds(count: u64, unit: u8) -> u64 {
    match unit {
        TTL_UNIT_EMPTY => 0,
        TTL_UNIT_MINUTE => count * 60,
        TTL_UNIT_HOUR => count * 60 * 60,
        TTL_UNIT_DAY => count * 60 * 60 * 24,
        TTL_UNIT_WEEK => count * 60 * 60 * 24 * 7,
        TTL_UNIT_MONTH => count * 60 * 60 * 24 * 30,
        TTL_UNIT_YEAR => count * 60 * 60 * 24 * 365,
        _ => 0,
    }
}

/// Fit a count+unit into a TTL that fits in a single byte count.
/// Converts to seconds first, then finds the coarsest unit that fits.
/// Matches Go's fitTtlCount called from ReadTTL.
fn fit_ttl_count(count: u32, unit: u8) -> TTL {
    if count == 0 || unit == TTL_UNIT_EMPTY {
        return TTL::EMPTY;
    }

    // Always convert to seconds and normalize (matches Go).
    let seconds = unit_to_seconds(count as u64, unit);
    if seconds == 0 {
        return TTL::EMPTY;
    }

    const YEAR_SECS: u64 = 3600 * 24 * 365;
    const MONTH_SECS: u64 = 3600 * 24 * 30;
    const WEEK_SECS: u64 = 3600 * 24 * 7;
    const DAY_SECS: u64 = 3600 * 24;
    const HOUR_SECS: u64 = 3600;
    const MINUTE_SECS: u64 = 60;

    // First pass: try exact fits from largest to smallest
    if seconds % YEAR_SECS == 0 && seconds / YEAR_SECS < 256 {
        return TTL { count: (seconds / YEAR_SECS) as u8, unit: TTL_UNIT_YEAR };
    }
    if seconds % MONTH_SECS == 0 && seconds / MONTH_SECS < 256 {
        return TTL { count: (seconds / MONTH_SECS) as u8, unit: TTL_UNIT_MONTH };
    }
    if seconds % WEEK_SECS == 0 && seconds / WEEK_SECS < 256 {
        return TTL { count: (seconds / WEEK_SECS) as u8, unit: TTL_UNIT_WEEK };
    }
    if seconds % DAY_SECS == 0 && seconds / DAY_SECS < 256 {
        return TTL { count: (seconds / DAY_SECS) as u8, unit: TTL_UNIT_DAY };
    }
    if seconds % HOUR_SECS == 0 && seconds / HOUR_SECS < 256 {
        return TTL { count: (seconds / HOUR_SECS) as u8, unit: TTL_UNIT_HOUR };
    }
    // Minutes: truncating division
    if seconds / MINUTE_SECS < 256 {
        return TTL { count: (seconds / MINUTE_SECS) as u8, unit: TTL_UNIT_MINUTE };
    }
    // Second pass: truncating division from smallest to largest
    if seconds / HOUR_SECS < 256 {
        return TTL { count: (seconds / HOUR_SECS) as u8, unit: TTL_UNIT_HOUR };
    }
    if seconds / DAY_SECS < 256 {
        return TTL { count: (seconds / DAY_SECS) as u8, unit: TTL_UNIT_DAY };
    }
    if seconds / WEEK_SECS < 256 {
        return TTL { count: (seconds / WEEK_SECS) as u8, unit: TTL_UNIT_WEEK };
    }
    if seconds / MONTH_SECS < 256 {
        return TTL { count: (seconds / MONTH_SECS) as u8, unit: TTL_UNIT_MONTH };
    }
    if seconds / YEAR_SECS < 256 {
        return TTL { count: (seconds / YEAR_SECS) as u8, unit: TTL_UNIT_YEAR };
    }
    TTL::EMPTY
}

fn unit_to_char(unit: u8) -> char {
    match unit {
        TTL_UNIT_MINUTE => 'm',
        TTL_UNIT_HOUR => 'h',
        TTL_UNIT_DAY => 'd',
        TTL_UNIT_WEEK => 'w',
        TTL_UNIT_MONTH => 'M',
        TTL_UNIT_YEAR => 'y',
        _ => ' ',
    }
}

impl fmt::Display for TTL {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.count == 0 || self.unit == TTL_UNIT_EMPTY {
            return Ok(());
        }
        write!(f, "{}{}", self.count, unit_to_char(self.unit))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_ttl_parse() {
        let ttl = TTL::read("3m").unwrap();
        assert_eq!(
            ttl,
            TTL {
                count: 3,
                unit: TTL_UNIT_MINUTE
            }
        );
        assert_eq!(ttl.to_seconds(), 180);
    }

    #[test]
    fn test_ttl_parse_hours() {
        // 24h normalizes to 1d via fitTtlCount
        let ttl = TTL::read("24h").unwrap();
        assert_eq!(ttl.to_seconds(), 86400);
        assert_eq!(ttl, TTL { count: 1, unit: TTL_UNIT_DAY });
    }

    #[test]
    fn test_ttl_display() {
        let ttl = TTL {
            count: 5,
            unit: TTL_UNIT_DAY,
        };
        assert_eq!(ttl.to_string(), "5d");
    }

    #[test]
    fn test_ttl_bytes_round_trip() {
        let ttl = TTL {
            count: 10,
            unit: TTL_UNIT_WEEK,
        };
        let mut buf = [0u8; 2];
        ttl.to_bytes(&mut buf);
        let ttl2 = TTL::from_bytes(&buf);
        assert_eq!(ttl, ttl2);
    }

    #[test]
    fn test_ttl_u32_round_trip() {
        let ttl = TTL {
            count: 42,
            unit: TTL_UNIT_HOUR,
        };
        let v = ttl.to_u32();
        let ttl2 = TTL::from_u32(v);
        assert_eq!(ttl, ttl2);
    }

    #[test]
    fn test_ttl_empty() {
        assert!(TTL::EMPTY.is_empty());
        assert_eq!(TTL::EMPTY.to_seconds(), 0);
        assert_eq!(TTL::EMPTY.to_u32(), 0);
    }

    #[test]
    fn test_ttl_overflow_normalizes() {
        // Go's ReadTTL calls fitTtlCount: 300m = 18000s = 5h (exact fit)
        let ttl = TTL::read("300m").unwrap();
        assert_eq!(ttl, TTL { count: 5, unit: TTL_UNIT_HOUR });

        // 256h = 921600s. Doesn't fit in hours (256 >= 256), doesn't fit exact in days.
        // Second pass: 921600/86400 = 10 (truncated) < 256 -> 10d
        let ttl = TTL::read("256h").unwrap();
        assert_eq!(ttl, TTL { count: 10, unit: TTL_UNIT_DAY });
    }

    #[test]
    fn test_ttl_normalizes_unit() {
        // Go's ReadTTL calls fitTtlCount which normalizes to coarsest unit.
        // 120m -> 2h, 7d -> 1w, 24h -> 1d.
        let ttl = TTL::read("120m").unwrap();
        assert_eq!(ttl, TTL { count: 2, unit: TTL_UNIT_HOUR });

        let ttl = TTL::read("7d").unwrap();
        assert_eq!(ttl, TTL { count: 1, unit: TTL_UNIT_WEEK });

        let ttl = TTL::read("24h").unwrap();
        assert_eq!(ttl, TTL { count: 1, unit: TTL_UNIT_DAY });

        // Values that don't simplify stay as-is
        let ttl = TTL::read("5d").unwrap();
        assert_eq!(ttl, TTL { count: 5, unit: TTL_UNIT_DAY });

        let ttl = TTL::read("3m").unwrap();
        assert_eq!(ttl, TTL { count: 3, unit: TTL_UNIT_MINUTE });
    }
}
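The two-pass fit described in the `fit_ttl_count` comments can be sketched on its own. This illustrative version folds the per-unit branches into a unit table and returns a plain tuple instead of the `TTL` struct; it mirrors the logic above but is not the crate's implementation:

```rust
// Illustrative sketch only: normalize a duration in seconds to (count, unit char),
// coarsest exact fit first, then truncating fits from finest to coarsest.
const UNITS: [(u64, char); 6] = [
    (365 * 86400, 'y'),
    (30 * 86400, 'M'),
    (7 * 86400, 'w'),
    (86400, 'd'),
    (3600, 'h'),
    (60, 'm'),
];

fn fit(seconds: u64) -> Option<(u8, char)> {
    // First pass: exact fit, coarsest unit first.
    for &(secs, unit) in &UNITS {
        if seconds % secs == 0 && seconds / secs < 256 {
            return Some(((seconds / secs) as u8, unit));
        }
    }
    // Second pass: truncating division, finest unit first.
    for &(secs, unit) in UNITS.iter().rev() {
        if seconds / secs < 256 {
            return Some(((seconds / secs) as u8, unit));
        }
    }
    None
}

fn main() {
    assert_eq!(fit(120 * 60), Some((2, 'h'))); // 120m -> 2h (exact)
    assert_eq!(fit(7 * 86400), Some((1, 'w'))); // 7d -> 1w (exact)
    assert_eq!(fit(256 * 3600), Some((10, 'd'))); // 256h -> 10d (truncated)
    println!("ok");
}
```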
seaweed-volume/src/storage/needle_map.rs (new file, 1438 lines; diff suppressed because it is too large)

seaweed-volume/src/storage/needle_map/compact_map.rs (new file, 375 lines)
//! CompactMap: memory-efficient in-memory map of NeedleId -> (Offset, Size).
//!
//! Port of Go's CompactMap from weed/storage/needle_map/compact_map.go.
//! Uses segmented sorted arrays with compressed keys (u16 instead of u64)
//! to achieve ~10 bytes per entry instead of ~40-48 bytes with HashMap.
//!
//! NeedleId is split into: chunk = id / SEGMENT_CHUNK_SIZE, compact_key = id % SEGMENT_CHUNK_SIZE.
//! Each segment stores up to SEGMENT_CHUNK_SIZE entries in a sorted Vec, searched via binary search.
//! Best case (ordered inserts): O(1). Worst case: O(log n) per segment.

use std::collections::HashMap;

use super::NeedleValue;
use crate::storage::types::*;

/// Maximum entries per segment. Must be <= u16::MAX (65535).
const SEGMENT_CHUNK_SIZE: u64 = 50_000;

/// Compact key: only the low bits of NeedleId within a segment.
type CompactKey = u16;

/// Segment chunk identifier: NeedleId / SEGMENT_CHUNK_SIZE.
type Chunk = u64;

/// Compact entry: 10 bytes (2 + 4 + 4) vs 16 bytes for full NeedleId + NeedleValue.
#[derive(Clone, Copy)]
struct CompactEntry {
    key: CompactKey,           // 2 bytes
    offset: [u8; OFFSET_SIZE], // 4 bytes
    size: Size,                // 4 bytes
}

impl CompactEntry {
    fn to_needle_value(&self) -> NeedleValue {
        NeedleValue {
            offset: Offset::from_bytes(&self.offset),
            size: self.size,
        }
    }
}

/// A sorted segment of compact entries for a given chunk.
struct Segment {
    list: Vec<CompactEntry>,
    chunk: Chunk,
    first_key: CompactKey,
    last_key: CompactKey,
}

impl Segment {
    fn new(chunk: Chunk) -> Self {
        Segment {
            list: Vec::new(),
            chunk,
            first_key: u16::MAX,
            last_key: 0,
        }
    }

    fn compact_key(&self, id: NeedleId) -> CompactKey {
        (id.0 - SEGMENT_CHUNK_SIZE * self.chunk) as CompactKey
    }

    /// Binary search for a compact key. Returns (index, found).
    /// If not found, index is the insertion point.
    fn bsearch(&self, id: NeedleId) -> (usize, bool) {
        let ck = self.compact_key(id);

        if self.list.is_empty() {
            return (0, false);
        }
        if ck == self.first_key {
            return (0, true);
        }
        if ck < self.first_key {
            return (0, false);
        }
        if ck == self.last_key {
            return (self.list.len() - 1, true);
        }
        if ck > self.last_key {
            return (self.list.len(), false);
        }

        let i = self.list.partition_point(|e| e.key < ck);
        if i < self.list.len() && self.list[i].key == ck {
            (i, true)
        } else {
            (i, false)
        }
    }

    /// Insert or update. Returns old NeedleValue if updating.
    fn set(&mut self, id: NeedleId, offset: Offset, size: Size) -> Option<NeedleValue> {
        let (i, found) = self.bsearch(id);

        if found {
            let old = self.list[i].to_needle_value();
            let mut offset_bytes = [0u8; OFFSET_SIZE];
            offset.to_bytes(&mut offset_bytes);
            self.list[i].offset = offset_bytes;
            self.list[i].size = size;
            return Some(old);
        }

        // Insert at sorted position
        let ck = self.compact_key(id);
        let mut offset_bytes = [0u8; OFFSET_SIZE];
        offset.to_bytes(&mut offset_bytes);

        let entry = CompactEntry {
            key: ck,
            offset: offset_bytes,
            size,
        };

        // Match Go panic: don't exceed segment capacity
        if self.list.len() >= SEGMENT_CHUNK_SIZE as usize {
            panic!(
                "attempted to write more than {} entries on CompactMapSegment",
                SEGMENT_CHUNK_SIZE
            );
        }

        if self.list.len() == SEGMENT_CHUNK_SIZE as usize - 1 {
            // Pin capacity to exact size when maxing out
            let mut new_list = Vec::with_capacity(SEGMENT_CHUNK_SIZE as usize);
            new_list.extend_from_slice(&self.list[..i]);
            new_list.push(entry);
            new_list.extend_from_slice(&self.list[i..]);
            self.list = new_list;
        } else {
            self.list.insert(i, entry);
        }

        if ck < self.first_key {
            self.first_key = ck;
        }
        if ck > self.last_key {
            self.last_key = ck;
        }

        None
    }

    fn get(&self, id: NeedleId) -> Option<NeedleValue> {
        let (i, found) = self.bsearch(id);
        if found {
            Some(self.list[i].to_needle_value())
        } else {
            None
        }
    }

    /// Mark as deleted by negating size. Returns previous size if not already deleted.
    /// Matches Go behavior: checks !IsDeleted() (i.e., size >= 0).
    fn delete(&mut self, id: NeedleId) -> Option<Size> {
        let (i, found) = self.bsearch(id);
        if found && !self.list[i].size.is_deleted() {
            let old_size = self.list[i].size;
            if self.list[i].size.0 == 0 {
                self.list[i].size = TOMBSTONE_FILE_SIZE;
            } else {
                self.list[i].size = Size(-self.list[i].size.0);
            }
            Some(old_size)
        } else {
            None
        }
    }
}

/// Memory-efficient map of NeedleId -> (Offset, Size).
/// Segments NeedleIds into chunks of 50,000 and stores compact 10-byte entries
/// in sorted arrays, using only 2 bytes for the key within each segment.
pub struct CompactMap {
    segments: HashMap<Chunk, Segment>,
}

impl CompactMap {
    pub fn new() -> Self {
        CompactMap {
            segments: HashMap::new(),
        }
    }

    fn segment_for_key(&mut self, id: NeedleId) -> &mut Segment {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        self.segments
            .entry(chunk)
            .or_insert_with(|| Segment::new(chunk))
    }

    /// Insert or update. Returns old NeedleValue if updating.
    pub fn set(&mut self, id: NeedleId, offset: Offset, size: Size) -> Option<NeedleValue> {
        self.segment_for_key(id).set(id, offset, size)
    }

    pub fn get(&self, id: NeedleId) -> Option<NeedleValue> {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        self.segments.get(&chunk)?.get(id)
    }

    /// Mark as deleted. Returns previous size if was valid.
    pub fn delete(&mut self, id: NeedleId) -> Option<Size> {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        self.segments.get_mut(&chunk)?.delete(id)
    }

    /// Remove entry entirely (used during idx loading).
    pub fn remove(&mut self, id: NeedleId) -> Option<NeedleValue> {
        let chunk = id.0 / SEGMENT_CHUNK_SIZE;
        let segment = self.segments.get_mut(&chunk)?;
        let (i, found) = segment.bsearch(id);
        if found {
            let entry = segment.list.remove(i);
            // Update first/last keys
            if segment.list.is_empty() {
                segment.first_key = u16::MAX;
                segment.last_key = 0;
            } else {
                segment.first_key = segment.list[0].key;
                segment.last_key = segment.list[segment.list.len() - 1].key;
            }
            Some(entry.to_needle_value())
        } else {
            None
        }
    }

    /// Iterate all entries in ascending NeedleId order.
    pub fn ascending_visit<F, E>(&self, mut f: F) -> Result<(), E>
    where
        F: FnMut(NeedleId, &NeedleValue) -> Result<(), E>,
    {
        let mut chunks: Vec<Chunk> = self.segments.keys().copied().collect();
        chunks.sort_unstable();

        for chunk in chunks {
            let segment = &self.segments[&chunk];
            for entry in &segment.list {
                let id = NeedleId(SEGMENT_CHUNK_SIZE * segment.chunk + entry.key as u64);
                let nv = entry.to_needle_value();
                f(id, &nv)?;
            }
        }
        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    fn offset(v: u32) -> Offset {
        Offset::from_actual_offset(v as i64 * NEEDLE_PADDING_SIZE as i64)
    }

    #[test]
    fn test_compact_map_basic() {
        let mut m = CompactMap::new();

        // Insert
        assert!(m.set(NeedleId(1), offset(100), Size(50)).is_none());
        assert!(m.set(NeedleId(2), offset(200), Size(60)).is_none());

        // Get
        let nv = m.get(NeedleId(1)).unwrap();
        assert_eq!(nv.size, Size(50));

        // Update returns old value
        let old = m.set(NeedleId(1), offset(300), Size(70)).unwrap();
        assert_eq!(old.size, Size(50));

        // Get updated value
        let nv = m.get(NeedleId(1)).unwrap();
        assert_eq!(nv.size, Size(70));

        // Miss
        assert!(m.get(NeedleId(999)).is_none());
    }

    #[test]
    fn test_compact_map_delete() {
        let mut m = CompactMap::new();
        m.set(NeedleId(1), offset(100), Size(50));

        // Delete returns old size
        let old = m.delete(NeedleId(1)).unwrap();
        assert_eq!(old, Size(50));

        // Get returns deleted (negative size)
        let nv = m.get(NeedleId(1)).unwrap();
        assert!(nv.size.is_deleted());

        // Delete again returns None (already deleted)
        assert!(m.delete(NeedleId(1)).is_none());
    }

    #[test]
    fn test_compact_map_zero_size_delete() {
        let mut m = CompactMap::new();
        m.set(NeedleId(1), offset(100), Size(0));

        let old = m.delete(NeedleId(1)).unwrap();
        assert_eq!(old, Size(0));

        let nv = m.get(NeedleId(1)).unwrap();
        assert_eq!(nv.size, TOMBSTONE_FILE_SIZE);
    }

    #[test]
    fn test_compact_map_cross_segment() {
        let mut m = CompactMap::new();

        // Insert across multiple segments
        m.set(NeedleId(1), offset(1), Size(1));
        m.set(NeedleId(50_000), offset(2), Size(2));
        m.set(NeedleId(100_000), offset(3), Size(3));

        assert_eq!(m.get(NeedleId(1)).unwrap().size, Size(1));
        assert_eq!(m.get(NeedleId(50_000)).unwrap().size, Size(2));
        assert_eq!(m.get(NeedleId(100_000)).unwrap().size, Size(3));
    }

    #[test]
    fn test_compact_map_ascending_visit() {
        let mut m = CompactMap::new();
        m.set(NeedleId(100_005), offset(3), Size(3));
        m.set(NeedleId(5), offset(1), Size(1));
        m.set(NeedleId(50_005), offset(2), Size(2));

        let mut visited = Vec::new();
        m.ascending_visit(|id, nv| {
            visited.push((id, nv.size));
            Ok::<_, String>(())
        })
        .unwrap();

        assert_eq!(visited.len(), 3);
        assert_eq!(visited[0].0, NeedleId(5));
        assert_eq!(visited[1].0, NeedleId(50_005));
        assert_eq!(visited[2].0, NeedleId(100_005));
    }

    #[test]
    fn test_compact_map_remove() {
        let mut m = CompactMap::new();
        m.set(NeedleId(1), offset(100), Size(50));
        m.set(NeedleId(2), offset(200), Size(60));

        let removed = m.remove(NeedleId(1)).unwrap();
        assert_eq!(removed.size, Size(50));

        assert!(m.get(NeedleId(1)).is_none());
        assert_eq!(m.get(NeedleId(2)).unwrap().size, Size(60));
    }

    #[test]
    fn test_compact_map_reverse_insert_order() {
        let mut m = CompactMap::new();
        // Insert in reverse order to test sorted insert
        for i in (0..100).rev() {
            m.set(NeedleId(i), offset(i as u32), Size(i as i32));
        }
        for i in 0..100 {
            assert_eq!(m.get(NeedleId(i)).unwrap().size, Size(i as i32));
        }
    }
}
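The chunk/compact-key split behind `CompactMap` is a plain divide/mod round trip; the 2-byte key works only because the chunk size fits in a `u16`. A minimal standalone sketch (same 50,000 chunk size, plain `u64` ids in place of `NeedleId`):

```rust
// Illustrative sketch only: the CompactMap key math, without the segment storage.
const SEGMENT_CHUNK_SIZE: u64 = 50_000;

/// chunk selects the segment; the 2-byte compact key addresses within it.
fn split(id: u64) -> (u64, u16) {
    (id / SEGMENT_CHUNK_SIZE, (id % SEGMENT_CHUNK_SIZE) as u16)
}

fn join(chunk: u64, key: u16) -> u64 {
    chunk * SEGMENT_CHUNK_SIZE + key as u64
}

fn main() {
    // Round trip holds for any id, and the key always fits in u16.
    for id in [0u64, 49_999, 50_000, 100_005, u64::from(u32::MAX)] {
        let (c, k) = split(id);
        assert_eq!(join(c, k), id);
        assert!((k as u64) < SEGMENT_CHUNK_SIZE);
    }
    assert_eq!(split(100_005), (2, 5));
    println!("ok");
}
```

This is why each entry needs only 2 key bytes in memory (plus 4 offset + 4 size) while still reconstructing the full 8-byte NeedleId during iteration.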
seaweed-volume/src/storage/store.rs (new file, 1297 lines; diff suppressed because it is too large)

seaweed-volume/src/storage/super_block.rs (new file, 289 lines)
//! SuperBlock: the 8-byte (+ optional extra) header at the start of every .dat file.
//!
//! Byte layout:
//!   [0]    Version
//!   [1]    ReplicaPlacement byte
//!   [2..4] TTL (2 bytes)
//!   [4..6] CompactionRevision (u16 big-endian)
//!   [6..8] ExtraSize (u16 big-endian)
//!   [8..]  Extra data (protobuf, ExtraSize bytes) — only for Version 2/3

use crate::storage::needle::ttl::TTL;
use crate::storage::types::Version;

pub const SUPER_BLOCK_SIZE: usize = 8;

/// SuperBlock metadata at the start of a volume .dat file.
#[derive(Debug, Clone)]
pub struct SuperBlock {
    pub version: Version,
    pub replica_placement: ReplicaPlacement,
    pub ttl: TTL,
    pub compaction_revision: u16,
    pub extra_size: u16,
    pub extra_data: Vec<u8>, // raw protobuf bytes (SuperBlockExtra)
}

impl SuperBlock {
    /// Total block size on disk (base 8 + extra).
    pub fn block_size(&self) -> usize {
        match self.version.0 {
            2 | 3 => SUPER_BLOCK_SIZE + self.extra_size as usize,
            _ => SUPER_BLOCK_SIZE,
        }
    }

    /// Serialize to bytes.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut header = vec![0u8; SUPER_BLOCK_SIZE];
        header[0] = self.version.0;
        header[1] = self.replica_placement.to_byte();
        self.ttl.to_bytes(&mut header[2..4]);
        header[4..6].copy_from_slice(&self.compaction_revision.to_be_bytes());

        if !self.extra_data.is_empty() {
            // Go checks extraSize > 256*256-2 and calls glog.Fatalf; guard against u16 overflow.
            assert!(
                self.extra_data.len() <= 65534,
                "super block extra data too large: {} > 65534",
                self.extra_data.len()
            );
            let extra_size = self.extra_data.len() as u16;
            header[6..8].copy_from_slice(&extra_size.to_be_bytes());
            header.extend_from_slice(&self.extra_data);
        }

        header
    }

    /// Parse from bytes (must be at least SUPER_BLOCK_SIZE bytes).
    pub fn from_bytes(bytes: &[u8]) -> Result<Self, SuperBlockError> {
        if bytes.len() < SUPER_BLOCK_SIZE {
            return Err(SuperBlockError::TooShort(bytes.len()));
        }

        let version = Version(bytes[0]);
        let replica_placement = ReplicaPlacement::from_byte(bytes[1])?;
        let ttl = TTL::from_bytes(&bytes[2..4]);
        let compaction_revision = u16::from_be_bytes([bytes[4], bytes[5]]);
        let extra_size = u16::from_be_bytes([bytes[6], bytes[7]]);

        let extra_data = if extra_size > 0 && bytes.len() >= SUPER_BLOCK_SIZE + extra_size as usize
        {
            bytes[SUPER_BLOCK_SIZE..SUPER_BLOCK_SIZE + extra_size as usize].to_vec()
        } else {
            vec![]
        };

        Ok(SuperBlock {
            version,
            replica_placement,
            ttl,
            compaction_revision,
            extra_size,
            extra_data,
        })
    }

    pub fn initialized(&self) -> bool {
        true // ReplicaPlacement and TTL are always valid after construction
    }
}

impl Default for SuperBlock {
    fn default() -> Self {
        SuperBlock {
            version: Version::current(),
            replica_placement: ReplicaPlacement::default(),
            ttl: TTL::EMPTY,
            compaction_revision: 0,
            extra_size: 0,
            extra_data: vec![],
        }
    }
}

// ============================================================================
// ReplicaPlacement
// ============================================================================

/// Replication strategy encoded as a single byte.
///
/// Byte value = DiffDataCenterCount * 100 + DiffRackCount * 10 + SameRackCount
///
/// Examples:
///   "000" → no replication (1 copy total)
///   "010" → 1 copy in a different rack (2 copies total)
///   "100" → 1 copy in a different datacenter
///   "200" → 2 copies in different datacenters
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct ReplicaPlacement {
    pub same_rack_count: u8,
    pub diff_rack_count: u8,
    pub diff_data_center_count: u8,
}

impl ReplicaPlacement {
    /// Parse from a string like "000", "010", "100".
    /// Accepts 0-3 character strings, padding with leading zeros to match Go behavior.
    /// E.g. "" -> "000", "1" -> "001", "01" -> "001", "010" -> "010"
    pub fn from_string(s: &str) -> Result<Self, SuperBlockError> {
        let s = s.trim();
        if s.is_empty() {
            return Ok(ReplicaPlacement::default());
        }
        // Pad with leading zeros to 3 chars, matching Go's NewReplicaPlacementFromString
        let padded = match s.len() {
            1 => format!("00{}", s),
            2 => format!("0{}", s),
            3 => s.to_string(),
            _ => return Err(SuperBlockError::InvalidReplicaPlacement(s.to_string())),
        };
        let chars: Vec<char> = padded.chars().collect();
        let dc = chars[0]
            .to_digit(10)
            .ok_or_else(|| SuperBlockError::InvalidReplicaPlacement(s.to_string()))?
            as u8;
        let rack = chars[1]
            .to_digit(10)
            .ok_or_else(|| SuperBlockError::InvalidReplicaPlacement(s.to_string()))?
            as u8;
        let same = chars[2]
            .to_digit(10)
            .ok_or_else(|| SuperBlockError::InvalidReplicaPlacement(s.to_string()))?
            as u8;
        // Go validates: value = dc*100 + rack*10 + same must fit in a byte
        let value = dc as u16 * 100 + rack as u16 * 10 + same as u16;
        if value > 255 {
|
||||
return Err(SuperBlockError::InvalidReplicaPlacement(s.to_string()));
|
||||
}
|
||||
Ok(ReplicaPlacement {
|
||||
diff_data_center_count: dc,
|
||||
diff_rack_count: rack,
|
||||
same_rack_count: same,
|
||||
})
|
||||
}
|
||||
|
||||
/// Parse from a single byte.
|
||||
pub fn from_byte(b: u8) -> Result<Self, SuperBlockError> {
|
||||
Ok(ReplicaPlacement {
|
||||
diff_data_center_count: b / 100,
|
||||
diff_rack_count: (b % 100) / 10,
|
||||
same_rack_count: b % 10,
|
||||
})
|
||||
}
|
||||
|
||||
/// Encode as a single byte.
|
||||
pub fn to_byte(&self) -> u8 {
|
||||
self.diff_data_center_count * 100 + self.diff_rack_count * 10 + self.same_rack_count
|
||||
}
|
||||
|
||||
/// Total number of copies (including the original).
|
||||
pub fn get_copy_count(&self) -> u8 {
|
||||
self.diff_data_center_count + self.diff_rack_count + self.same_rack_count + 1
|
||||
}
|
||||
|
||||
/// Whether this placement requires replication (more than 1 copy).
|
||||
pub fn has_replication(&self) -> bool {
|
||||
self.get_copy_count() > 1
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Display for ReplicaPlacement {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
write!(
|
||||
f,
|
||||
"{}{}{}",
|
||||
self.diff_data_center_count, self.diff_rack_count, self.same_rack_count
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Errors
|
||||
// ============================================================================
|
||||
|
||||
#[derive(Debug, thiserror::Error)]
|
||||
pub enum SuperBlockError {
|
||||
#[error("super block too short: {0} bytes")]
|
||||
TooShort(usize),
|
||||
|
||||
#[error("invalid replica placement: {0}")]
|
||||
InvalidReplicaPlacement(String),
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Tests
|
||||
// ============================================================================
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::storage::types::*;
|
||||
|
||||
#[test]
|
||||
fn test_super_block_round_trip() {
|
||||
let sb = SuperBlock {
|
||||
version: VERSION_3,
|
||||
replica_placement: ReplicaPlacement::from_string("010").unwrap(),
|
||||
ttl: TTL { count: 5, unit: 3 },
|
||||
compaction_revision: 42,
|
||||
extra_size: 0,
|
||||
extra_data: vec![],
|
||||
};
|
||||
|
||||
let bytes = sb.to_bytes();
|
||||
assert_eq!(bytes.len(), SUPER_BLOCK_SIZE);
|
||||
|
||||
let sb2 = SuperBlock::from_bytes(&bytes).unwrap();
|
||||
assert_eq!(sb2.version, sb.version);
|
||||
assert_eq!(sb2.replica_placement, sb.replica_placement);
|
||||
assert_eq!(sb2.ttl, sb.ttl);
|
||||
assert_eq!(sb2.compaction_revision, sb.compaction_revision);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_super_block_with_extra() {
|
||||
let sb = SuperBlock {
|
||||
version: VERSION_3,
|
||||
replica_placement: ReplicaPlacement::default(),
|
||||
ttl: TTL::EMPTY,
|
||||
compaction_revision: 0,
|
||||
extra_size: 3,
|
||||
extra_data: vec![1, 2, 3],
|
||||
};
|
||||
|
||||
let bytes = sb.to_bytes();
|
||||
assert_eq!(bytes.len(), SUPER_BLOCK_SIZE + 3);
|
||||
|
||||
let sb2 = SuperBlock::from_bytes(&bytes).unwrap();
|
||||
assert_eq!(sb2.extra_data, vec![1, 2, 3]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_replica_placement_byte_round_trip() {
|
||||
let rp = ReplicaPlacement::from_string("123").unwrap();
|
||||
assert_eq!(rp.diff_data_center_count, 1);
|
||||
assert_eq!(rp.diff_rack_count, 2);
|
||||
assert_eq!(rp.same_rack_count, 3);
|
||||
assert_eq!(rp.to_byte(), 123);
|
||||
assert_eq!(rp.get_copy_count(), 7); // 1+2+3+1
|
||||
|
||||
let rp2 = ReplicaPlacement::from_byte(123).unwrap();
|
||||
assert_eq!(rp, rp2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_replica_placement_no_replication() {
|
||||
let rp = ReplicaPlacement::from_string("000").unwrap();
|
||||
assert!(!rp.has_replication());
|
||||
assert_eq!(rp.get_copy_count(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_replica_placement_display() {
|
||||
let rp = ReplicaPlacement::from_string("010").unwrap();
|
||||
assert_eq!(rp.to_string(), "010");
|
||||
assert!(rp.has_replication());
|
||||
}
|
||||
}
|
||||
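The `ReplicaPlacement` byte is just three positional decimal digits packed into one value. A minimal standalone sketch of the same scheme (independent of the crate's types, for illustration only):

```rust
// Sketch of the ReplicaPlacement byte encoding:
// value = diff_data_center * 100 + diff_rack * 10 + same_rack.
fn to_byte(dc: u8, rack: u8, same: u8) -> u8 {
    dc * 100 + rack * 10 + same
}

fn from_byte(b: u8) -> (u8, u8, u8) {
    (b / 100, (b % 100) / 10, b % 10)
}

fn main() {
    // "123" => 1 DC copy, 2 rack copies, 3 same-rack copies.
    let b = to_byte(1, 2, 3);
    assert_eq!(b, 123);
    assert_eq!(from_byte(b), (1, 2, 3));
    // Copy count includes the original copy.
    let (dc, rack, same) = from_byte(b);
    assert_eq!(dc + rack + same + 1, 7);
    println!("ok");
}
```

Note the byte ceiling: any placement whose digits exceed 255 when packed (e.g. "300") is rejected by the string parser above, which is why it checks `value > 255` before constructing the struct.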
seaweed-volume/src/storage/types.rs | 679 (new file)
@@ -0,0 +1,679 @@
//! Core storage types: NeedleId, Offset, Size, Cookie, DiskType.
//!
//! These types define the binary-compatible on-disk format matching the Go implementation.
//! CRITICAL: Byte layout must match exactly for cross-compatibility.

use std::fmt;

// ============================================================================
// Constants
// ============================================================================

pub const NEEDLE_ID_SIZE: usize = 8;
pub const NEEDLE_ID_EMPTY: u64 = 0;
pub const COOKIE_SIZE: usize = 4;
pub const SIZE_SIZE: usize = 4;
pub const NEEDLE_HEADER_SIZE: usize = COOKIE_SIZE + NEEDLE_ID_SIZE + SIZE_SIZE; // 16
pub const DATA_SIZE_SIZE: usize = 4;
pub const TIMESTAMP_SIZE: usize = 8;
pub const NEEDLE_PADDING_SIZE: usize = 8;
pub const NEEDLE_CHECKSUM_SIZE: usize = 4;

/// 5-byte offset mode (matching Go production builds with `-tags 5BytesOffset`).
/// Max volume size: 8TB. Index entry: 17 bytes (8 + 5 + 4).
#[cfg(feature = "5bytes")]
pub const OFFSET_SIZE: usize = 5;
#[cfg(feature = "5bytes")]
pub const MAX_POSSIBLE_VOLUME_SIZE: u64 = 4 * 1024 * 1024 * 1024 * 8 * 256; // 8TB

/// 4-byte offset mode (matching Go default build without `5BytesOffset`).
/// Max volume size: 32GB. Index entry: 16 bytes (8 + 4 + 4).
#[cfg(not(feature = "5bytes"))]
pub const OFFSET_SIZE: usize = 4;
#[cfg(not(feature = "5bytes"))]
pub const MAX_POSSIBLE_VOLUME_SIZE: u64 = 4 * 1024 * 1024 * 1024 * 8; // 32GB

pub const NEEDLE_MAP_ENTRY_SIZE: usize = NEEDLE_ID_SIZE + OFFSET_SIZE + SIZE_SIZE;

// ============================================================================
// NeedleId
// ============================================================================

/// 64-bit unique identifier for a needle within a volume.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord, Default)]
pub struct NeedleId(pub u64);

impl NeedleId {
    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= NEEDLE_ID_SIZE);
        bytes[0..8].copy_from_slice(&self.0.to_be_bytes());
    }

    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= NEEDLE_ID_SIZE);
        NeedleId(u64::from_be_bytes([
            bytes[0], bytes[1], bytes[2], bytes[3], bytes[4], bytes[5], bytes[6], bytes[7],
        ]))
    }

    pub fn is_empty(&self) -> bool {
        self.0 == 0
    }

    /// Parse a hex string into a NeedleId.
    pub fn parse(s: &str) -> Result<Self, std::num::ParseIntError> {
        u64::from_str_radix(s, 16).map(NeedleId)
    }
}

impl fmt::Display for NeedleId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:x}", self.0)
    }
}

impl From<u64> for NeedleId {
    fn from(v: u64) -> Self {
        NeedleId(v)
    }
}

impl From<NeedleId> for u64 {
    fn from(v: NeedleId) -> Self {
        v.0
    }
}

// ============================================================================
// Cookie
// ============================================================================

/// Random 32-bit value to mitigate brute-force lookups.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub struct Cookie(pub u32);

impl Cookie {
    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= COOKIE_SIZE);
        bytes[0..4].copy_from_slice(&self.0.to_be_bytes());
    }

    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= COOKIE_SIZE);
        Cookie(u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
    }

    /// Parse a hex string into a Cookie.
    pub fn parse(s: &str) -> Result<Self, std::num::ParseIntError> {
        u32::from_str_radix(s, 16).map(Cookie)
    }
}

impl fmt::Display for Cookie {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:x}", self.0)
    }
}

impl From<u32> for Cookie {
    fn from(v: u32) -> Self {
        Cookie(v)
    }
}

// ============================================================================
// Size
// ============================================================================

/// Needle size as stored in the index. Negative = deleted.
///
/// - Positive: valid needle with that many bytes of body content
/// - TombstoneFileSize (-1): tombstone marker
/// - Other negative: deleted, absolute value was the original size
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub struct Size(pub i32);

/// Special marker for a tombstone (deletion marker) entry.
pub const TOMBSTONE_FILE_SIZE: Size = Size(-1);

impl Size {
    pub fn is_tombstone(&self) -> bool {
        self.0 == TOMBSTONE_FILE_SIZE.0
    }

    pub fn is_deleted(&self) -> bool {
        self.0 < 0 || self.0 == TOMBSTONE_FILE_SIZE.0
    }

    pub fn is_valid(&self) -> bool {
        self.0 > 0 && !self.is_tombstone()
    }

    /// Raw storage size. For tombstones returns 0; for negative returns abs value.
    pub fn raw(&self) -> u32 {
        if self.is_tombstone() {
            return 0;
        }
        // unsigned_abs avoids the overflow panic that `(self.0 * -1) as u32`
        // would hit for i32::MIN, while preserving behavior everywhere else.
        self.0.unsigned_abs()
    }

    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= SIZE_SIZE);
        bytes[0..4].copy_from_slice(&(self.0 as u32).to_be_bytes());
    }

    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= SIZE_SIZE);
        let v = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        Size(v as i32)
    }
}

impl From<i32> for Size {
    fn from(v: i32) -> Self {
        Size(v)
    }
}

impl From<Size> for i32 {
    fn from(v: Size) -> Self {
        v.0
    }
}

// ============================================================================
// Offset
// ============================================================================

/// Offset encoding for needle positions in .dat files.
///
/// The offset is stored divided by NEEDLE_PADDING_SIZE (8).
///
/// With `5bytes` feature (default, matching Go production builds):
///   5 bytes can address up to 8TB.
///   On-disk layout: [b3][b2][b1][b0][b4] (big-endian 4 bytes + 1 high byte)
///
/// Without `5bytes` feature (matching Go default build):
///   4 bytes can address up to 32GB.
///   On-disk layout: [b3][b2][b1][b0] (big-endian 4 bytes)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Default)]
pub struct Offset {
    pub b0: u8,
    pub b1: u8,
    pub b2: u8,
    pub b3: u8,
    #[cfg(feature = "5bytes")]
    pub b4: u8,
}

impl Offset {
    /// Convert to the actual byte offset in the .dat file.
    pub fn to_actual_offset(&self) -> i64 {
        let stored = self.b0 as i64
            + (self.b1 as i64) * 256
            + (self.b2 as i64) * 65536
            + (self.b3 as i64) * 16777216;
        #[cfg(feature = "5bytes")]
        let stored = stored + (self.b4 as i64) * 4294967296; // 1 << 32
        stored * NEEDLE_PADDING_SIZE as i64
    }

    /// Create an Offset from an actual byte offset.
    pub fn from_actual_offset(offset: i64) -> Self {
        let smaller = offset / NEEDLE_PADDING_SIZE as i64;
        Offset {
            b0: smaller as u8,
            b1: (smaller >> 8) as u8,
            b2: (smaller >> 16) as u8,
            b3: (smaller >> 24) as u8,
            #[cfg(feature = "5bytes")]
            b4: (smaller >> 32) as u8,
        }
    }

    /// Serialize to bytes in the .idx file format.
    /// 5-byte layout: [b3][b2][b1][b0][b4]
    /// 4-byte layout: [b3][b2][b1][b0]
    pub fn to_bytes(&self, bytes: &mut [u8]) {
        assert!(bytes.len() >= OFFSET_SIZE);
        bytes[0] = self.b3;
        bytes[1] = self.b2;
        bytes[2] = self.b1;
        bytes[3] = self.b0;
        #[cfg(feature = "5bytes")]
        {
            bytes[4] = self.b4;
        }
    }

    /// Deserialize from bytes in the .idx file format.
    pub fn from_bytes(bytes: &[u8]) -> Self {
        assert!(bytes.len() >= OFFSET_SIZE);
        Offset {
            b3: bytes[0],
            b2: bytes[1],
            b1: bytes[2],
            b0: bytes[3],
            #[cfg(feature = "5bytes")]
            b4: bytes[4],
        }
    }

    pub fn is_zero(&self) -> bool {
        #[cfg(feature = "5bytes")]
        {
            self.b0 == 0 && self.b1 == 0 && self.b2 == 0 && self.b3 == 0 && self.b4 == 0
        }
        #[cfg(not(feature = "5bytes"))]
        {
            self.b0 == 0 && self.b1 == 0 && self.b2 == 0 && self.b3 == 0
        }
    }
}

impl fmt::Display for Offset {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.to_actual_offset())
    }
}

// ============================================================================
// DiskType
// ============================================================================

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum DiskType {
    HardDrive,
    Ssd,
    Custom(String),
}

impl DiskType {
    pub fn from_string(s: &str) -> Self {
        match s.to_lowercase().as_str() {
            "" | "hdd" => DiskType::HardDrive,
            "ssd" => DiskType::Ssd,
            other => DiskType::Custom(other.to_string()),
        }
    }

    pub fn readable_string(&self) -> &str {
        match self {
            DiskType::HardDrive => "hdd",
            DiskType::Ssd => "ssd",
            DiskType::Custom(s) => s,
        }
    }
}

impl fmt::Display for DiskType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DiskType::HardDrive => write!(f, ""),
            DiskType::Ssd => write!(f, "ssd"),
            DiskType::Custom(s) => write!(f, "{}", s),
        }
    }
}

impl Default for DiskType {
    fn default() -> Self {
        DiskType::HardDrive
    }
}

// ============================================================================
// VolumeId
// ============================================================================

/// Volume identifier, stored as u32.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord, Default)]
pub struct VolumeId(pub u32);

impl VolumeId {
    pub fn parse(s: &str) -> Result<Self, std::num::ParseIntError> {
        s.parse::<u32>().map(VolumeId)
    }

    pub fn next(&self) -> VolumeId {
        VolumeId(self.0 + 1)
    }
}

impl fmt::Display for VolumeId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl From<u32> for VolumeId {
    fn from(v: u32) -> Self {
        VolumeId(v)
    }
}

// ============================================================================
// Version
// ============================================================================

/// Needle storage format version.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Version(pub u8);

pub const VERSION_1: Version = Version(1);
pub const VERSION_2: Version = Version(2);
pub const VERSION_3: Version = Version(3);

impl Version {
    pub fn current() -> Self {
        VERSION_3
    }

    pub fn is_supported(&self) -> bool {
        self.0 >= 1 && self.0 <= 3
    }
}

impl Default for Version {
    fn default() -> Self {
        VERSION_3
    }
}

impl From<u8> for Version {
    fn from(v: u8) -> Self {
        Version(v)
    }
}

// ============================================================================
// ReadOption
// ============================================================================

/// Options controlling needle read behavior, matching Go's `ReadOption` in store.go.
///
/// Fields are split into request-side options (set by the caller) and response-side
/// flags (set during the read to communicate status back).
#[derive(Debug, Clone)]
pub struct ReadOption {
    // -- request --
    /// If true, allow reading needles that have been soft-deleted.
    pub read_deleted: bool,
    /// If true, attempt to read only metadata for large needles (> PagedReadLimit).
    pub attempt_meta_only: bool,
    /// If true, the caller requires metadata only (no data payload).
    pub must_meta_only: bool,

    // -- response --
    /// Set to true when the read actually returned metadata only.
    pub is_meta_only: bool,
    /// Compaction revision at the time of the read (for consistency during streaming).
    pub volume_revision: u16,
    /// Set to true when the offset exceeded MaxPossibleVolumeSize (4-byte offset wrap).
    pub is_out_of_range: bool,

    // -- slow-read / streaming --
    /// When true, the read lock is acquired and released per chunk instead of held
    /// for the entire read, reducing write latency at the cost of higher read P99.
    pub has_slow_read: bool,
    /// Buffer size for chunked streaming reads (used with `has_slow_read`).
    pub read_buffer_size: i32,
}

impl Default for ReadOption {
    fn default() -> Self {
        ReadOption {
            read_deleted: false,
            attempt_meta_only: false,
            must_meta_only: false,
            is_meta_only: false,
            volume_revision: 0,
            is_out_of_range: false,
            has_slow_read: false,
            read_buffer_size: 0,
        }
    }
}

// ============================================================================
// NeedleMapEntry helpers (for .idx file)
// ============================================================================

/// Parse a single .idx file entry (NEEDLE_MAP_ENTRY_SIZE bytes: 17 in 5-byte
/// offset mode, 16 otherwise) into (NeedleId, Offset, Size).
pub fn idx_entry_from_bytes(bytes: &[u8]) -> (NeedleId, Offset, Size) {
    assert!(bytes.len() >= NEEDLE_MAP_ENTRY_SIZE);
    let key = NeedleId::from_bytes(&bytes[..NEEDLE_ID_SIZE]);
    let offset = Offset::from_bytes(&bytes[NEEDLE_ID_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE]);
    let size = Size::from_bytes(
        &bytes[NEEDLE_ID_SIZE + OFFSET_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE + SIZE_SIZE],
    );
    (key, offset, size)
}

/// Write a single .idx file entry (NEEDLE_MAP_ENTRY_SIZE bytes).
pub fn idx_entry_to_bytes(bytes: &mut [u8], key: NeedleId, offset: Offset, size: Size) {
    assert!(bytes.len() >= NEEDLE_MAP_ENTRY_SIZE);
    key.to_bytes(&mut bytes[..NEEDLE_ID_SIZE]);
    offset.to_bytes(&mut bytes[NEEDLE_ID_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE]);
    size.to_bytes(
        &mut bytes[NEEDLE_ID_SIZE + OFFSET_SIZE..NEEDLE_ID_SIZE + OFFSET_SIZE + SIZE_SIZE],
    );
}

// ============================================================================
// Tests
// ============================================================================

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_needle_id_round_trip() {
        let id = NeedleId(0x123456789abcdef0);
        let mut buf = [0u8; 8];
        id.to_bytes(&mut buf);
        let id2 = NeedleId::from_bytes(&buf);
        assert_eq!(id, id2);
    }

    #[test]
    fn test_needle_id_display() {
        let id = NeedleId(255);
        assert_eq!(id.to_string(), "ff");
    }

    #[test]
    fn test_needle_id_parse() {
        let id = NeedleId::parse("ff").unwrap();
        assert_eq!(id, NeedleId(255));
    }

    #[test]
    fn test_cookie_round_trip() {
        let cookie = Cookie(0xdeadbeef);
        let mut buf = [0u8; 4];
        cookie.to_bytes(&mut buf);
        let cookie2 = Cookie::from_bytes(&buf);
        assert_eq!(cookie, cookie2);
    }

    #[test]
    fn test_size_semantics() {
        assert!(Size(100).is_valid());
        assert!(!Size(100).is_deleted());
        assert!(!Size(100).is_tombstone());
        assert_eq!(Size(100).raw(), 100);

        assert!(Size(-50).is_deleted());
        assert!(!Size(-50).is_tombstone());
        assert_eq!(Size(-50).raw(), 50);

        assert!(TOMBSTONE_FILE_SIZE.is_deleted());
        assert!(TOMBSTONE_FILE_SIZE.is_tombstone());
        assert_eq!(TOMBSTONE_FILE_SIZE.raw(), 0);

        assert!(!Size(0).is_valid());
        assert!(!Size(0).is_deleted());
    }

    #[test]
    fn test_size_round_trip() {
        let size = Size(12345);
        let mut buf = [0u8; 4];
        size.to_bytes(&mut buf);
        let size2 = Size::from_bytes(&buf);
        assert_eq!(size, size2);
    }

    #[test]
    fn test_size_negative_round_trip() {
        // Negative sizes round-trip through the u32 bit pattern.
        let size = Size(-50);
        let mut buf = [0u8; 4];
        size.to_bytes(&mut buf);
        let size2 = Size::from_bytes(&buf);
        assert_eq!(size, size2);
    }

    #[test]
    fn test_offset_round_trip() {
        // Test with a known actual offset (must be a multiple of 8).
        let actual_offset: i64 = 8 * 1000000;
        let offset = Offset::from_actual_offset(actual_offset);
        assert_eq!(offset.to_actual_offset(), actual_offset);

        // Test byte serialization.
        let mut buf = [0u8; 5];
        offset.to_bytes(&mut buf);
        let offset2 = Offset::from_bytes(&buf);
        assert_eq!(offset.to_actual_offset(), offset2.to_actual_offset());
    }

    #[test]
    fn test_offset_zero() {
        let offset = Offset::default();
        assert!(offset.is_zero());
        assert_eq!(offset.to_actual_offset(), 0);
    }

    #[test]
    fn test_offset_max() {
        // Max stored value depends on offset size.
        #[cfg(feature = "5bytes")]
        let max_stored: i64 = (1i64 << 40) - 1; // 5-byte max
        #[cfg(not(feature = "5bytes"))]
        let max_stored: i64 = (1i64 << 32) - 1; // 4-byte max
        let max_actual = max_stored * NEEDLE_PADDING_SIZE as i64;
        let offset = Offset::from_actual_offset(max_actual);
        assert_eq!(offset.to_actual_offset(), max_actual);
    }

    #[test]
    fn test_offset_size_constants() {
        #[cfg(feature = "5bytes")]
        {
            assert_eq!(OFFSET_SIZE, 5);
            assert_eq!(NEEDLE_MAP_ENTRY_SIZE, 17); // 8 + 5 + 4
            assert_eq!(MAX_POSSIBLE_VOLUME_SIZE, 4 * 1024 * 1024 * 1024 * 8 * 256); // 8TB
        }
        #[cfg(not(feature = "5bytes"))]
        {
            assert_eq!(OFFSET_SIZE, 4);
            assert_eq!(NEEDLE_MAP_ENTRY_SIZE, 16); // 8 + 4 + 4
            assert_eq!(MAX_POSSIBLE_VOLUME_SIZE, 4 * 1024 * 1024 * 1024 * 8); // 32GB
        }
    }

    #[test]
    fn test_idx_entry_round_trip() {
        let key = NeedleId(0xdeadbeef12345678);
        let offset = Offset::from_actual_offset(8 * 999);
        let size = Size(4096);

        let mut buf = [0u8; NEEDLE_MAP_ENTRY_SIZE];
        idx_entry_to_bytes(&mut buf, key, offset, size);

        let (key2, offset2, size2) = idx_entry_from_bytes(&buf);
        assert_eq!(key, key2);
        assert_eq!(offset.to_actual_offset(), offset2.to_actual_offset());
        assert_eq!(size, size2);
    }

    #[test]
    fn test_volume_id() {
        let vid = VolumeId::parse("42").unwrap();
        assert_eq!(vid, VolumeId(42));
        assert_eq!(vid.to_string(), "42");
        assert_eq!(vid.next(), VolumeId(43));
    }

    #[test]
    fn test_version() {
        assert!(VERSION_1.is_supported());
        assert!(VERSION_2.is_supported());
        assert!(VERSION_3.is_supported());
        assert!(!Version(0).is_supported());
        assert!(!Version(4).is_supported());
        assert_eq!(Version::current(), VERSION_3);
    }

    #[test]
    fn test_disk_type() {
        assert_eq!(DiskType::from_string(""), DiskType::HardDrive);
        assert_eq!(DiskType::from_string("hdd"), DiskType::HardDrive);
        assert_eq!(DiskType::from_string("SSD"), DiskType::Ssd);
        assert_eq!(
            DiskType::from_string("nvme"),
            DiskType::Custom("nvme".to_string())
        );
        assert_eq!(DiskType::HardDrive.readable_string(), "hdd");
        assert_eq!(DiskType::Ssd.readable_string(), "ssd");
    }

    #[test]
    fn test_read_option_default() {
        let ro = ReadOption::default();
        assert!(!ro.read_deleted);
        assert!(!ro.attempt_meta_only);
        assert!(!ro.must_meta_only);
        assert!(!ro.is_meta_only);
        assert_eq!(ro.volume_revision, 0);
        assert!(!ro.is_out_of_range);
        assert!(!ro.has_slow_read);
        assert_eq!(ro.read_buffer_size, 0);
    }

    #[test]
    fn test_read_option_custom() {
        let ro = ReadOption {
            read_deleted: true,
            attempt_meta_only: true,
            has_slow_read: true,
            read_buffer_size: 1024 * 1024,
            ..ReadOption::default()
        };
        assert!(ro.read_deleted);
        assert!(ro.attempt_meta_only);
        assert!(!ro.must_meta_only);
        assert!(!ro.is_meta_only);
        assert!(ro.has_slow_read);
        assert_eq!(ro.read_buffer_size, 1024 * 1024);
    }

    #[test]
    fn test_read_option_clone() {
        let ro = ReadOption {
            is_out_of_range: true,
            volume_revision: 42,
            ..ReadOption::default()
        };
        let ro2 = ro.clone();
        assert!(ro2.is_out_of_range);
        assert_eq!(ro2.volume_revision, 42);
    }
}
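The stored offset is the actual byte offset divided by NEEDLE_PADDING_SIZE (8), which is what stretches 4 stored bytes to ~32GB and 5 bytes to ~8TB of addressable volume. A minimal standalone sketch of that arithmetic (crate types omitted; this only illustrates the divide-by-8 scheme):

```rust
const NEEDLE_PADDING_SIZE: i64 = 8;

// Pack an 8-byte-aligned file offset into the stored (divided) form and back.
fn to_stored(actual: i64) -> i64 {
    actual / NEEDLE_PADDING_SIZE
}

fn to_actual(stored: i64) -> i64 {
    stored * NEEDLE_PADDING_SIZE
}

fn main() {
    // 4-byte stored offsets top out at (2^32 - 1) * 8, roughly 32GB.
    let max4 = ((1i64 << 32) - 1) * NEEDLE_PADDING_SIZE;
    assert_eq!(to_actual(to_stored(max4)), max4);
    // 5-byte stored offsets top out at (2^40 - 1) * 8, roughly 8TB.
    let max5 = ((1i64 << 40) - 1) * NEEDLE_PADDING_SIZE;
    assert_eq!(to_actual(to_stored(max5)), max5);
    println!("{} {}", max4, max5);
}
```

The round trip is only lossless because every needle is padded to an 8-byte boundary, so actual offsets are always multiples of 8.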
seaweed-volume/src/storage/volume.rs | 4246 (new file)
(File diff suppressed because it is too large)
seaweed-volume/src/version.rs | 79 (new file)
@@ -0,0 +1,79 @@
//! Version helpers aligned with Go's util/version package.

use std::sync::OnceLock;

#[cfg(feature = "5bytes")]
const SIZE_LIMIT: &str = "8000GB"; // Matches Go production builds (5BytesOffset)
#[cfg(not(feature = "5bytes"))]
const SIZE_LIMIT: &str = "30GB"; // Matches Go default build (!5BytesOffset)

pub fn size_limit() -> &'static str {
    SIZE_LIMIT
}

pub fn commit() -> &'static str {
    option_env!("SEAWEEDFS_COMMIT")
        .or(option_env!("GIT_COMMIT"))
        .or(option_env!("GIT_SHA"))
        .unwrap_or("")
}

pub fn version_number() -> &'static str {
    static VERSION_NUMBER: OnceLock<String> = OnceLock::new();
    VERSION_NUMBER
        .get_or_init(|| {
            parse_go_version_number().unwrap_or_else(|| env!("CARGO_PKG_VERSION").to_string())
        })
        .as_str()
}

pub fn version() -> &'static str {
    static VERSION: OnceLock<String> = OnceLock::new();
    VERSION
        .get_or_init(|| format!("{} {}", size_limit(), version_number()))
        .as_str()
}

pub fn full_version() -> &'static str {
    static FULL: OnceLock<String> = OnceLock::new();
    FULL.get_or_init(|| format!("{} {}", version(), commit()))
        .as_str()
}

pub fn server_header() -> &'static str {
    static HEADER: OnceLock<String> = OnceLock::new();
    HEADER
        .get_or_init(|| format!("SeaweedFS Volume {}", version()))
        .as_str()
}

fn parse_go_version_number() -> Option<String> {
    let src = include_str!(concat!(
        env!("CARGO_MANIFEST_DIR"),
        "/../weed/util/version/constants.go"
    ));
    let mut major: Option<u32> = None;
    let mut minor: Option<u32> = None;
    for line in src.lines() {
        let l = line.trim();
        if l.starts_with("MAJOR_VERSION") {
            major = parse_int32_line(l);
        } else if l.starts_with("MINOR_VERSION") {
            minor = parse_int32_line(l);
        }
        if major.is_some() && minor.is_some() {
            break;
        }
    }
    match (major, minor) {
        (Some(maj), Some(min)) => Some(format!("{}.{:02}", maj, min)),
        _ => None,
    }
}

fn parse_int32_line(line: &str) -> Option<u32> {
    let start = line.find("int32(")? + "int32(".len();
    let rest = &line[start..];
    let end = rest.find(')')?;
    rest[..end].trim().parse::<u32>().ok()
}
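The `parse_int32_line` helper above pulls the integer literal out of Go declarations of the form `NAME = int32(N)`. A standalone copy, exercised on sample lines (the sample constant text is made up for illustration; the real input is weed/util/version/constants.go):

```rust
// Extract N from a Go line containing `int32(N)`, or None if absent.
fn parse_int32_line(line: &str) -> Option<u32> {
    let start = line.find("int32(")? + "int32(".len();
    let rest = &line[start..];
    let end = rest.find(')')?;
    rest[..end].trim().parse::<u32>().ok()
}

fn main() {
    // Hypothetical lines in the style of the Go constants file.
    assert_eq!(parse_int32_line("MAJOR_VERSION = int32(3)"), Some(3));
    assert_eq!(parse_int32_line("MINOR_VERSION = int32(97)"), Some(97));
    assert_eq!(parse_int32_line("no number here"), None);
    // The minor version is zero-padded to two digits, so 3 and 9 render as "3.09".
    println!("{}.{:02}", 3u32, 9u32); // prints "3.09"
}
```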
seaweed-volume/tests/http_integration.rs | 677 (new file)
@@ -0,0 +1,677 @@
//! Integration tests for the volume server HTTP handlers.
//!
//! Uses axum's Router with tower::ServiceExt::oneshot to test
//! end-to-end without starting a real TCP server.

use std::sync::{Arc, RwLock};

use axum::body::Body;
use axum::extract::connect_info::ConnectInfo;
use axum::http::{Request, StatusCode};
use tower::ServiceExt; // for `oneshot`

use seaweed_volume::security::{Guard, SigningKey};
use seaweed_volume::server::volume_server::{
    build_admin_router, build_admin_router_with_ui, build_metrics_router, build_public_router,
    VolumeServerState,
};
use seaweed_volume::storage::needle_map::NeedleMapKind;
use seaweed_volume::storage::store::Store;
use seaweed_volume::storage::types::{DiskType, Version, VolumeId};

use tempfile::TempDir;

/// Create a test VolumeServerState with a temp directory, a single disk
/// location, and one pre-created volume (VolumeId 1).
fn test_state() -> (Arc<VolumeServerState>, TempDir) {
    test_state_with_guard(Vec::new(), Vec::new())
}

fn test_state_with_signing_key(signing_key: Vec<u8>) -> (Arc<VolumeServerState>, TempDir) {
    test_state_with_guard(Vec::new(), signing_key)
}

fn test_state_with_whitelist(whitelist: Vec<String>) -> (Arc<VolumeServerState>, TempDir) {
    test_state_with_guard(whitelist, Vec::new())
}

fn test_state_with_guard(
    whitelist: Vec<String>,
    signing_key: Vec<u8>,
) -> (Arc<VolumeServerState>, TempDir) {
    let tmp = TempDir::new().expect("failed to create temp dir");
    let dir = tmp.path().to_str().unwrap();

    let mut store = Store::new(NeedleMapKind::InMemory);
    store
        .add_location(
            dir,
            dir,
            10,
            DiskType::HardDrive,
            seaweed_volume::config::MinFreeSpace::Percent(1.0),
            Vec::new(),
        )
        .expect("failed to add location");
    store
        .add_volume(
            VolumeId(1),
            "",
            None,
            None,
            0,
            DiskType::HardDrive,
            Version::current(),
        )
        .expect("failed to create volume");

    let guard = Guard::new(
        &whitelist,
        SigningKey(signing_key),
        0,
        SigningKey(vec![]),
        0,
    );
    let state = Arc::new(VolumeServerState {
        store: RwLock::new(store),
        guard: RwLock::new(guard),
        is_stopping: RwLock::new(false),
        maintenance: std::sync::atomic::AtomicBool::new(false),
        state_version: std::sync::atomic::AtomicU32::new(0),
        concurrent_upload_limit: 0,
        concurrent_download_limit: 0,
        inflight_upload_data_timeout: std::time::Duration::from_secs(60),
        inflight_download_data_timeout: std::time::Duration::from_secs(60),
        inflight_upload_bytes: std::sync::atomic::AtomicI64::new(0),
        inflight_download_bytes: std::sync::atomic::AtomicI64::new(0),
        upload_notify: tokio::sync::Notify::new(),
        download_notify: tokio::sync::Notify::new(),
        data_center: String::new(),
        rack: String::new(),
        file_size_limit_bytes: 0,
        maintenance_byte_per_second: 0,
        is_heartbeating: std::sync::atomic::AtomicBool::new(true),
        has_master: false,
        pre_stop_seconds: 0,
        volume_state_notify: tokio::sync::Notify::new(),
        write_queue: std::sync::OnceLock::new(),
        s3_tier_registry: std::sync::RwLock::new(
            seaweed_volume::remote_storage::s3_tier::S3TierRegistry::new(),
        ),
        read_mode: seaweed_volume::config::ReadMode::Local,
        master_url: String::new(),
        master_urls: Vec::new(),
        self_url: String::new(),
        http_client: reqwest::Client::new(),
        outgoing_http_scheme: "http".to_string(),
        outgoing_grpc_tls: None,
        metrics_runtime: std::sync::RwLock::new(
            seaweed_volume::server::volume_server::RuntimeMetricsConfig::default(),
        ),
        metrics_notify: tokio::sync::Notify::new(),
        fix_jpg_orientation: false,
        has_slow_read: false,
        read_buffer_size_bytes: 1024 * 1024,
        security_file: String::new(),
        cli_white_list: vec![],
        state_file_path: String::new(),
    });
    (state, tmp)
}

/// Helper: read the entire response body as bytes.
async fn body_bytes(response: axum::response::Response) -> Vec<u8> {
    let body = response.into_body();
    axum::body::to_bytes(body, usize::MAX)
        .await
        .expect("failed to read body")
        .to_vec()
}

fn with_remote_addr(request: Request<Body>, remote_addr: &str) -> Request<Body> {
    let mut request = request;
    let remote_addr = remote_addr
        .parse::<std::net::SocketAddr>()
        .expect("invalid socket address");
    request.extensions_mut().insert(ConnectInfo(remote_addr));
    request
}

// ============================================================================
// 1. GET /healthz returns 200 when server is running
// ============================================================================

#[tokio::test]
async fn healthz_returns_200_when_running() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/healthz")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK);
}

// ============================================================================
// 2. GET /healthz returns 503 when is_stopping=true
// ============================================================================

#[tokio::test]
async fn healthz_returns_503_when_stopping() {
    let (state, _tmp) = test_state();
    *state.is_stopping.write().unwrap() = true;
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/healthz")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::SERVICE_UNAVAILABLE);
}

// ============================================================================
// 3. GET /status returns JSON with version and volumes array
// ============================================================================

#[tokio::test]
async fn status_returns_json_with_version_and_volumes() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/status")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK);

    let body = body_bytes(response).await;
    let json: serde_json::Value =
        serde_json::from_slice(&body).expect("response is not valid JSON");

    assert!(json.get("Version").is_some(), "missing 'Version' field");
    assert!(json["Version"].is_string(), "'Version' should be a string");

    assert!(json.get("Volumes").is_some(), "missing 'Volumes' field");
    assert!(json["Volumes"].is_array(), "'Volumes' should be an array");

    // We created one volume in test_state, so the array should have one entry
    let volumes = json["Volumes"].as_array().unwrap();
    assert_eq!(volumes.len(), 1, "expected 1 volume");
    assert_eq!(volumes[0]["Id"], 1);
}

#[tokio::test]
async fn admin_router_does_not_expose_metrics() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/metrics")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

#[tokio::test]
async fn metrics_router_serves_metrics() {
    let app = build_metrics_router();

    let response = app
        .oneshot(
            Request::builder()
                .uri("/metrics")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK);
}

#[tokio::test]
async fn admin_router_rejects_non_whitelisted_uploads() {
    let (state, _tmp) = test_state_with_whitelist(vec!["127.0.0.1".to_string()]);
    let app = build_admin_router(state);

    let response = app
        .oneshot(with_remote_addr(
            Request::builder()
                .method("POST")
                .uri("/1,000000000000000001")
                .body(Body::from("blocked"))
                .unwrap(),
            "10.0.0.9:12345",
        ))
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::UNAUTHORIZED);
}

#[tokio::test]
async fn admin_router_rejects_non_whitelisted_deletes() {
    let (state, _tmp) = test_state_with_whitelist(vec!["127.0.0.1".to_string()]);
    let app = build_admin_router(state);

    let response = app
        .oneshot(with_remote_addr(
            Request::builder()
                .method("DELETE")
                .uri("/1,000000000000000001")
                .body(Body::empty())
                .unwrap(),
            "10.0.0.9:12345",
        ))
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::UNAUTHORIZED);
}

// Go's volume_server.go has /stats/* endpoints commented out (L130-134).
// Requests to /stats/counter fall through to the store handler which returns 400.
#[tokio::test]
async fn admin_router_does_not_expose_stats_routes() {
    let (state, _tmp) = test_state_with_whitelist(vec!["127.0.0.1".to_string()]);
    let app = build_admin_router_with_ui(state, true);

    let response = app
        .oneshot(with_remote_addr(
            Request::builder()
                .uri("/stats/counter")
                .body(Body::empty())
                .unwrap(),
            "127.0.0.1:12345",
        ))
        .await
        .unwrap();

    // Falls through to store handler → 400 (bad volume id)
    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

// ============================================================================
// 4. POST writes data, then GET reads it back
// ============================================================================

#[tokio::test]
async fn write_then_read_needle() {
    let (state, _tmp) = test_state();

    // The fid "01637037d6" encodes NeedleId=0x01, Cookie=0x637037d6
    let uri = "/1,01637037d6";
    let payload = b"hello, seaweedfs!";

    // --- POST (write) ---
    let app = build_admin_router(state.clone());
    let response = app
        .oneshot(
            Request::builder()
                .method("POST")
                .uri(uri)
                .body(Body::from(payload.to_vec()))
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(
        response.status(),
        StatusCode::CREATED,
        "POST should return 201 Created"
    );

    let body = body_bytes(response).await;
    let json: serde_json::Value =
        serde_json::from_slice(&body).expect("POST response is not valid JSON");
    assert_eq!(json["size"], payload.len() as u64);

    // --- GET (read back) ---
    let app = build_admin_router(state.clone());
    let response = app
        .oneshot(Request::builder().uri(uri).body(Body::empty()).unwrap())
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK, "GET should return 200");

    let body = body_bytes(response).await;
    assert_eq!(body, payload, "GET body should match written data");
}

// ============================================================================
// 5. DELETE deletes a needle, subsequent GET returns 404
// ============================================================================

#[tokio::test]
async fn delete_then_get_returns_404() {
    let (state, _tmp) = test_state();
    let uri = "/1,01637037d6";
    let payload = b"to be deleted";

    // Write the needle first
    let app = build_admin_router(state.clone());
    let response = app
        .oneshot(
            Request::builder()
                .method("POST")
                .uri(uri)
                .body(Body::from(payload.to_vec()))
                .unwrap(),
        )
        .await
        .unwrap();
    assert_eq!(response.status(), StatusCode::CREATED);

    // Delete
    let app = build_admin_router(state.clone());
    let response = app
        .oneshot(
            Request::builder()
                .method("DELETE")
                .uri(uri)
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();
    assert_eq!(
        response.status(),
        StatusCode::ACCEPTED,
        "DELETE should return 202 Accepted"
    );

    // GET should now return 404
    let app = build_admin_router(state.clone());
    let response = app
        .oneshot(Request::builder().uri(uri).body(Body::empty()).unwrap())
        .await
        .unwrap();
    assert_eq!(
        response.status(),
        StatusCode::NOT_FOUND,
        "GET after DELETE should return 404"
    );
}

// ============================================================================
// 6. HEAD returns headers without body
// ============================================================================

#[tokio::test]
async fn head_returns_headers_without_body() {
    let (state, _tmp) = test_state();
    let uri = "/1,01637037d6";
    let payload = b"head test data";

    // Write needle
    let app = build_admin_router(state.clone());
    let response = app
        .oneshot(
            Request::builder()
                .method("POST")
                .uri(uri)
                .body(Body::from(payload.to_vec()))
                .unwrap(),
        )
        .await
        .unwrap();
    assert_eq!(response.status(), StatusCode::CREATED);

    // HEAD
    let app = build_admin_router(state.clone());
    let response = app
        .oneshot(
            Request::builder()
                .method("HEAD")
                .uri(uri)
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK, "HEAD should return 200");

    // Content-Length header should be present
    let content_length = response
        .headers()
        .get("content-length")
        .expect("HEAD should include Content-Length header");
    let len: usize = content_length
        .to_str()
        .unwrap()
        .parse()
        .expect("Content-Length should be a number");
    assert_eq!(
        len,
        payload.len(),
        "Content-Length should match payload size"
    );

    // Body should be empty for HEAD
    let body = body_bytes(response).await;
    assert!(body.is_empty(), "HEAD body should be empty");
}

// ============================================================================
// 7. Invalid URL path returns 400
// ============================================================================

#[tokio::test]
async fn invalid_url_path_returns_400() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    // "invalidpath" has no comma or slash separator so parse_url_path returns None
    let response = app
        .oneshot(
            Request::builder()
                .uri("/invalidpath")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(
        response.status(),
        StatusCode::BAD_REQUEST,
        "invalid URL path should return 400"
    );
}

#[tokio::test]
async fn deep_invalid_url_path_returns_400() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/not/a/valid/volume/path")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

#[tokio::test]
async fn admin_root_get_returns_400() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    let response = app
        .oneshot(Request::builder().uri("/").body(Body::empty()).unwrap())
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

#[tokio::test]
async fn public_root_get_returns_400() {
    let (state, _tmp) = test_state();
    let app = build_public_router(state);

    let response = app
        .oneshot(Request::builder().uri("/").body(Body::empty()).unwrap())
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

#[tokio::test]
async fn public_router_does_not_expose_healthz() {
    let (state, _tmp) = test_state();
    let app = build_public_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/healthz")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

// Go's volume_server.go has /stats/* endpoints commented out (L130-134).
#[tokio::test]
async fn admin_router_stats_routes_not_registered() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/stats/counter")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    // Falls through to store handler → 400 (bad volume id)
    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

#[tokio::test]
async fn admin_router_hides_ui_when_write_jwt_is_configured() {
    let (state, _tmp) = test_state_with_signing_key(b"secret".to_vec());
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/ui/index.html")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::BAD_REQUEST);
}

#[tokio::test]
async fn admin_router_can_expose_ui_with_explicit_override() {
    let (state, _tmp) = test_state_with_signing_key(b"secret".to_vec());
    let app = build_admin_router_with_ui(state, true);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/ui/index.html")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK);
    let body = body_bytes(response).await;
    let html = String::from_utf8(body).unwrap();
    assert!(html.contains("Disk Stats"));
    assert!(html.contains("System Stats"));
    assert!(html.contains("Volumes"));
}

#[tokio::test]
async fn admin_router_ui_override_ignores_read_jwt_checks() {
    let (state, _tmp) = test_state_with_signing_key(b"write-secret".to_vec());
    state.guard.write().unwrap().read_signing_key = SigningKey(b"read-secret".to_vec());
    let app = build_admin_router_with_ui(state, true);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/ui/index.html")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK);
}

#[tokio::test]
async fn admin_router_serves_volume_ui_static_assets() {
    let (state, _tmp) = test_state();
    let app = build_admin_router(state);

    let response = app
        .oneshot(
            Request::builder()
                .uri("/seaweedfsstatic/bootstrap/3.3.1/css/bootstrap.min.css")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::OK);
    assert_eq!(
        response
            .headers()
            .get("content-type")
            .and_then(|value| value.to_str().ok()),
        Some("text/css; charset=utf-8")
    );
    let body = body_bytes(response).await;
    assert!(body.len() > 1000);
}
seaweed-volume/tools/generate_go_volume_docs.go (new file, 1172 lines)
File diff suppressed because it is too large

seaweed-volume/vendor/reed-solomon-erasure/.cargo-ok (vendored, new file, 1 line)
@@ -0,0 +1 @@
{"v":1}

seaweed-volume/vendor/reed-solomon-erasure/.cargo_vcs_info.json (vendored, new file, 6 lines)
@@ -0,0 +1,6 @@
{
  "git": {
    "sha1": "a1ca49de5384445b68ade7d72f31f0379c199943"
  },
  "path_in_vcs": ""
}

seaweed-volume/vendor/reed-solomon-erasure/.gitattributes (vendored, new file, 3 lines)
@@ -0,0 +1,3 @@
BackBlaze_JavaReedSolomon/* linguist-vendored
KlausPost_reedsolomon/* linguist-vendored
NicolasT_reedsolomon/* linguist-vendored

seaweed-volume/vendor/reed-solomon-erasure/.gitignore (vendored, new file, 2 lines)
@@ -0,0 +1,2 @@
/target/
/Cargo.lock
seaweed-volume/vendor/reed-solomon-erasure/CHANGELOG.md (vendored, new file, 181 lines)
@@ -0,0 +1,181 @@
## 6.0.0
- Use LruCache instead of InversionTree for caching data decode matrices
  - See [PR #104](https://github.com/rust-rse/reed-solomon-erasure/pull/104)
- Minor code duplication
  - See [PR #102](https://github.com/rust-rse/reed-solomon-erasure/pull/102)
- Dependencies update
  - Updated `smallvec` from `0.6.1` to `1.8.0`

## 5.0.3
- Fixed cross build bug for aarch64 with simd-accel
  - See [PR #100](https://github.com/rust-rse/reed-solomon-erasure/pull/100)

## 5.0.2
* Add support for `RUST_REED_SOLOMON_ERASURE_ARCH` environment variable and stop using `native` architecture for SIMD code
  - See [PR #98](https://github.com/rust-rse/reed-solomon-erasure/pull/98)

## 5.0.1
- The `simd-accel` feature now builds on M1 Macs
  - See [PR #92](https://github.com/rust-rse/reed-solomon-erasure/pull/92)
- Minor code cleanup

## 5.0.0
- Merged several PRs
  - Not fully reviewed as I am no longer maintaining this crate

## 4.0.2
- Updated build.rs to respect RUSTFLAGS's target-cpu if available
  - See [PR #75](https://github.com/darrenldl/reed-solomon-erasure/pull/75)
- Added AVX512 support
  - See [PR #69](https://github.com/darrenldl/reed-solomon-erasure/pull/69)
- Disabled SIMD acceleration when MSVC is being used to build the library
  - See [PR #67](https://github.com/darrenldl/reed-solomon-erasure/pull/67)
- Dependencies update
  - Updated `smallvec` from `0.6` to `1.2`

## 4.0.1
- Updated SIMD C code for Windows compatibility
  - Removed include of `unistd.h` in `simd_c/reedsolomon.c`
  - Removed GCC `nonnull` attribute in `simd_c/reedsolomon.h`
  - See PR [#63](https://github.com/darrenldl/reed-solomon-erasure/pull/63) [#64](https://github.com/darrenldl/reed-solomon-erasure/pull/64) for details
- Replaced use of `libc::uint8_t` in `src/galois_8.rs` with `u8`

## 4.0.0
- Major API restructure: removed `Shard` type in favor of generic functions
- The logic of this crate is now generic over choice of finite field
- The SIMD acceleration feature for GF(2^8) is now activated with the `simd-accel` Cargo feature. Pure-rust behavior is default.
- Ran rustfmt
- Adds a GF(2^16) implementation

## 3.1.2 (not published)
- Doc fix
  - Added space before parantheses in code comments and documentation
- Disabled SIMD C code for Android and iOS targets entirely

## 3.1.1
- Fixed `Matrix::augment`
  - The error checking code was incorrect
  - Since this method is used in internal code only, and the only use case is a correct use case, the error did not lead to any bugs
- Fixed benchmark data
  - Previously used MB=10^6 bytes while I should have used MB=2^20 bytes
  - Table in README has been updated accordingly
  - The `>= 2.1.0` data is obtained by measuring again with the corrected `rse-benchmark` code
  - The `2.0.X` and `1.X.X` data are simply adjusted by mutiplying `10^6` then dividing by `2^20`
- Dependencies update
  - Updated `rand` from `0.4` to `0.5.4`
- Added special handling in `build.rs` for CC options on Android and iOS
  - `-march=native` is not available for GCC on Android, see issue #23

## 3.1.0
- Impl'd `std::error::Error` for `reed_solomon_erasure::Error` and `reed_solomon_erasure::SBSError`
  - See issue [#17](https://github.com/darrenldl/reed-solomon-erasure/issues/17), suggested by [DrPeterVanNostrand](https://github.com/DrPeterVanNostrand)
- Added fuzzing suite
  - No code changes due to this as no bugs were found
- Upgraded InversionTree QuickCheck test
  - No code changes due to this as no bugs were found
- Upgraded test suite for main codec methods (e.g. encode, reconstruct)
  - A lot of heavy QuickCheck tests were added
  - No code changes due to this as no bugs were found
- Upgraded test suite for ShardByShard methods
  - A lot of heavy QuickCheck tests were added
  - No code changes due to this as no bugs were found
- Minor code refactoring in `reconstruct_internal` method
  - This means `reconstruct` and related methods are slightly more optimized

## 3.0.3
- Added QuickCheck tests to the test suite
  - InversionTree is heavily tested now
  - No code changes as no bugs were found
- Deps update
  - Updated rayon from 0.9 to 1.0

## 3.0.2
- Same as 3.0.1, but 3.0.1 had unapplied changes

## 3.0.1 (yanked)
- Updated doc for `with_buffer` variants of verifying methods
  - Stated explicitly that the buffer contains the correct parity shards after a successful call
- Added tests for the above statement

## 3.0.0
- Added `with_buffer` variants for verifying methods
  - This gives user the option of reducing heap allocation(s)
- Core code clean up, improvements, and review, added more AUDIT comments
- Improved shard utils
- Added code to remove leftover parity shards in `reconstruct_data_shards`
  - This means one fewer gotcha of using the methods
- `ShardByShard` code review and overhaul
- `InversionTree` code review and improvements

## 2.4.0
- Added more flexibility for `convert_2D_slices` macro
  - Now accepts expressions rather than just identifiers
  - The change requires change of syntax

## 2.3.3
- Replaced all slice splitting functions in `misc_utils` with std lib ones or rayon ones
  - This means there are fewer heap allocations in general

## 2.3.2
- Made `==`(`eq`) for `ReedSolomon` more reasonable
  - Previously `==` would compare
    - data shard count
    - parity shard count
    - total shard count
    - internal encoding matrix
    - internal `ParallelParam`
  - Now it only compares
    - data shard count
    - parity shard count

## 2.3.1
- Added info on encoding behaviour to doc

## 2.3.0
- Made Reed-Solomon codec creation methods return error instead of panic when shard numbers are not correct

## 2.2.0
- Fixed SBS error checking code
- Documentation fixes and polishing
- Renamed `Error::InvalidShardsIndicator` to `Error::InvalidShardFlags`
- Added more details to documentation on error handling
- Error handling code overhaul and checks for all method variants
- Dead commented out code cleanup and indent fix

## 2.1.0
- Added Nicolas's SIMD C code files, gaining major speedup on supported CPUs
- Added support for "shard by shard" encoding, allowing easier streamed encoding
- Added functions for shard by shard encoding

## 2.0.0
- Complete rewrite of most code following Klaus Post's design
- Added optimsations (parallelism, loop unrolling)
- 4-5x faster than `1.X.X`

## 1.1.1
- Documentation polish
- Added documentation badge to README
- Optimised internal matrix related operations
  - This largely means `decode_missing` is faster

## 1.1.0
- Added more helper functions
- Added more tests

## 1.0.1
- Added more tests
- Fixed decode_missing
  - Previously may reconstruct the missing shards with incorrect length

## 1.0.0
- Added more tests
- Added integration with Codecov (via kcov)
- Code refactoring
- Added integration with Coveralls (via kcov)

## 0.9.1
- Code restructuring
- Added documentation

## 0.9.0
- Base version
seaweed-volume/vendor/reed-solomon-erasure/Cargo.toml (vendored, new file, 87 lines)
@@ -0,0 +1,87 @@
# THIS FILE IS AUTOMATICALLY GENERATED BY CARGO
#
# When uploading crates to the registry Cargo will automatically
# "normalize" Cargo.toml files for maximal compatibility
# with all versions of Cargo and also rewrite `path` dependencies
# to registry (e.g., crates.io) dependencies.
#
# If you are reading this file be aware that the original Cargo.toml
# will likely look very different (and much more reasonable).
# See Cargo.toml.orig for the original contents.

[package]
edition = "2018"
name = "reed-solomon-erasure"
version = "6.0.0"
authors = ["Darren Ldl <darrenldldev@gmail.com>"]
build = "build.rs"
exclude = [
    "appveyor.yml",
    ".travis.yml",
]
description = "Rust implementation of Reed-Solomon erasure coding"
homepage = "https://github.com/darrenldl/reed-solomon-erasure"
documentation = "https://docs.rs/reed-solomon-erasure"
readme = "README.md"
keywords = [
    "reed-solomon",
    "erasure",
]
categories = ["encoding"]
license = "MIT"
repository = "https://github.com/darrenldl/reed-solomon-erasure"

[[bench]]
name = "reconstruct"

[dependencies.libc]
version = "0.2"
optional = true

[dependencies.libm]
version = "0.2.1"

[dependencies.lru]
version = "0.16.3"

[dependencies.parking_lot]
version = "0.11.2"
optional = true

[dependencies.smallvec]
version = "1.2"

[dependencies.spin]
version = "0.9.2"
features = ["spin_mutex"]
default-features = false

[dev-dependencies.quickcheck]
version = "0.9"

[dev-dependencies.rand]
version = "0.7.2"

[build-dependencies.cc]
version = "1.0"
optional = true

[features]
default = ["std"]
simd-accel = [
    "cc",
    "libc",
]
std = ["parking_lot"]

[badges.appveyor]
repository = "darrenldl/reed-solomon-erasure"

[badges.codecov]
repository = "darrenldl/reed-solomon-erasure"

[badges.coveralls]
repository = "darrenldl/reed-solomon-erasure"

[badges.travis-ci]
repository = "darrenldl/reed-solomon-erasure"
56  seaweed-volume/vendor/reed-solomon-erasure/Cargo.toml.orig  (generated, vendored, new file)
@@ -0,0 +1,56 @@
[package]
name = "reed-solomon-erasure"
version = "6.0.0"
authors = ["Darren Ldl <darrenldldev@gmail.com>"]
edition = "2018"
build = "build.rs"
exclude = [
    "appveyor.yml",
    ".travis.yml"
]

description = "Rust implementation of Reed-Solomon erasure coding"

documentation = "https://docs.rs/reed-solomon-erasure"
homepage = "https://github.com/darrenldl/reed-solomon-erasure"
repository = "https://github.com/darrenldl/reed-solomon-erasure"

readme = "README.md"

keywords = ["reed-solomon", "erasure"]

categories = ["encoding"]

license = "MIT"

[features]
default = ["std"] # simd off by default
std = ["parking_lot"]
simd-accel = ["cc", "libc"]

[badges]
travis-ci = { repository = "darrenldl/reed-solomon-erasure" }
appveyor = { repository = "darrenldl/reed-solomon-erasure" }
codecov = { repository = "darrenldl/reed-solomon-erasure" }
coveralls = { repository = "darrenldl/reed-solomon-erasure" }

[dependencies]
libc = { version = "0.2", optional = true }
# `log2()` impl for `no_std`
libm = "0.2.1"
lru = "0.16.3"
# Efficient `Mutex` implementation for `std` environment
parking_lot = { version = "0.11.2", optional = true }
smallvec = "1.2"
# `Mutex` implementation for `no_std` environment with the same high-level API as `parking_lot`
spin = { version = "0.9.2", default-features = false, features = ["spin_mutex"] }

[dev-dependencies]
rand = "0.7.2"
quickcheck = "0.9"

[build-dependencies]
cc = { version = "1.0", optional = true }

[[bench]]
name = "reconstruct"
24  seaweed-volume/vendor/reed-solomon-erasure/LICENSE  (vendored, new file)
@@ -0,0 +1,24 @@
MIT License

Copyright (c) 2017 Darren Ldl
Copyright (c) 2015, 2016 Nicolas Trangez
Copyright (c) 2015 Klaus Post
Copyright (c) 2015 Backblaze

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
166  seaweed-volume/vendor/reed-solomon-erasure/README.md  (vendored, new file)
@@ -0,0 +1,166 @@
# reed-solomon-erasure
[](https://travis-ci.org/darrenldl/reed-solomon-erasure)
[](https://ci.appveyor.com/project/darrenldl/reed-solomon-erasure/branch/master)
[](https://codecov.io/gh/darrenldl/reed-solomon-erasure)
[](https://coveralls.io/github/darrenldl/reed-solomon-erasure?branch=master)
[](https://crates.io/crates/reed-solomon-erasure)
[](https://docs.rs/reed-solomon-erasure)
[](https://deps.rs/repo/github/darrenldl/reed-solomon-erasure)

Rust implementation of Reed-Solomon erasure coding

WASM builds are also available, see section **WASM usage** below for details

This is a port of [BackBlaze's Java implementation](https://github.com/Backblaze/JavaReedSolomon), [Klaus Post's Go implementation](https://github.com/klauspost/reedsolomon), and [Nicolas Trangez's Haskell implementation](https://github.com/NicolasT/reedsolomon).

Version `1.X.X` copies BackBlaze's implementation, and is less performant as there were fewer places where parallelism could be added.

Version `>= 2.0.0` copies Klaus Post's implementation. The SIMD C code is copied from Nicolas Trangez's implementation with minor modifications.

See the [Notes](#notes) and [License](#license) sections for details.

## WASM usage

See [here](wasm/README.md) for details

## Rust usage
Add the following to your `Cargo.toml` for the normal version (pure Rust version)
```toml
[dependencies]
reed-solomon-erasure = "4.0"
```
or the following for the version which tries to utilise SIMD
```toml
[dependencies]
reed-solomon-erasure = { version = "4.0", features = [ "simd-accel" ] }
```
and the following to your crate root
```rust
extern crate reed_solomon_erasure;
```

NOTE: `simd-accel` is tuned for Haswell+ processors on x86-64 and not in any way for other architectures. Set the
environment variable `RUST_REED_SOLOMON_ERASURE_ARCH` during build to force compilation of the C code for a specific
architecture (the `-march` flag in GCC/Clang). Even on x86-64 you can achieve better performance by setting it to
`native`, but the binary will then stop running on older CPUs, YMMV.
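For reference, a build that enables SIMD and pins the C code's target architecture might look like the following sketch. `simd-accel` and `RUST_REED_SOLOMON_ERASURE_ARCH` are the knobs documented above; the remaining flags are standard cargo options.

```shell
# Build with the SIMD feature; build.rs defaults to -march=haswell on x86-64
cargo build --release --features simd-accel

# Or pin the C code to the host CPU (faster, but the binary may not run on older CPUs)
RUST_REED_SOLOMON_ERASURE_ARCH=native cargo build --release --features simd-accel
```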
## Example
```rust
#[macro_use(shards)]
extern crate reed_solomon_erasure;

use reed_solomon_erasure::galois_8::ReedSolomon;
// or use the following for Galois 2^16 backend
// use reed_solomon_erasure::galois_16::ReedSolomon;

fn main () {
    let r = ReedSolomon::new(3, 2).unwrap(); // 3 data shards, 2 parity shards

    let mut master_copy = shards!(
        [0, 1,  2,  3],
        [4, 5,  6,  7],
        [8, 9, 10, 11],
        [0, 0,  0,  0], // last 2 rows are parity shards
        [0, 0,  0,  0]
    );

    // Construct the parity shards
    r.encode(&mut master_copy).unwrap();

    // Make a copy and transform it into option shards arrangement
    // for feeding into reconstruct_shards
    let mut shards: Vec<_> = master_copy.iter().cloned().map(Some).collect();

    // We can remove up to 2 shards, which may be data or parity shards
    shards[0] = None;
    shards[4] = None;

    // Try to reconstruct missing shards
    r.reconstruct(&mut shards).unwrap();

    // Convert back to normal shard arrangement
    let result: Vec<_> = shards.into_iter().filter_map(|x| x).collect();

    assert!(r.verify(&result).unwrap());
    assert_eq!(master_copy, result);
}
```

## Benchmark it yourself
You can quickly test performance under different configurations (e.g. data/parity shard ratio, parallelism parameters)
by cloning this repo: https://github.com/darrenldl/rse-benchmark

`rse-benchmark` contains a copy of this library (usually a fully functional dev version), so you only need to adjust `main.rs`
then run `cargo run --release` to start the benchmark.

## Performance
Versions `1.X.X` and `2.0.0` do not utilise SIMD.

Version `2.1.0` onward uses Nicolas's C files for SIMD operations.

Machine: laptop with `Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz (max 2.70GHz) 2 Cores 4 Threads`

Below shows the result of one of the test configurations; other configurations show similar results in terms of ratio.

| Configuration | Klaus Post's | >= 2.1.0 && < 4.0.0 | 2.0.X | 1.X.X |
|---|---|---|---|---|
| 10x2x1M | ~7800MB/s | ~4500MB/s | ~1000MB/s | ~240MB/s |

Versions `>= 4.0.0` have not been benchmarked thoroughly yet.

## Changelog
[Changelog](CHANGELOG.md)

## Contributions
Contributions are welcome. Note that by submitting contributions, you agree to license your work under the same license used by this project as stated in the LICENSE file.

## Credits
#### Library overhaul and Galois 2^16 backend
Many thanks to the following people for the overhaul of the library and the introduction of the Galois 2^16 backend

- [@drskalman](https://github.com/drskalman)
- Jeff Burdges [@burdges](https://github.com/burdges)
- Robert Habermeier [@rphmeier](https://github.com/rphmeier)

#### WASM builds
Many thanks to Nazar Mokrynskyi [@nazar-pc](https://github.com/nazar-pc) for submitting his package for WASM builds

He is the original author of the files stored in the `wasm` folder. The files may have been modified by me later.

#### AVX512 support
Many thanks to [@sakridge](https://github.com/sakridge) for adding support for AVX512 (see [PR #69](https://github.com/darrenldl/reed-solomon-erasure/pull/69))

#### build.rs improvements
Many thanks to [@ryoqun](https://github.com/ryoqun) for improving the usability of the library in the context of cross-compilation (see [PR #75](https://github.com/darrenldl/reed-solomon-erasure/pull/75))

#### no_std support
Many thanks to Nazar Mokrynskyi [@nazar-pc](https://github.com/nazar-pc) for adding `no_std` support (see [PR #90](https://github.com/darrenldl/reed-solomon-erasure/pull/90))

#### Testers
Many thanks to the following people for testing and benchmarking on various platforms

- Laurențiu Nicola [@lnicola](https://github.com/lnicola/) (platforms: Linux, Intel)
- Roger Andersen [@hexjelly](https://github.com/hexjelly) (platforms: Windows, AMD)

## Notes
#### Code quality review
If you'd like to evaluate the quality of this library, you may find the audit comments helpful.

Simply search for "AUDIT" to see the dev notes that are aimed at facilitating code reviews.

#### Implementation notes
The `1.X.X` implementation mostly copies [BackBlaze's Java implementation](https://github.com/Backblaze/JavaReedSolomon).

`2.0.0` onward mostly copies [Klaus Post's Go implementation](https://github.com/klauspost/reedsolomon), and copies C files from [Nicolas Trangez's Haskell implementation](https://github.com/NicolasT/reedsolomon).

The test suite for all versions uses [Klaus Post's Go implementation](https://github.com/klauspost/reedsolomon) as its basis.

## License
#### Nicolas Trangez's Haskell Reed-Solomon implementation
The C files for SIMD operations are copied (with no/minor modifications) from [Nicolas Trangez's Haskell implementation](https://github.com/NicolasT/reedsolomon), and are under the same MIT License as used by NicolasT's project

#### TL;DR
All files are released under the MIT License
108  seaweed-volume/vendor/reed-solomon-erasure/benches/reconstruct.rs  (vendored, new file)
@@ -0,0 +1,108 @@
#![feature(test)]

extern crate test;

use {
    rand::{prelude::*, Rng},
    reed_solomon_erasure::galois_8::Field,
    test::Bencher,
};

type ReedSolomon = reed_solomon_erasure::ReedSolomon<Field>;

const SHARD_SIZE: usize = 1024;

fn run_reconstruct_bench(bencher: &mut Bencher, num_data_shards: usize, num_parity_shards: usize) {
    let mut rng = rand::thread_rng();
    let mut shards = vec![vec![0u8; SHARD_SIZE]; num_data_shards + num_parity_shards];
    for shard in &mut shards[..num_data_shards] {
        rng.fill(&mut shard[..]);
    }
    let reed_solomon = ReedSolomon::new(num_data_shards, num_parity_shards).unwrap();
    reed_solomon.encode(&mut shards[..]).unwrap();
    let shards: Vec<_> = shards.into_iter().map(Some).collect();

    bencher.iter(|| {
        let mut shards = shards.clone();
        for _ in 0..num_parity_shards {
            *shards.choose_mut(&mut rng).unwrap() = None;
        }
        reed_solomon.reconstruct(&mut shards[..]).unwrap();
        assert!(shards.iter().all(Option::is_some));
    });
}

#[bench]
fn bench_reconstruct_2_2(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 2, 2)
}

#[bench]
fn bench_reconstruct_4_2(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 4, 2)
}

#[bench]
fn bench_reconstruct_4_4(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 4, 4)
}

#[bench]
fn bench_reconstruct_8_2(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 8, 2)
}

#[bench]
fn bench_reconstruct_8_4(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 8, 4)
}

#[bench]
fn bench_reconstruct_8_8(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 8, 8)
}

#[bench]
fn bench_reconstruct_16_2(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 16, 2)
}

#[bench]
fn bench_reconstruct_16_4(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 16, 4)
}

#[bench]
fn bench_reconstruct_16_8(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 16, 8)
}

#[bench]
fn bench_reconstruct_16_16(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 16, 16)
}

#[bench]
fn bench_reconstruct_32_2(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 32, 2)
}

#[bench]
fn bench_reconstruct_32_4(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 32, 4)
}

#[bench]
fn bench_reconstruct_32_8(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 32, 8)
}

#[bench]
fn bench_reconstruct_32_16(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 32, 16)
}

#[bench]
fn bench_reconstruct_32_32(bencher: &mut Bencher) {
    run_reconstruct_bench(bencher, 32, 32)
}
196  seaweed-volume/vendor/reed-solomon-erasure/build.rs  (vendored, new file)
@@ -0,0 +1,196 @@
use std::env;
use std::fs::File;
use std::io::Write;
use std::path::Path;

#[cfg(feature = "simd-accel")]
extern crate cc;

const FIELD_SIZE: usize = 256;

const GENERATING_POLYNOMIAL: usize = 29;

fn gen_log_table(polynomial: usize) -> [u8; FIELD_SIZE] {
    let mut result: [u8; FIELD_SIZE] = [0; FIELD_SIZE];
    let mut b: usize = 1;

    for log in 0..FIELD_SIZE - 1 {
        result[b] = log as u8;

        b = b << 1;

        if FIELD_SIZE <= b {
            b = (b - FIELD_SIZE) ^ polynomial;
        }
    }

    result
}

const EXP_TABLE_SIZE: usize = FIELD_SIZE * 2 - 2;

fn gen_exp_table(log_table: &[u8; FIELD_SIZE]) -> [u8; EXP_TABLE_SIZE] {
    let mut result: [u8; EXP_TABLE_SIZE] = [0; EXP_TABLE_SIZE];

    for i in 1..FIELD_SIZE {
        let log = log_table[i] as usize;
        result[log] = i as u8;
        result[log + FIELD_SIZE - 1] = i as u8;
    }

    result
}

fn multiply(log_table: &[u8; FIELD_SIZE], exp_table: &[u8; EXP_TABLE_SIZE], a: u8, b: u8) -> u8 {
    if a == 0 || b == 0 {
        0
    } else {
        let log_a = log_table[a as usize];
        let log_b = log_table[b as usize];
        let log_result = log_a as usize + log_b as usize;
        exp_table[log_result]
    }
}

fn gen_mul_table(
    log_table: &[u8; FIELD_SIZE],
    exp_table: &[u8; EXP_TABLE_SIZE],
) -> [[u8; FIELD_SIZE]; FIELD_SIZE] {
    let mut result: [[u8; FIELD_SIZE]; FIELD_SIZE] = [[0; 256]; 256];

    for a in 0..FIELD_SIZE {
        for b in 0..FIELD_SIZE {
            result[a][b] = multiply(log_table, exp_table, a as u8, b as u8);
        }
    }

    result
}

fn gen_mul_table_half(
    log_table: &[u8; FIELD_SIZE],
    exp_table: &[u8; EXP_TABLE_SIZE],
) -> ([[u8; 16]; FIELD_SIZE], [[u8; 16]; FIELD_SIZE]) {
    let mut low: [[u8; 16]; FIELD_SIZE] = [[0; 16]; FIELD_SIZE];
    let mut high: [[u8; 16]; FIELD_SIZE] = [[0; 16]; FIELD_SIZE];

    for a in 0..low.len() {
        for b in 0..low.len() {
            let mut result = 0;
            if !(a == 0 || b == 0) {
                let log_a = log_table[a];
                let log_b = log_table[b];
                result = exp_table[log_a as usize + log_b as usize];
            }
            if (b & 0x0F) == b {
                low[a][b] = result;
            }
            if (b & 0xF0) == b {
                high[a][b >> 4] = result;
            }
        }
    }
    (low, high)
}

macro_rules! write_table {
    (1D => $file:ident, $table:ident, $name:expr, $type:expr) => {{
        let len = $table.len();
        let mut table_str = String::from(format!("pub static {}: [{}; {}] = [", $name, $type, len));

        for v in $table.iter() {
            let str = format!("{}, ", v);
            table_str.push_str(&str);
        }

        table_str.push_str("];\n");

        $file.write_all(table_str.as_bytes()).unwrap();
    }};
    (2D => $file:ident, $table:ident, $name:expr, $type:expr) => {{
        let rows = $table.len();
        let cols = $table[0].len();
        let mut table_str = String::from(format!(
            "pub static {}: [[{}; {}]; {}] = [",
            $name, $type, cols, rows
        ));

        for a in $table.iter() {
            table_str.push_str("[");
            for b in a.iter() {
                let str = format!("{}, ", b);
                table_str.push_str(&str);
            }
            table_str.push_str("],\n");
        }

        table_str.push_str("];\n");

        $file.write_all(table_str.as_bytes()).unwrap();
    }};
}

fn write_tables() {
    let log_table = gen_log_table(GENERATING_POLYNOMIAL);
    let exp_table = gen_exp_table(&log_table);
    let mul_table = gen_mul_table(&log_table, &exp_table);

    let out_dir = env::var("OUT_DIR").unwrap();
    let dest_path = Path::new(&out_dir).join("table.rs");
    let mut f = File::create(&dest_path).unwrap();

    write_table!(1D => f, log_table, "LOG_TABLE", "u8");
    write_table!(1D => f, exp_table, "EXP_TABLE", "u8");
    write_table!(2D => f, mul_table, "MUL_TABLE", "u8");

    if cfg!(feature = "simd-accel") {
        let (mul_table_low, mul_table_high) = gen_mul_table_half(&log_table, &exp_table);

        write_table!(2D => f, mul_table_low, "MUL_TABLE_LOW", "u8");
        write_table!(2D => f, mul_table_high, "MUL_TABLE_HIGH", "u8");
    }
}

#[cfg(all(
    feature = "simd-accel",
    any(target_arch = "x86_64", target_arch = "aarch64"),
    not(target_env = "msvc"),
    not(any(target_os = "android", target_os = "ios"))
))]
fn compile_simd_c() {
    let mut build = cc::Build::new();
    build.opt_level(3);

    match env::var("RUST_REED_SOLOMON_ERASURE_ARCH") {
        Ok(arch) => {
            // Use explicitly specified environment variable as architecture.
            build.flag(&format!("-march={}", arch));
        }
        Err(_error) => {
            // On x86-64, enabling the Haswell architecture unlocks useful instructions and improves
            // performance dramatically while still allowing the result to run on any modern CPU.
            match env::var("CARGO_CFG_TARGET_ARCH").unwrap().as_str() {
                "x86_64" => {
                    build.flag(&"-march=haswell");
                }
                _ => (),
            }
        }
    }

    build
        .flag("-std=c11")
        .file("simd_c/reedsolomon.c")
        .compile("reedsolomon");
}

#[cfg(not(all(
    feature = "simd-accel",
    any(target_arch = "x86_64", target_arch = "aarch64"),
    not(target_env = "msvc"),
    not(any(target_os = "android", target_os = "ios"))
)))]
fn compile_simd_c() {}

fn main() {
    compile_simd_c();
    write_tables();
}
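The log/exp construction in build.rs above can be exercised on its own. The following standalone sketch mirrors `gen_log_table`/`gen_exp_table`/`multiply` with the same generating polynomial 29 (x^8 + x^4 + x^3 + x^2 + 1) and checks a few GF(2^8) identities; the names here are illustrative, not part of the vendored file.

```rust
// Standalone check of the GF(2^8) tables that build.rs generates at compile time.
const FIELD_SIZE: usize = 256;
const POLY: usize = 29; // x^8 + x^4 + x^3 + x^2 + 1, without the x^8 term
const EXP_SIZE: usize = FIELD_SIZE * 2 - 2; // doubled so log_a + log_b needs no modulo

fn gen_log_table() -> [u8; FIELD_SIZE] {
    let mut log = [0u8; FIELD_SIZE];
    let mut b: usize = 1;
    for l in 0..FIELD_SIZE - 1 {
        log[b] = l as u8;
        b <<= 1;
        if b >= FIELD_SIZE {
            b = (b - FIELD_SIZE) ^ POLY; // reduce by the field polynomial
        }
    }
    log
}

fn gen_exp_table(log: &[u8; FIELD_SIZE]) -> [u8; EXP_SIZE] {
    let mut exp = [0u8; EXP_SIZE];
    for i in 1..FIELD_SIZE {
        let l = log[i] as usize;
        exp[l] = i as u8;
        exp[l + FIELD_SIZE - 1] = i as u8; // second copy avoids a modulo in mul
    }
    exp
}

fn mul(log: &[u8; FIELD_SIZE], exp: &[u8; EXP_SIZE], a: u8, b: u8) -> u8 {
    if a == 0 || b == 0 {
        0
    } else {
        exp[log[a as usize] as usize + log[b as usize] as usize]
    }
}

fn main() {
    let log = gen_log_table();
    let exp = gen_exp_table(&log);
    assert_eq!(mul(&log, &exp, 2, 2), 4); // no reduction needed below x^8
    assert_eq!(mul(&log, &exp, 0x80, 2), 0x1d); // x^8 reduces via the polynomial
    // Every nonzero element has a multiplicative inverse -- the property
    // Reed-Solomon matrix inversion relies on.
    for a in 1..=255u8 {
        let inv = exp[(255 - log[a as usize] as usize) % 255];
        assert_eq!(mul(&log, &exp, a, inv), 1);
    }
    println!("all GF(256) checks passed");
}
```

The two small assertions match the entries one would expect in the generated `LOG_TABLE`/`EXP_TABLE`; the loop verifies the group structure of the nonzero elements.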
26  seaweed-volume/vendor/reed-solomon-erasure/sage/galois_ext_test.sage  (vendored, new file)
@@ -0,0 +1,26 @@
GF256.<a> = FiniteField(256)
R.<x> = GF256[x]
ext_poly = R.irreducible_element(2, algorithm="first_lexicographic")
ExtField.<b> = GF256.extension(ext_poly)
print ExtField
print len(ExtField)

x^2 + a*x + a^7

e1 = (a^7 + a^6 + a^4 + a)*b + a^3 + a^2 + a + 1
e2 = (a^7 + a^5 + a^2)*b + a^7 + a^4 + a^3 + a

print "e1: ", e1
print "e2: ", e2

print "e1 + e2: ", e1 + e2
#(a^6 + a^5 + a^4 + a^2 + a)*b + a^7 + a^4 + a^2 + 1

print "e1 * e2: ", e1 * e2
#(a^4 + a^2 + a + 1)*b + a^7 + a^5 + a^3 + a

print "e1 / e2: ", e1 / e2
#(a^7 + a^6 + a^5 + a^4 + a^3 + a^2 + 1)*b + a^6 + a^3 + a

print "1/b: ", 1/b
#(a^4 + a^3 + a + 1)*b + a^5 + a^4 + a^2 + a
574  seaweed-volume/vendor/reed-solomon-erasure/simd_c/reedsolomon.c  (vendored, new file)
@@ -0,0 +1,574 @@
/* reedsolomon.c - SIMD-optimized Galois-field multiplication routines
 *
 * Copyright (c) 2015, 2016 Nicolas Trangez
 * Copyright (c) 2015 Klaus Post
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to
 * deal in the Software without restriction, including without limitation the
 * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
 * sell copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE
 */

#if HAVE_CONFIG_H
# include "config.h"
#endif

#include <stdint.h>
#include <string.h>

//#if defined(__SSE2__) && __SSE2__ && defined(HAVE_EMMINTRIN_H) && HAVE_EMMINTRIN_H
//#ifdef __SSE2__
#if defined(__SSE2__) && __SSE2__
# define USE_SSE2 1
# undef VECTOR_SIZE
# define VECTOR_SIZE 16
# include <emmintrin.h>
#else
# define USE_SSE2 0
#endif

//#if defined(__SSSE3__) && __SSSE3__ && defined(HAVE_TMMINTRIN_H) && HAVE_TMMINTRIN_H
//#ifdef __SSSE3__
#if defined(__SSSE3__) && __SSSE3__
# define USE_SSSE3 1
# undef VECTOR_SIZE
# define VECTOR_SIZE 16
# include <tmmintrin.h>
#else
# define USE_SSSE3 0
#endif

//#if defined(__AVX2__) && __AVX2__ && defined(HAVE_IMMINTRIN_H) && HAVE_IMMINTRIN_H
//#ifdef __AVX2__
#if defined(__AVX2__) && __AVX2__
# define USE_AVX2 1
# undef VECTOR_SIZE
# define VECTOR_SIZE 32
# include <immintrin.h>
#else
# define USE_AVX2 0
#endif


#if defined(__AVX512F__) && __AVX512F__
# define USE_AVX512 1
# undef VECTOR_SIZE
# define VECTOR_SIZE 64
# include <immintrin.h>
#else
# define USE_AVX512 0
#endif


/*#if ((defined(__ARM_NEON__) && __ARM_NEON__) \
    || (defined(__ARM_NEON) && __ARM_NEON) \
    || (defined(__aarch64__) && __aarch64__)) \
    && defined(HAVE_ARM_NEON_H) && HAVE_ARM_NEON_H*/
#if ((defined(__ARM_NEON__) && __ARM_NEON__) \
    || (defined(__ARM_NEON) && __ARM_NEON) \
    || (defined(__aarch64__) && __aarch64__))
# define USE_ARM_NEON 1
# undef VECTOR_SIZE
# define VECTOR_SIZE 16
# include <arm_neon.h>
#else
# define USE_ARM_NEON 0
#endif

//#if defined(__ALTIVEC__) && __ALTIVEC__ && defined(HAVE_ALTIVEC_H) && HAVE_ALTIVEC_H
#if defined(__ALTIVEC__) && __ALTIVEC__
# define USE_ALTIVEC 1
# undef VECTOR_SIZE
# define VECTOR_SIZE 16
# include <altivec.h>
#else
# define USE_ALTIVEC 0
#endif

#ifndef VECTOR_SIZE
/* 'Generic' code */
# define VECTOR_SIZE 16
#endif

# define USE_ALIGNED_ACCESS 0
# define ALIGNED_ACCESS __attribute__((unused))
# define UNALIGNED_ACCESS

#include "reedsolomon.h"

#if defined(HAVE_FUNC_ATTRIBUTE_HOT) && HAVE_FUNC_ATTRIBUTE_HOT
# define HOT_FUNCTION __attribute__((hot))
#else
# define HOT_FUNCTION
#endif

#if defined(HAVE_FUNC_ATTRIBUTE_CONST) && HAVE_FUNC_ATTRIBUTE_CONST
# define CONST_FUNCTION __attribute__((const))
#else
# define CONST_FUNCTION
#endif

#if defined(HAVE_FUNC_ATTRIBUTE_ALWAYS_INLINE) && HAVE_FUNC_ATTRIBUTE_ALWAYS_INLINE
# define ALWAYS_INLINE inline __attribute__((always_inline))
#else
# define ALWAYS_INLINE inline
#endif

#if defined(HAVE_FUNC_ATTRIBUTE_FORCE_ALIGN_ARG_POINTER) && HAVE_FUNC_ATTRIBUTE_FORCE_ALIGN_ARG_POINTER
# define FORCE_ALIGN_ARG_POINTER __attribute__((force_align_arg_pointer))
#else
# define FORCE_ALIGN_ARG_POINTER
#endif

#define CONCAT_HELPER(a, b) a ## b
#define CONCAT(a, b) CONCAT_HELPER(a, b)

typedef uint8_t v16u8v __attribute__((vector_size(16), aligned(1)));
typedef uint64_t v2u64v __attribute__((vector_size(16), aligned(1)));

#define T(t, n) t n[VSIZE / 8 / sizeof(t)]
#define T1(t, n) t n

#define VSIZE 128
typedef union {
    T(uint8_t, u8);
    T(uint64_t, u64);
#if USE_SSE2
    T1(__m128i, m128i);
#endif
#if USE_ARM_NEON
    T1(uint8x16_t, uint8x16);
    T1(uint8x8x2_t, uint8x8x2);
#endif
#if USE_ALTIVEC
    T1(__vector uint8_t, uint8x16);
    T1(__vector uint64_t, uint64x2);
#endif
    T1(v16u8v, v16u8);
    T1(v2u64v, v2u64);
} v128 __attribute__((aligned(1)));
#undef VSIZE

#define VSIZE 256
typedef union {
    T(uint8_t, u8);
#if USE_AVX2
    __m256i m256i;
#endif
} v256 __attribute__((aligned(1)));
#undef VSIZE

#define VSIZE 512
typedef union {
    T(uint8_t, u8);
#if USE_AVX512
    __m512i m512i;
#endif
} v512 __attribute__((aligned(1)));

#undef T
#undef T1

#if VECTOR_SIZE == 16
typedef v128 v;
#elif VECTOR_SIZE == 32
typedef v256 v;
#elif VECTOR_SIZE == 64
typedef v512 v;
#else
# error Unsupported VECTOR_SIZE
#endif

static ALWAYS_INLINE UNALIGNED_ACCESS v128 loadu_v128(const uint8_t *in) {
#if USE_SSE2
    const v128 result = { .m128i = _mm_loadu_si128((const __m128i *)in) };
#else
    v128 result;
    memcpy(&result.u64, in, sizeof(result.u64));
#endif

    return result;
}

static ALWAYS_INLINE UNALIGNED_ACCESS v loadu_v(const uint8_t *in) {
#if USE_AVX512
    const v512 result = { .m512i = _mm512_loadu_si512((const __m512i *)in) };
#elif USE_AVX2
    const v256 result = { .m256i = _mm256_loadu_si256((const __m256i *)in) };
#else
    const v128 result = loadu_v128(in);
#endif

    return result;
}

static ALWAYS_INLINE ALIGNED_ACCESS v load_v(const uint8_t *in) {
#if USE_AVX512
    const v512 result = { .m512i = _mm512_load_si512((const __m512i *)in) };
#elif USE_AVX2
    const v256 result = { .m256i = _mm256_load_si256((const __m256i *)in) };
#elif USE_SSE2
    const v128 result = { .m128i = _mm_load_si128((const __m128i *)in) };
#elif USE_ARM_NEON
    const v128 result = { .uint8x16 = vld1q_u8(in) };
#elif USE_ALTIVEC
    const v128 result = { .uint8x16 = vec_ld(0, in) };
#else
    const v128 result = loadu_v128(in);
#endif

    return result;
}

static ALWAYS_INLINE CONST_FUNCTION v set1_epi8_v(const uint8_t c) {
#if USE_AVX512
    const v512 result = { .m512i = _mm512_set1_epi8(c) };
#elif USE_AVX2
    const v256 result = { .m256i = _mm256_set1_epi8(c) };
#elif USE_SSE2
    const v128 result = { .m128i = _mm_set1_epi8(c) };
#elif USE_ARM_NEON
    const v128 result = { .uint8x16 = vdupq_n_u8(c) };
#elif USE_ALTIVEC
    const v128 result = { .uint8x16 = { c, c, c, c, c, c, c, c,
                                        c, c, c, c, c, c, c, c } };
#else
    uint64_t c2 = c,
             tmp = (c2 << (7 * 8)) |
                   (c2 << (6 * 8)) |
                   (c2 << (5 * 8)) |
                   (c2 << (4 * 8)) |
                   (c2 << (3 * 8)) |
                   (c2 << (2 * 8)) |
                   (c2 << (1 * 8)) |
                   (c2 << (0 * 8));
    const v128 result = { .u64 = { tmp, tmp } };
#endif

    return result;
}

static ALWAYS_INLINE CONST_FUNCTION v srli_epi64_v(const v in /*, const unsigned int n*/) {
    // TODO: Hard code n to 4 to avoid build issues on M1 Macs (the
    // `USE_ARM_NEON` path below) where apple clang is failing to
    // recognize the constant `n`.
    //
    // See https://github.com/rust-rse/reed-solomon-erasure/pull/92
    //
#define n 4
#if USE_AVX512
    const v512 result = { .m512i = _mm512_srli_epi64(in.m512i, n) };
#elif USE_AVX2
    const v256 result = { .m256i = _mm256_srli_epi64(in.m256i, n) };
#elif USE_SSE2
    const v128 result = { .m128i = _mm_srli_epi64(in.m128i, n) };
#elif USE_ARM_NEON
    const v128 result = { .uint8x16 = vshrq_n_u8(in.uint8x16, n) };
#elif USE_ALTIVEC
# if RS_HAVE_VEC_VSRD
    const v128 shift = { .v2u64 = { n, n } },
               result = { .uint64x2 = vec_vsrd(in.v2u64, shift.v2u64) };
# else
    const v128 result = { .v2u64 = in.v2u64 >> n };
# endif
#else
    const v128 result = { .u64 = { in.u64[0] >> n,
                                   in.u64[1] >> n } };
#endif
#undef n
    return result;
}

static ALWAYS_INLINE CONST_FUNCTION v and_v(const v a, const v b) {
#if USE_AVX512
    const v512 result = { .m512i = _mm512_and_si512(a.m512i, b.m512i) };
#elif USE_AVX2
    const v256 result = { .m256i = _mm256_and_si256(a.m256i, b.m256i) };
#elif USE_SSE2
    const v128 result = { .m128i = _mm_and_si128(a.m128i, b.m128i) };
#elif USE_ARM_NEON
    const v128 result = { .uint8x16 = vandq_u8(a.uint8x16, b.uint8x16) };
#elif USE_ALTIVEC
    const v128 result = { .uint8x16 = vec_and(a.uint8x16, b.uint8x16) };
#else
    const v128 result = { .v2u64 = a.v2u64 & b.v2u64 };
#endif

    return result;
}

static ALWAYS_INLINE CONST_FUNCTION v xor_v(const v a, const v b) {
#if USE_AVX512
    const v512 result = { .m512i = _mm512_xor_si512(a.m512i, b.m512i) };
#elif USE_AVX2
    const v256 result = { .m256i = _mm256_xor_si256(a.m256i, b.m256i) };
|
||||
#elif USE_SSE2
|
||||
const v128 result = { .m128i = _mm_xor_si128(a.m128i, b.m128i) };
|
||||
#elif USE_ARM_NEON
|
||||
const v128 result = { .uint8x16 = veorq_u8(a.uint8x16, b.uint8x16) };
|
||||
#elif USE_ALTIVEC
|
||||
const v128 result = { .uint8x16 = vec_xor(a.uint8x16, b.uint8x16) };
|
||||
#else
|
||||
const v128 result = { .v2u64 = a.v2u64 ^ b.v2u64 };
|
||||
#endif
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
static ALWAYS_INLINE CONST_FUNCTION v shuffle_epi8_v(const v vec, const v mask) {
|
||||
#if USE_AVX512
|
||||
const v512 result = { .m512i = _mm512_shuffle_epi8(vec.m512i, mask.m512i) };
|
||||
#elif USE_AVX2
|
||||
const v256 result = { .m256i = _mm256_shuffle_epi8(vec.m256i, mask.m256i) };
|
||||
#elif USE_SSSE3
|
||||
const v128 result = { .m128i = _mm_shuffle_epi8(vec.m128i, mask.m128i) };
|
||||
#elif USE_ARM_NEON
|
||||
# if defined(RS_HAVE_VQTBL1Q_U8) && RS_HAVE_VQTBL1Q_U8
|
||||
const v128 result = { .uint8x16 = vqtbl1q_u8(vec.uint8x16, mask.uint8x16) };
|
||||
# else
|
||||
/* There's no NEON instruction mapping 1-to-1 to _mm_shuffle_epi8, but
|
||||
* this should have the same result...
|
||||
*/
|
||||
const v128 result = { .uint8x16 = vcombine_u8(vtbl2_u8(vec.uint8x8x2,
|
||||
vget_low_u8(mask.uint8x16)),
|
||||
vtbl2_u8(vec.uint8x8x2,
|
||||
vget_high_u8(mask.uint8x16))) };
|
||||
|
||||
# endif
|
||||
#elif USE_ALTIVEC
|
||||
const v128 zeros = set1_epi8_v(0),
|
||||
result = { .uint8x16 = vec_perm(vec.uint8x16, zeros.uint8x16, mask.uint8x16) };
|
||||
#elif defined(RS_HAVE_BUILTIN_SHUFFLE) && RS_HAVE_BUILTIN_SHUFFLE
|
||||
const v16u8v zeros = { 0, 0, 0, 0, 0, 0, 0, 0
|
||||
, 0, 0, 0, 0, 0, 0, 0, 0 };
|
||||
const v128 result = { .v16u8 = __builtin_shuffle(vec.v16u8, zeros, mask.v16u8) };
|
||||
#else
|
||||
v128 result = { .u64 = { 0, 0 } };
|
||||
|
||||
# define DO_BYTE(i) \
|
||||
result.u8[i] = mask.u8[i] & 0x80 ? 0 : vec.u8[mask.u8[i] & 0x0F];
|
||||
|
||||
DO_BYTE( 0); DO_BYTE( 1); DO_BYTE( 2); DO_BYTE( 3);
|
||||
DO_BYTE( 4); DO_BYTE( 5); DO_BYTE( 6); DO_BYTE( 7);
|
||||
DO_BYTE( 8); DO_BYTE( 9); DO_BYTE(10); DO_BYTE(11);
|
||||
DO_BYTE(12); DO_BYTE(13); DO_BYTE(14); DO_BYTE(15);
|
||||
#endif
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
static ALWAYS_INLINE UNALIGNED_ACCESS void storeu_v(uint8_t *out, const v vec) {
|
||||
#if USE_AVX512
|
||||
_mm512_storeu_si512((__m512i *)out, vec.m512i);
|
||||
#elif USE_AVX2
|
||||
_mm256_storeu_si256((__m256i *)out, vec.m256i);
|
||||
#elif USE_SSE2
|
||||
_mm_storeu_si128((__m128i *)out, vec.m128i);
|
||||
#else
|
||||
memcpy(out, &vec.u64, sizeof(vec.u64));
|
||||
#endif
|
||||
}
|
||||
|
||||
static ALWAYS_INLINE ALIGNED_ACCESS void store_v(uint8_t *out, const v vec) {
|
||||
#if USE_AVX512
|
||||
_mm512_store_si512((__m512i *)out, vec.m512i);
|
||||
#elif USE_AVX2
|
||||
_mm256_store_si256((__m256i *)out, vec.m256i);
|
||||
#elif USE_SSE2
|
||||
_mm_store_si128((__m128i *)out, vec.m128i);
|
||||
#elif USE_ARM_NEON
|
||||
vst1q_u8(out, vec.uint8x16);
|
||||
#elif USE_ALTIVEC
|
||||
vec_st(vec.uint8x16, 0, out);
|
||||
#else
|
||||
storeu_v(out, vec);
|
||||
#endif
|
||||
}
|
||||
|
||||
static ALWAYS_INLINE CONST_FUNCTION v replicate_v128_v(const v128 vec) {
|
||||
#if USE_AVX512
|
||||
const v512 result = { .m512i = _mm512_broadcast_i32x4(vec.m128i) };
|
||||
#elif USE_AVX2
|
||||
const v256 result = { .m256i = _mm256_broadcastsi128_si256(vec.m128i) };
|
||||
#else
|
||||
const v128 result = vec;
|
||||
#endif
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
|
||||
//+build !noasm !appengine
|
||||
|
||||
// Copyright 2015, Klaus Post, see LICENSE for details.
|
||||
|
||||
// Based on http://www.snia.org/sites/default/files2/SDC2013/presentations/NewThinking/EthanMiller_Screaming_Fast_Galois_Field%20Arithmetic_SIMD%20Instructions.pdf
|
||||
// and http://jerasure.org/jerasure/gf-complete/tree/master
|
||||
|
||||
/*
|
||||
// func galMulSSSE3Xor(low, high, in, out []byte)
|
||||
TEXT ·galMulSSSE3Xor(SB), 7, $0
|
||||
MOVQ low+0(FP),SI // SI: &low
|
||||
MOVQ high+24(FP),DX // DX: &high
|
||||
MOVOU (SI), X6 // X6 low
|
||||
MOVOU (DX), X7 // X7: high
|
||||
MOVQ $15, BX // BX: low mask
|
||||
MOVQ BX, X8
|
||||
PXOR X5, X5
|
||||
MOVQ in+48(FP),SI // R11: &in
|
||||
MOVQ in_len+56(FP),R9 // R9: len(in)
|
||||
MOVQ out+72(FP), DX // DX: &out
|
||||
PSHUFB X5, X8 // X8: lomask (unpacked)
|
||||
SHRQ $4, R9 // len(in) / 16
|
||||
CMPQ R9 ,$0
|
||||
JEQ done_xor
|
||||
loopback_xor:
|
||||
MOVOU (SI),X0 // in[x]
|
||||
MOVOU (DX),X4 // out[x]
|
||||
MOVOU X0, X1 // in[x]
|
||||
MOVOU X6, X2 // low copy
|
||||
MOVOU X7, X3 // high copy
|
||||
PSRLQ $4, X1 // X1: high input
|
||||
PAND X8, X0 // X0: low input
|
||||
PAND X8, X1 // X0: high input
|
||||
PSHUFB X0, X2 // X2: mul low part
|
||||
PSHUFB X1, X3 // X3: mul high part
|
||||
PXOR X2, X3 // X3: Result
|
||||
PXOR X4, X3 // X3: Result xor existing out
|
||||
MOVOU X3, (DX) // Store
|
||||
ADDQ $16, SI // in+=16
|
||||
ADDQ $16, DX // out+=16
|
||||
SUBQ $1, R9
|
||||
JNZ loopback_xor
|
||||
done_xor:
|
||||
RET
|
||||
|
||||
// func galMulSSSE3(low, high, in, out []byte)
|
||||
TEXT ·galMulSSSE3(SB), 7, $0
|
||||
MOVQ low+0(FP),SI // SI: &low
|
||||
MOVQ high+24(FP),DX // DX: &high
|
||||
MOVOU (SI), X6 // X6 low
|
||||
MOVOU (DX), X7 // X7: high
|
||||
MOVQ $15, BX // BX: low mask
|
||||
MOVQ BX, X8
|
||||
PXOR X5, X5
|
||||
MOVQ in+48(FP),SI // R11: &in
|
||||
MOVQ in_len+56(FP),R9 // R9: len(in)
|
||||
MOVQ out+72(FP), DX // DX: &out
|
||||
PSHUFB X5, X8 // X8: lomask (unpacked)
|
||||
SHRQ $4, R9 // len(in) / 16
|
||||
CMPQ R9 ,$0
|
||||
JEQ done
|
||||
loopback:
|
||||
MOVOU (SI),X0 // in[x]
|
||||
MOVOU X0, X1 // in[x]
|
||||
MOVOU X6, X2 // low copy
|
||||
MOVOU X7, X3 // high copy
|
||||
PSRLQ $4, X1 // X1: high input
|
||||
PAND X8, X0 // X0: low input
|
||||
PAND X8, X1 // X0: high input
|
||||
PSHUFB X0, X2 // X2: mul low part
|
||||
PSHUFB X1, X3 // X3: mul high part
|
||||
PXOR X2, X3 // X3: Result
|
||||
MOVOU X3, (DX) // Store
|
||||
ADDQ $16, SI // in+=16
|
||||
ADDQ $16, DX // out+=16
|
||||
SUBQ $1, R9
|
||||
JNZ loopback
|
||||
done:
|
||||
RET
|
||||
*/
|
||||
|
||||
static ALWAYS_INLINE v reedsolomon_gal_mul_v(
|
||||
const v low_mask_unpacked,
|
||||
const v low_vector,
|
||||
const v high_vector,
|
||||
|
||||
v (*modifier)(const v new, const v old),
|
||||
|
||||
const v in_x,
|
||||
const v old) {
|
||||
const v low_input = and_v(in_x, low_mask_unpacked),
|
||||
in_x_shifted = srli_epi64_v(in_x /*, 4*/),
|
||||
high_input = and_v(in_x_shifted, low_mask_unpacked),
|
||||
|
||||
mul_low_part = shuffle_epi8_v(low_vector, low_input),
|
||||
mul_high_part = shuffle_epi8_v(high_vector, high_input),
|
||||
|
||||
new = xor_v(mul_low_part, mul_high_part),
|
||||
result = modifier(new, old);
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
static ALWAYS_INLINE PROTO_RETURN reedsolomon_gal_mul_impl(
|
||||
PROTO_ARGS,
|
||||
v (*modifier)(const v new, const v old)) {
|
||||
const v low_mask_unpacked = set1_epi8_v(0x0f);
|
||||
|
||||
const v128 low_vector128 = loadu_v128(low),
|
||||
high_vector128 = loadu_v128(high);
|
||||
const v low_vector = replicate_v128_v(low_vector128),
|
||||
high_vector = replicate_v128_v(high_vector128);
|
||||
|
||||
size_t done = 0;
|
||||
|
||||
#if USE_ALIGNED_ACCESS
|
||||
# define LOAD(addr) load_v(addr)
|
||||
# define STORE(addr, vec) store_v(addr, vec)
|
||||
#else
|
||||
# define LOAD(addr) loadu_v(addr)
|
||||
# define STORE(addr, vec) storeu_v(addr, vec)
|
||||
#endif
|
||||
|
||||
#if RS_HAVE_CLANG_LOOP_UNROLL
|
||||
# pragma clang loop unroll(enable)
|
||||
#endif
|
||||
for(size_t x = 0; x < len / sizeof(v); x++) {
|
||||
const v in_x = LOAD(&in[done]),
|
||||
old = LOAD(&out[done]),
|
||||
result = reedsolomon_gal_mul_v(
|
||||
low_mask_unpacked,
|
||||
low_vector, high_vector,
|
||||
modifier,
|
||||
in_x,
|
||||
old);
|
||||
|
||||
STORE(&out[done], result);
|
||||
|
||||
done += sizeof(v);
|
||||
}
|
||||
|
||||
return done;
|
||||
}
|
||||
|
||||
static ALWAYS_INLINE CONST_FUNCTION v noop(const v new, const v old __attribute__((__unused__))) {
|
||||
return new;
|
||||
}
|
||||
|
||||
#ifdef HOT
|
||||
HOT_FUNCTION
|
||||
#endif
|
||||
FORCE_ALIGN_ARG_POINTER PROTO(reedsolomon_gal_mul) {
|
||||
return reedsolomon_gal_mul_impl(low, high, in, out, len, noop);
|
||||
}
|
||||
|
||||
#ifdef HOT
|
||||
HOT_FUNCTION
|
||||
#endif
|
||||
FORCE_ALIGN_ARG_POINTER PROTO(reedsolomon_gal_mul_xor) {
|
||||
return reedsolomon_gal_mul_impl(low, high, in, out, len, xor_v);
|
||||
}
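The kernels above implement the split-table trick from the SNIA paper referenced in the comments: a full GF(2^8) multiply by a constant `c` becomes two 16-entry table lookups, one per nibble of the input byte, combined with XOR, which is exactly what one `PSHUFB` pair does for 16 bytes at a time. A scalar sketch of the same idea (the `gf_mul` reference multiply and its 0x11d reduction polynomial are assumptions standing in for this crate's `galois_8` field, not code from this diff):

```rust
// Reference multiply in GF(2^8): Russian-peasant loop, reducing with the
// 0x11d polynomial (an assumption matching the crate's galois_8 field).
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    for _ in 0..8 {
        if b & 1 != 0 {
            p ^= a;
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1d; // low byte of 0x11d; the x^8 term is the shifted-out bit
        }
        b >>= 1;
    }
    p
}

// Build the two 16-entry tables the SIMD code receives as `low` and `high`:
// products of `c` with every low nibble and with every high nibble.
fn build_tables(c: u8) -> ([u8; 16], [u8; 16]) {
    let mut low = [0u8; 16];
    let mut high = [0u8; 16];
    for n in 0..16u8 {
        low[n as usize] = gf_mul(c, n);
        high[n as usize] = gf_mul(c, n << 4);
    }
    (low, high)
}

// Scalar equivalent of reedsolomon_gal_mul: out[i] = c * in[i], one byte at
// a time, using only the nibble tables.
fn gal_mul_slice(c: u8, input: &[u8], out: &mut [u8]) {
    let (low, high) = build_tables(c);
    for (o, &b) in out.iter_mut().zip(input) {
        *o = low[(b & 0x0f) as usize] ^ high[(b >> 4) as usize];
    }
}
```

This works because GF(2^8) multiplication distributes over XOR: `c·b = c·(b & 0x0f) ⊕ c·(b & 0xf0)`, so the two lookups recombine into the full product.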
seaweed-volume/vendor/reed-solomon-erasure/simd_c/reedsolomon.h (vendored, new file, 54 lines)
@@ -0,0 +1,54 @@
/* reedsolomon.h - SIMD-optimized Galois-field multiplication routines
 *
 * Copyright (c) 2015, 2016 Nicolas Trangez
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to
 * deal in the Software without restriction, including without limitation the
 * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
 * sell copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE
 */

#include <stdint.h>

#if HAVE_CONFIG_H
# include "config.h"
#endif

#define PROTO_RETURN size_t
#define PROTO_ARGS \
        const uint8_t low[16], \
        const uint8_t high[16], \
        const uint8_t *restrict const in, \
        uint8_t *restrict const out, \
        const size_t len
#define PROTO(name) \
        PROTO_RETURN \
        name (PROTO_ARGS)

PROTO(reedsolomon_gal_mul);
PROTO(reedsolomon_gal_mul_xor);

typedef enum {
        REEDSOLOMON_CPU_GENERIC = 0,
        REEDSOLOMON_CPU_SSE2 = 1,
        REEDSOLOMON_CPU_SSSE3 = 2,
        REEDSOLOMON_CPU_AVX = 3,
        REEDSOLOMON_CPU_AVX2 = 4,
        REEDSOLOMON_CPU_NEON = 5,
        REEDSOLOMON_CPU_ALTIVEC = 6,
} reedsolomon_cpu_support;

reedsolomon_cpu_support reedsolomon_determine_cpu_support(void);
seaweed-volume/vendor/reed-solomon-erasure/src/core.rs (vendored, new file, 927 lines)
@@ -0,0 +1,927 @@
extern crate alloc;

use alloc::sync::Arc;
use alloc::vec;
use alloc::vec::Vec;
use core::num::NonZeroUsize;

use smallvec::SmallVec;

use crate::errors::Error;
use crate::errors::SBSError;

use crate::matrix::Matrix;

use lru::LruCache;

#[cfg(feature = "std")]
use parking_lot::Mutex;
#[cfg(not(feature = "std"))]
use spin::Mutex;

use super::Field;
use super::ReconstructShard;

const DATA_DECODE_MATRIX_CACHE_CAPACITY: usize = 254;

// /// Parameters for parallelism.
// #[derive(PartialEq, Debug, Clone, Copy)]
// pub struct ParallelParam {
//     /// Number of bytes to split the slices into for computations
//     /// which can be done in parallel.
//     ///
//     /// Default is 32768.
//     pub bytes_per_encode: usize,
// }

// impl ParallelParam {
//     /// Create a new `ParallelParam` with the given split arity.
//     pub fn new(bytes_per_encode: usize) -> ParallelParam {
//         ParallelParam { bytes_per_encode }
//     }
// }

// impl Default for ParallelParam {
//     fn default() -> Self {
//         ParallelParam::new(32768)
//     }
// }

/// Bookkeeper for shard by shard encoding.
///
/// This is useful for avoiding incorrect use of
/// `encode_single` and `encode_single_sep`.
///
/// # Use cases
///
/// Shard by shard encoding is useful for streamed data encoding,
/// where you do not have all the needed data shards immediately
/// but want to spread out the encoding workload rather than
/// doing all of the encoding after everything is ready.
///
/// A concrete example would be network packet encoding,
/// where encoding packet by packet as you receive them may be more efficient
/// than waiting for N packets and then encoding them all at once.
///
/// # Example
///
/// ```
/// # #[macro_use] extern crate reed_solomon_erasure;
/// # use reed_solomon_erasure::*;
/// # fn main () {
/// use reed_solomon_erasure::galois_8::Field;
/// let r: ReedSolomon<Field> = ReedSolomon::new(3, 2).unwrap();
///
/// let mut sbs = ShardByShard::new(&r);
///
/// let mut shards = shards!([0u8, 1, 2, 3, 4],
///                          [5, 6, 7, 8, 9],
///                          // say we don't have the 3rd data shard yet
///                          // and we want to fill it in later
///                          [0, 0, 0, 0, 0],
///                          [0, 0, 0, 0, 0],
///                          [0, 0, 0, 0, 0]);
///
/// // encode 1st and 2nd data shard
/// sbs.encode(&mut shards).unwrap();
/// sbs.encode(&mut shards).unwrap();
///
/// // fill in 3rd data shard
/// shards[2][0] = 10.into();
/// shards[2][1] = 11.into();
/// shards[2][2] = 12.into();
/// shards[2][3] = 13.into();
/// shards[2][4] = 14.into();
///
/// // now do the encoding
/// sbs.encode(&mut shards).unwrap();
///
/// assert!(r.verify(&shards).unwrap());
/// # }
/// ```
#[derive(PartialEq, Debug)]
pub struct ShardByShard<'a, F: 'a + Field> {
    codec: &'a ReedSolomon<F>,
    cur_input: usize,
}

impl<'a, F: 'a + Field> ShardByShard<'a, F> {
    /// Creates a new instance of the bookkeeping struct.
    pub fn new(codec: &'a ReedSolomon<F>) -> ShardByShard<'a, F> {
        ShardByShard {
            codec,
            cur_input: 0,
        }
    }

    /// Checks if the parity shards are ready to use.
    pub fn parity_ready(&self) -> bool {
        self.cur_input == self.codec.data_shard_count
    }

    /// Resets the bookkeeping data.
    ///
    /// You should call this when you have added and encoded
    /// all data shards, and have finished using the parity shards.
    ///
    /// Returns `SBSError::LeftoverShards` when there are shards encoded
    /// but the parity shards are not ready to use.
    pub fn reset(&mut self) -> Result<(), SBSError> {
        if self.cur_input > 0 && !self.parity_ready() {
            return Err(SBSError::LeftoverShards);
        }

        self.cur_input = 0;

        Ok(())
    }

    /// Resets the bookkeeping data without checking.
    pub fn reset_force(&mut self) {
        self.cur_input = 0;
    }

    /// Returns the current input shard index.
    pub fn cur_input_index(&self) -> usize {
        self.cur_input
    }

    fn return_ok_and_incre_cur_input(&mut self) -> Result<(), SBSError> {
        self.cur_input += 1;
        Ok(())
    }

    fn sbs_encode_checks<U: AsRef<[F::Elem]> + AsMut<[F::Elem]>>(
        &mut self,
        slices: &mut [U],
    ) -> Result<(), SBSError> {
        let internal_checks = |codec: &ReedSolomon<F>, data: &mut [U]| {
            check_piece_count!(all => codec, data);
            check_slices!(multi => data);

            Ok(())
        };

        if self.parity_ready() {
            return Err(SBSError::TooManyCalls);
        }

        match internal_checks(self.codec, slices) {
            Ok(()) => Ok(()),
            Err(e) => Err(SBSError::RSError(e)),
        }
    }

    fn sbs_encode_sep_checks<T: AsRef<[F::Elem]>, U: AsRef<[F::Elem]> + AsMut<[F::Elem]>>(
        &mut self,
        data: &[T],
        parity: &mut [U],
    ) -> Result<(), SBSError> {
        let internal_checks = |codec: &ReedSolomon<F>, data: &[T], parity: &mut [U]| {
            check_piece_count!(data => codec, data);
            check_piece_count!(parity => codec, parity);
            check_slices!(multi => data, multi => parity);

            Ok(())
        };

        if self.parity_ready() {
            return Err(SBSError::TooManyCalls);
        }

        match internal_checks(self.codec, data, parity) {
            Ok(()) => Ok(()),
            Err(e) => Err(SBSError::RSError(e)),
        }
    }

    /// Constructs the parity shards partially using the current input data shard.
    ///
    /// Returns `SBSError::TooManyCalls` when all input data shards
    /// have already been filled in via `encode`.
    pub fn encode<T, U>(&mut self, mut shards: T) -> Result<(), SBSError>
    where
        T: AsRef<[U]> + AsMut<[U]>,
        U: AsRef<[F::Elem]> + AsMut<[F::Elem]>,
    {
        let shards = shards.as_mut();
        self.sbs_encode_checks(shards)?;

        self.codec.encode_single(self.cur_input, shards).unwrap();

        self.return_ok_and_incre_cur_input()
    }

    /// Constructs the parity shards partially using the current input data shard.
    ///
    /// Returns `SBSError::TooManyCalls` when all input data shards
    /// have already been filled in via `encode`.
    pub fn encode_sep<T: AsRef<[F::Elem]>, U: AsRef<[F::Elem]> + AsMut<[F::Elem]>>(
        &mut self,
        data: &[T],
        parity: &mut [U],
    ) -> Result<(), SBSError> {
        self.sbs_encode_sep_checks(data, parity)?;

        self.codec
            .encode_single_sep(self.cur_input, data[self.cur_input].as_ref(), parity)
            .unwrap();

        self.return_ok_and_incre_cur_input()
    }
}
/// Reed-Solomon erasure code encoder/decoder.
///
/// # Common error handling
///
/// ## For `encode`, `encode_shards`, `verify`, `verify_shards`, `reconstruct`, `reconstruct_data`, `reconstruct_shards`, `reconstruct_data_shards`
///
/// Return `Error::TooFewShards` or `Error::TooManyShards`
/// when the number of provided shards
/// does not match the codec's configuration.
///
/// Return `Error::EmptyShard` when the first shard provided is
/// of zero length.
///
/// Return `Error::IncorrectShardSize` when the provided shards
/// are of different lengths.
///
/// ## For `reconstruct`, `reconstruct_data`, `reconstruct_shards`, `reconstruct_data_shards`
///
/// Return `Error::TooFewShardsPresent` when there are not
/// enough shards for reconstruction.
///
/// Return `Error::InvalidShardFlags` when the number of flags does not match
/// the total number of shards.
///
/// # Variants of encoding methods
///
/// ## `sep`
///
/// Methods ending in `_sep` take an immutable reference to the data shards
/// and a mutable reference to the parity shards.
///
/// They are useful as they do not need to borrow the data shards mutably,
/// so other work that only needs read-only access to the data shards can be done
/// in parallel/concurrently during the encoding.
///
/// Following is a table of all the `sep` variants:
///
/// | not `sep` | `sep` |
/// | --- | --- |
/// | `encode_single` | `encode_single_sep` |
/// | `encode` | `encode_sep` |
///
/// The `sep` variants do similar checks on the provided data shards and
/// parity shards.
///
/// Return `Error::TooFewDataShards`, `Error::TooManyDataShards`,
/// `Error::TooFewParityShards`, or `Error::TooManyParityShards` when applicable.
///
/// ## `single`
///
/// Methods containing `single` facilitate shard by shard encoding, where
/// the parity shards are partially constructed using one data shard at a time.
/// See the `ShardByShard` struct for more details on how shard by shard encoding
/// can be useful.
///
/// They are prone to **misuse**, and it is recommended to use the `ShardByShard`
/// bookkeeping struct instead for shard by shard encoding.
///
/// The ones that are also `sep` are **ESPECIALLY** prone to **misuse**.
/// Only use them when you actually need the flexibility.
///
/// Following is a table of all the shard by shard variants:
///
/// | all shards at once | shard by shard |
/// | --- | --- |
/// | `encode` | `encode_single` |
/// | `encode_sep` | `encode_single_sep` |
///
/// The `single` variants do similar checks on the provided data shards and parity shards,
/// and also do an index check on `i_data`.
///
/// Return `Error::InvalidIndex` if `i_data >= data_shard_count`.
///
/// # Encoding behaviour
///
/// ## For `encode`
///
/// You do not need to clear the parity shards beforehand, as the methods
/// will overwrite them completely.
///
/// ## For `encode_single`, `encode_single_sep`
///
/// Calling them with `i_data` being `0` will overwrite the parity shards
/// completely. If you are using the methods correctly, then you do not need
/// to clear the parity shards beforehand.
///
/// # Variants of verifying methods
///
/// `verify` allocates a buffer on the heap of the same size
/// as the parity shards, encodes the input once using the buffer to store
/// the computed parity shards, then checks whether the provided parity shards
/// match the computed ones.
///
/// `verify_with_buffer` allows you to provide
/// the buffer, to avoid making heap allocation(s) for the buffer in every call.
///
/// The `with_buffer` variants also guarantee that the buffer contains the correct
/// parity shards if the result is `Ok(_)` (i.e. it does not matter whether the
/// verification passed or not; as long as the result is not an error, the buffer
/// will contain the correct parity shards after the call).
///
/// Following is a table of all the `with_buffer` variants:
///
/// | not `with_buffer` | `with_buffer` |
/// | --- | --- |
/// | `verify` | `verify_with_buffer` |
///
/// The `with_buffer` variants also check the dimensions of the buffer and return
/// `Error::TooFewBufferShards`, `Error::TooManyBufferShards`, `Error::EmptyShard`,
/// or `Error::IncorrectShardSize` when applicable.
///
#[derive(Debug)]
pub struct ReedSolomon<F: Field> {
    data_shard_count: usize,
    parity_shard_count: usize,
    total_shard_count: usize,
    matrix: Matrix<F>,
    data_decode_matrix_cache: Mutex<LruCache<Vec<usize>, Arc<Matrix<F>>>>,
}

impl<F: Field> Clone for ReedSolomon<F> {
    fn clone(&self) -> ReedSolomon<F> {
        ReedSolomon::new(self.data_shard_count, self.parity_shard_count)
            .expect("basic checks already passed as precondition of existence of self")
    }
}

impl<F: Field> PartialEq for ReedSolomon<F> {
    fn eq(&self, rhs: &ReedSolomon<F>) -> bool {
        self.data_shard_count == rhs.data_shard_count
            && self.parity_shard_count == rhs.parity_shard_count
    }
}

impl<F: Field> ReedSolomon<F> {
    // AUDIT
    //
    // Error detection responsibilities
    //
    // Terminologies and symbols:
    //   X =A, B, C=> Y: X delegates error checking responsibilities A, B, C to Y
    //   X := A, B, C:   X needs to handle responsibilities A, B, C
    //
    // Encode methods
    //
    // `encode_single` :=
    //   - check index `i_data` within range [0, data shard count)
    //   - check length of `slices` matches total shard count exactly
    //   - check consistency of length of individual slices
    // `encode_single_sep` :=
    //   - check index `i_data` within range [0, data shard count)
    //   - check length of `parity` matches parity shard count exactly
    //   - check consistency of length of individual parity slices
    //   - check length of `single_data` matches length of first parity slice
    // `encode` :=
    //   - check length of `slices` matches total shard count exactly
    //   - check consistency of length of individual slices
    // `encode_sep` :=
    //   - check length of `data` matches data shard count exactly
    //   - check length of `parity` matches parity shard count exactly
    //   - check consistency of length of individual data slices
    //   - check consistency of length of individual parity slices
    //   - check length of first parity slice matches length of first data slice
    //
    // Verify methods
    //
    // `verify` :=
    //   - check length of `slices` matches total shard count exactly
    //   - check consistency of length of individual slices
    //
    //   Generates a buffer, then passes control to `verify_with_buffer`.
    //
    // `verify_with_buffer` :=
    //   - check length of `slices` matches total shard count exactly
    //   - check length of `buffer` matches parity shard count exactly
    //   - check consistency of length of individual slices
    //   - check consistency of length of individual slices in buffer
    //   - check length of first slice in buffer matches length of first slice
    //
    // Reconstruct methods
    //
    // `reconstruct`      =ALL=> `reconstruct_internal`
    // `reconstruct_data` =ALL=> `reconstruct_internal`
    // `reconstruct_internal` :=
    //   - check length of `slices` matches total shard count exactly
    //   - check consistency of length of individual slices
    //   - check length of `slice_present` matches length of `slices`

    fn get_parity_rows(&self) -> SmallVec<[&[F::Elem]; 32]> {
        let mut parity_rows = SmallVec::with_capacity(self.parity_shard_count);
        let matrix = &self.matrix;
        for i in self.data_shard_count..self.total_shard_count {
            parity_rows.push(matrix.get_row(i));
        }

        parity_rows
    }

    fn build_matrix(data_shards: usize, total_shards: usize) -> Matrix<F> {
        let vandermonde = Matrix::vandermonde(total_shards, data_shards);

        let top = vandermonde.sub_matrix(0, 0, data_shards, data_shards);

        vandermonde.multiply(&top.invert().unwrap())
    }

    /// Creates a new instance of the Reed-Solomon erasure code encoder/decoder.
    ///
    /// Returns `Error::TooFewDataShards` if `data_shards == 0`.
    ///
    /// Returns `Error::TooFewParityShards` if `parity_shards == 0`.
    ///
    /// Returns `Error::TooManyShards` if `data_shards + parity_shards > F::ORDER`.
    pub fn new(data_shards: usize, parity_shards: usize) -> Result<ReedSolomon<F>, Error> {
        if data_shards == 0 {
            return Err(Error::TooFewDataShards);
        }
        if parity_shards == 0 {
            return Err(Error::TooFewParityShards);
        }
        if data_shards + parity_shards > F::ORDER {
            return Err(Error::TooManyShards);
        }

        let total_shards = data_shards + parity_shards;

        let matrix = Self::build_matrix(data_shards, total_shards);

        Ok(ReedSolomon {
            data_shard_count: data_shards,
            parity_shard_count: parity_shards,
            total_shard_count: total_shards,
            matrix,
            data_decode_matrix_cache: Mutex::new(LruCache::new(
                NonZeroUsize::new(DATA_DECODE_MATRIX_CACHE_CAPACITY).unwrap(),
            )),
        })
    }

    pub fn data_shard_count(&self) -> usize {
        self.data_shard_count
    }

    pub fn parity_shard_count(&self) -> usize {
        self.parity_shard_count
    }

    pub fn total_shard_count(&self) -> usize {
        self.total_shard_count
    }

    fn code_some_slices<T: AsRef<[F::Elem]>, U: AsMut<[F::Elem]>>(
        &self,
        matrix_rows: &[&[F::Elem]],
        inputs: &[T],
        outputs: &mut [U],
    ) {
        for i_input in 0..self.data_shard_count {
            self.code_single_slice(matrix_rows, i_input, inputs[i_input].as_ref(), outputs);
        }
    }

    fn code_single_slice<U: AsMut<[F::Elem]>>(
        &self,
        matrix_rows: &[&[F::Elem]],
        i_input: usize,
        input: &[F::Elem],
        outputs: &mut [U],
    ) {
        outputs.iter_mut().enumerate().for_each(|(i_row, output)| {
            let matrix_row_to_use = matrix_rows[i_row][i_input];
            let output = output.as_mut();

            if i_input == 0 {
                F::mul_slice(matrix_row_to_use, input, output);
            } else {
                F::mul_slice_add(matrix_row_to_use, input, output);
            }
        })
    }
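`code_single_slice` folds one data shard into every parity shard: the first data shard (`i_input == 0`) overwrites the parity buffers via `F::mul_slice`, and each later shard XOR-accumulates via `F::mul_slice_add`, so after all data shards are processed each parity byte is the XOR of the per-shard products. A self-contained scalar sketch of that accumulation (the `gf_mul` helper and its 0x11d polynomial are assumptions standing in for the crate's `galois_8` field):

```rust
// Stand-in for the crate's GF(2^8) multiply (0x11d polynomial assumed).
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    for _ in 0..8 {
        if b & 1 != 0 {
            p ^= a;
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1d;
        }
        b >>= 1;
    }
    p
}

// Scalar analogue of F::mul_slice: overwrite `out` with c * input.
fn mul_slice(c: u8, input: &[u8], out: &mut [u8]) {
    for (o, &b) in out.iter_mut().zip(input) {
        *o = gf_mul(c, b);
    }
}

// Scalar analogue of F::mul_slice_add: XOR c * input into `out`.
fn mul_slice_add(c: u8, input: &[u8], out: &mut [u8]) {
    for (o, &b) in out.iter_mut().zip(input) {
        *o ^= gf_mul(c, b);
    }
}

// One parity row, mirroring code_single_slice applied for each i_input in
// turn: parity[k] = XOR over i of row[i] * data[i][k].
fn code_parity_row(row: &[u8], data: &[&[u8]], parity: &mut [u8]) {
    for (i, &shard) in data.iter().enumerate() {
        if i == 0 {
            mul_slice(row[i], shard, parity); // first shard overwrites
        } else {
            mul_slice_add(row[i], shard, parity); // later shards accumulate
        }
    }
}
```

Because the first shard overwrites, callers never need to zero the parity buffers beforehand, which is exactly the encoding behaviour the `ReedSolomon` doc comment promises.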
    fn check_some_slices_with_buffer<T, U>(
        &self,
        matrix_rows: &[&[F::Elem]],
        inputs: &[T],
        to_check: &[T],
        buffer: &mut [U],
    ) -> bool
    where
        T: AsRef<[F::Elem]>,
        U: AsRef<[F::Elem]> + AsMut<[F::Elem]>,
    {
        self.code_some_slices(matrix_rows, inputs, buffer);

        let at_least_one_mismatch_present = buffer
            .iter_mut()
            .enumerate()
            .map(|(i, expected_parity_shard)| {
                expected_parity_shard.as_ref() == to_check[i].as_ref()
            })
            .any(|x| !x); // find the first false (some slice differs from the expected one)
        !at_least_one_mismatch_present
    }

    /// Constructs the parity shards partially using only the data shard
    /// indexed by `i_data`.
    ///
    /// The slots where the parity shards sit will be overwritten.
    ///
    /// # Warning
    ///
    /// You must apply this method to the data shards in strict sequential order (0..data shard count),
    /// otherwise the parity shards will be incorrect.
    ///
    /// It is recommended to use the `ShardByShard` bookkeeping struct instead of this method directly.
    pub fn encode_single<T, U>(&self, i_data: usize, mut shards: T) -> Result<(), Error>
    where
        T: AsRef<[U]> + AsMut<[U]>,
        U: AsRef<[F::Elem]> + AsMut<[F::Elem]>,
    {
        let slices = shards.as_mut();

        check_slice_index!(data => self, i_data);
        check_piece_count!(all => self, slices);
        check_slices!(multi => slices);

        // Get the slice of output buffers.
        let (mut_input, output) = slices.split_at_mut(self.data_shard_count);

        let input = mut_input[i_data].as_ref();

        self.encode_single_sep(i_data, input, output)
    }

    /// Constructs the parity shards partially using only the data shard provided.
    ///
    /// The data shard must match the index `i_data`.
    ///
    /// The slots where the parity shards sit will be overwritten.
    ///
    /// # Warning
    ///
    /// You must apply this method to the data shards in strict sequential order (0..data shard count),
|
||||
/// otherwise the parity shards will be incorrect.
|
||||
///
|
||||
/// It is recommended to use the `ShardByShard` bookkeeping struct instead of this method directly.
|
||||
pub fn encode_single_sep<U: AsRef<[F::Elem]> + AsMut<[F::Elem]>>(
|
||||
&self,
|
||||
i_data: usize,
|
||||
single_data: &[F::Elem],
|
||||
parity: &mut [U],
|
||||
) -> Result<(), Error> {
|
||||
check_slice_index!(data => self, i_data);
|
||||
check_piece_count!(parity => self, parity);
|
||||
check_slices!(multi => parity, single => single_data);
|
||||
|
||||
let parity_rows = self.get_parity_rows();
|
||||
|
||||
// Do the coding.
|
||||
self.code_single_slice(&parity_rows, i_data, single_data, parity);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Constructs the parity shards.
|
||||
///
|
||||
/// The slots where the parity shards sit at will be overwritten.
|
||||
pub fn encode<T, U>(&self, mut shards: T) -> Result<(), Error>
|
||||
where
|
||||
T: AsRef<[U]> + AsMut<[U]>,
|
||||
U: AsRef<[F::Elem]> + AsMut<[F::Elem]>,
|
||||
{
|
||||
let slices: &mut [U] = shards.as_mut();
|
||||
|
||||
check_piece_count!(all => self, slices);
|
||||
check_slices!(multi => slices);
|
||||
|
||||
// Get the slice of output buffers.
|
||||
let (input, output) = slices.split_at_mut(self.data_shard_count);
|
||||
|
||||
self.encode_sep(&*input, output)
|
||||
}
|
||||
|
||||
/// Constructs the parity shards using a read-only view into the
|
||||
/// data shards.
|
||||
///
|
||||
/// The slots where the parity shards sit at will be overwritten.
|
||||
pub fn encode_sep<T: AsRef<[F::Elem]>, U: AsRef<[F::Elem]> + AsMut<[F::Elem]>>(
|
||||
&self,
|
||||
data: &[T],
|
||||
parity: &mut [U],
|
||||
) -> Result<(), Error> {
|
||||
check_piece_count!(data => self, data);
|
||||
check_piece_count!(parity => self, parity);
|
||||
check_slices!(multi => data, multi => parity);
|
||||
|
||||
let parity_rows = self.get_parity_rows();
|
||||
|
||||
// Do the coding.
|
||||
self.code_some_slices(&parity_rows, data, parity);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Checks if the parity shards are correct.
|
||||
///
|
||||
/// This is a wrapper of `verify_with_buffer`.
|
||||
pub fn verify<T: AsRef<[F::Elem]>>(&self, slices: &[T]) -> Result<bool, Error> {
|
||||
check_piece_count!(all => self, slices);
|
||||
check_slices!(multi => slices);
|
||||
|
||||
let slice_len = slices[0].as_ref().len();
|
||||
|
||||
let mut buffer: SmallVec<[Vec<F::Elem>; 32]> =
|
||||
SmallVec::with_capacity(self.parity_shard_count);
|
||||
|
||||
for _ in 0..self.parity_shard_count {
|
||||
buffer.push(vec![F::zero(); slice_len]);
|
||||
}
|
||||
|
||||
self.verify_with_buffer(slices, &mut buffer)
|
||||
}
|
||||
|
||||
/// Checks if the parity shards are correct.
|
||||
pub fn verify_with_buffer<T, U>(&self, slices: &[T], buffer: &mut [U]) -> Result<bool, Error>
|
||||
where
|
||||
T: AsRef<[F::Elem]>,
|
||||
U: AsRef<[F::Elem]> + AsMut<[F::Elem]>,
|
||||
{
|
||||
check_piece_count!(all => self, slices);
|
||||
check_piece_count!(parity_buf => self, buffer);
|
||||
check_slices!(multi => slices, multi => buffer);
|
||||
|
||||
let data = &slices[0..self.data_shard_count];
|
||||
let to_check = &slices[self.data_shard_count..];
|
||||
|
||||
let parity_rows = self.get_parity_rows();
|
||||
|
||||
Ok(self.check_some_slices_with_buffer(&parity_rows, data, to_check, buffer))
|
||||
}
|
||||
|
||||
/// Reconstructs all shards.
|
||||
///
|
||||
/// The shards marked not present are only overwritten when no error
|
||||
/// is detected. All provided shards must have the same length.
|
||||
///
|
||||
/// This means if the method returns an `Error`, then nothing is touched.
|
||||
///
|
||||
/// `reconstruct`, `reconstruct_data`, `reconstruct_shards`,
|
||||
/// `reconstruct_data_shards` share the same core code base.
|
||||
pub fn reconstruct<T: ReconstructShard<F>>(&self, slices: &mut [T]) -> Result<(), Error> {
|
||||
self.reconstruct_internal(slices, false)
|
||||
}
|
||||
|
||||
/// Reconstructs only the data shards.
|
||||
///
|
||||
/// The shards marked not present are only overwritten when no error
|
||||
/// is detected. All provided shards must have the same length.
|
||||
///
|
||||
/// This means if the method returns an `Error`, then nothing is touched.
|
||||
///
|
||||
/// `reconstruct`, `reconstruct_data`, `reconstruct_shards`,
|
||||
/// `reconstruct_data_shards` share the same core code base.
|
||||
pub fn reconstruct_data<T: ReconstructShard<F>>(&self, slices: &mut [T]) -> Result<(), Error> {
|
||||
self.reconstruct_internal(slices, true)
|
||||
}
|
||||
|
||||
fn get_data_decode_matrix(
|
||||
&self,
|
||||
valid_indices: &[usize],
|
||||
invalid_indices: &[usize],
|
||||
) -> Arc<Matrix<F>> {
|
||||
{
|
||||
let mut cache = self.data_decode_matrix_cache.lock();
|
||||
if let Some(entry) = cache.get(invalid_indices) {
|
||||
return entry.clone();
|
||||
}
|
||||
}
|
||||
// Pull out the rows of the matrix that correspond to the shards that
|
||||
// we have and build a square matrix. This matrix could be used to
|
||||
// generate the shards that we have from the original data.
|
||||
let mut sub_matrix = Matrix::new(self.data_shard_count, self.data_shard_count);
|
||||
for (sub_matrix_row, &valid_index) in valid_indices.iter().enumerate() {
|
||||
for c in 0..self.data_shard_count {
|
||||
sub_matrix.set(sub_matrix_row, c, self.matrix.get(valid_index, c));
|
||||
}
|
||||
}
|
||||
// Invert the matrix, so we can go from the encoded shards back to the
|
||||
// original data. Then pull out the row that generates the shard that
|
||||
// we want to decode. Note that since this matrix maps back to the
|
||||
// original data, it can be used to create a data shard, but not a
|
||||
// parity shard.
|
||||
let data_decode_matrix = Arc::new(sub_matrix.invert().unwrap());
|
||||
// Cache the inverted matrix for future use keyed on the indices of the
|
||||
// invalid rows.
|
||||
{
|
||||
let data_decode_matrix = data_decode_matrix.clone();
|
||||
let mut cache = self.data_decode_matrix_cache.lock();
|
||||
cache.put(Vec::from(invalid_indices), data_decode_matrix);
|
||||
}
|
||||
data_decode_matrix
|
||||
}
|
||||
|
||||
fn reconstruct_internal<T: ReconstructShard<F>>(
|
||||
&self,
|
||||
shards: &mut [T],
|
||||
data_only: bool,
|
||||
) -> Result<(), Error> {
|
||||
check_piece_count!(all => self, shards);
|
||||
|
||||
let data_shard_count = self.data_shard_count;
|
||||
|
||||
// Quick check: are all of the shards present? If so, there's
|
||||
// nothing to do.
|
||||
let mut number_present = 0;
|
||||
let mut shard_len = None;
|
||||
|
||||
for shard in shards.iter_mut() {
|
||||
if let Some(len) = shard.len() {
|
||||
if len == 0 {
|
||||
return Err(Error::EmptyShard);
|
||||
}
|
||||
number_present += 1;
|
||||
if let Some(old_len) = shard_len {
|
||||
if len != old_len {
|
||||
// mismatch between shards.
|
||||
return Err(Error::IncorrectShardSize);
|
||||
}
|
||||
}
|
||||
shard_len = Some(len);
|
||||
}
|
||||
}
|
||||
|
||||
if number_present == self.total_shard_count {
|
||||
// Cool. All of the shards are there. We don't
|
||||
// need to do anything.
|
||||
return Ok(());
|
||||
}
|
||||
|
||||
// More complete sanity check
|
||||
if number_present < data_shard_count {
|
||||
return Err(Error::TooFewShardsPresent);
|
||||
}
|
||||
|
||||
let shard_len = shard_len.expect("at least one shard present; qed");
|
||||
|
||||
// Pull out an array holding just the shards that
|
||||
// correspond to the rows of the submatrix. These shards
|
||||
// will be the input to the decoding process that re-creates
|
||||
// the missing data shards.
|
||||
//
|
||||
// Also, create an array of indices of the valid rows we do have
|
||||
// and the invalid rows we don't have.
|
||||
//
|
||||
// The valid indices are used to construct the data decode matrix,
|
||||
// the invalid indices are used to key the data decode matrix
|
||||
// in the data decode matrix cache.
|
||||
//
|
||||
// We only need exactly N valid indices, where N = `data_shard_count`,
|
||||
// as the data decode matrix is a N x N matrix, thus only needs
|
||||
// N valid indices for determining the N rows to pick from
|
||||
// `self.matrix`.
|
||||
let mut sub_shards: SmallVec<[&[F::Elem]; 32]> = SmallVec::with_capacity(data_shard_count);
|
||||
let mut missing_data_slices: SmallVec<[&mut [F::Elem]; 32]> =
|
||||
SmallVec::with_capacity(self.parity_shard_count);
|
||||
let mut missing_parity_slices: SmallVec<[&mut [F::Elem]; 32]> =
|
||||
SmallVec::with_capacity(self.parity_shard_count);
|
||||
let mut valid_indices: SmallVec<[usize; 32]> = SmallVec::with_capacity(data_shard_count);
|
||||
let mut invalid_indices: SmallVec<[usize; 32]> = SmallVec::with_capacity(data_shard_count);
|
||||
|
||||
// Separate the shards into groups
|
||||
for (matrix_row, shard) in shards.iter_mut().enumerate() {
|
||||
// get or initialize the shard so we can reconstruct in-place,
|
||||
// but if we are only reconstructing data shard,
|
||||
// do not initialize if the shard is not a data shard
|
||||
let shard_data = if matrix_row >= data_shard_count && data_only {
|
||||
shard.get().ok_or(None)
|
||||
} else {
|
||||
shard.get_or_initialize(shard_len).map_err(Some)
|
||||
};
|
||||
|
||||
match shard_data {
|
||||
Ok(shard) => {
|
||||
if sub_shards.len() < data_shard_count {
|
||||
sub_shards.push(shard);
|
||||
valid_indices.push(matrix_row);
|
||||
} else {
|
||||
// Already have enough shards in `sub_shards`
|
||||
// as we only need N shards, where N = `data_shard_count`,
|
||||
// for the data decode matrix
|
||||
//
|
||||
// So nothing to do here
|
||||
}
|
||||
}
|
||||
Err(None) => {
|
||||
// the shard data is not meant to be initialized here,
|
||||
// but we should still note it missing.
|
||||
invalid_indices.push(matrix_row);
|
||||
}
|
||||
Err(Some(x)) => {
|
||||
// initialized missing shard data.
|
||||
let shard = x?;
|
||||
if matrix_row < data_shard_count {
|
||||
missing_data_slices.push(shard);
|
||||
} else {
|
||||
missing_parity_slices.push(shard);
|
||||
}
|
||||
|
||||
invalid_indices.push(matrix_row);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let data_decode_matrix = self.get_data_decode_matrix(&valid_indices, &invalid_indices);
|
||||
|
||||
// Re-create any data shards that were missing.
|
||||
//
|
||||
// The input to the coding is all of the shards we actually
|
||||
// have, and the output is the missing data shards. The computation
|
||||
// is done using the special decode matrix we just built.
|
||||
let mut matrix_rows: SmallVec<[&[F::Elem]; 32]> =
|
||||
SmallVec::with_capacity(self.parity_shard_count);
|
||||
|
||||
for i_slice in invalid_indices
|
||||
.iter()
|
||||
.cloned()
|
||||
.take_while(|i| i < &data_shard_count)
|
||||
{
|
||||
matrix_rows.push(data_decode_matrix.get_row(i_slice));
|
||||
}
|
||||
|
||||
self.code_some_slices(&matrix_rows, &sub_shards, &mut missing_data_slices);
|
||||
|
||||
if data_only {
|
||||
Ok(())
|
||||
} else {
|
||||
// Now that we have all of the data shards intact, we can
|
||||
// compute any of the parity that is missing.
|
||||
//
|
||||
// The input to the coding is ALL of the data shards, including
|
||||
// any that we just calculated. The output is whichever of the
|
||||
// parity shards were missing.
|
||||
let mut matrix_rows: SmallVec<[&[F::Elem]; 32]> =
|
||||
SmallVec::with_capacity(self.parity_shard_count);
|
||||
let parity_rows = self.get_parity_rows();
|
||||
|
||||
for i_slice in invalid_indices
|
||||
.iter()
|
||||
.cloned()
|
||||
.skip_while(|i| i < &data_shard_count)
|
||||
{
|
||||
matrix_rows.push(parity_rows[i_slice - data_shard_count]);
|
||||
}
|
||||
{
|
||||
// Gather up all the data shards.
|
||||
// old data shards are in `sub_shards`,
|
||||
// new ones are in `missing_data_slices`.
|
||||
let mut i_old_data_slice = 0;
|
||||
let mut i_new_data_slice = 0;
|
||||
|
||||
let mut all_data_slices: SmallVec<[&[F::Elem]; 32]> =
|
||||
SmallVec::with_capacity(data_shard_count);
|
||||
|
||||
let mut next_maybe_good = 0;
|
||||
let mut push_good_up_to = move |data_slices: &mut SmallVec<_>, up_to| {
|
||||
// if next_maybe_good == up_to, this loop is a no-op.
|
||||
for _ in next_maybe_good..up_to {
|
||||
// push all good indices we just skipped.
|
||||
data_slices.push(sub_shards[i_old_data_slice]);
|
||||
i_old_data_slice += 1;
|
||||
}
|
||||
|
||||
next_maybe_good = up_to + 1;
|
||||
};
|
||||
|
||||
for i_slice in invalid_indices
|
||||
.iter()
|
||||
.cloned()
|
||||
.take_while(|i| i < &data_shard_count)
|
||||
{
|
||||
push_good_up_to(&mut all_data_slices, i_slice);
|
||||
all_data_slices.push(missing_data_slices[i_new_data_slice]);
|
||||
i_new_data_slice += 1;
|
||||
}
|
||||
push_good_up_to(&mut all_data_slices, data_shard_count);
|
||||
|
||||
// Now do the actual computation for the missing
|
||||
// parity shards
|
||||
self.code_some_slices(&matrix_rows, &all_data_slices, &mut missing_parity_slices);
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
}
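The `encode` / `verify` / `reconstruct` trio above follows a fixed contract: shards are ordered data-first then parity, `encode` overwrites the parity slots, and `reconstruct` refuses to proceed unless at least `data_shard_count` shards survive. A minimal, self-contained sketch of that contract using a single XOR parity shard (deliberately not the Galois-field matrix math above, and not this crate's API) makes the flow easy to see:

```rust
// Single-parity analogue of the encode/verify/reconstruct contract:
// the last shard is the XOR of all data shards.

fn encode(shards: &mut [Vec<u8>]) {
    let n = shards.len();
    let (data, parity) = shards.split_at_mut(n - 1);
    let out = &mut parity[0];
    for b in out.iter_mut() {
        *b = 0; // parity slot is overwritten, as in `encode_sep`
    }
    for shard in data.iter() {
        for (o, b) in out.iter_mut().zip(shard) {
            *o ^= b;
        }
    }
}

fn verify(shards: &[Vec<u8>]) -> bool {
    // Recompute parity into a buffer and compare, as `verify_with_buffer` does.
    let mut expected = vec![0u8; shards[0].len()];
    for shard in &shards[..shards.len() - 1] {
        for (e, b) in expected.iter_mut().zip(shard) {
            *e ^= b;
        }
    }
    expected == *shards.last().unwrap()
}

fn reconstruct(shards: &mut [Option<Vec<u8>>]) -> Result<(), &'static str> {
    let missing: Vec<usize> = (0..shards.len()).filter(|&i| shards[i].is_none()).collect();
    if missing.is_empty() {
        return Ok(()); // all shards present: nothing to do
    }
    if missing.len() > 1 {
        return Err("too few shards present"); // like Error::TooFewShardsPresent
    }
    let len = shards.iter().flatten().next().ok_or("no shards")?.len();
    // XOR of all surviving shards rebuilds the missing one.
    let mut rebuilt = vec![0u8; len];
    for s in shards.iter().flatten() {
        for (r, b) in rebuilt.iter_mut().zip(s) {
            *r ^= b;
        }
    }
    shards[missing[0]] = Some(rebuilt);
    Ok(())
}

fn main() {
    let mut shards = vec![vec![1u8, 2, 3], vec![4, 5, 6], vec![0, 0, 0]];
    encode(&mut shards);
    assert!(verify(&shards));

    // Drop one shard and rebuild it in place.
    let mut with_loss: Vec<Option<Vec<u8>>> = shards.iter().cloned().map(Some).collect();
    with_loss[1] = None;
    reconstruct(&mut with_loss).unwrap();
    assert_eq!(with_loss[1].as_deref(), Some(&[4u8, 5, 6][..]));
    println!("ok");
}
```

A single XOR parity tolerates exactly one lost shard; the matrix construction above generalizes this to any `parity_shard_count` losses, which is why reconstruction needs the inverted decode matrix rather than a plain XOR.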
seaweed-volume/vendor/reed-solomon-erasure/src/errors.rs (vendored, new file, 158 lines)
@@ -0,0 +1,158 @@
use core::fmt::Formatter;

#[derive(PartialEq, Debug, Clone, Copy)]
pub enum Error {
    TooFewShards,
    TooManyShards,
    TooFewDataShards,
    TooManyDataShards,
    TooFewParityShards,
    TooManyParityShards,
    TooFewBufferShards,
    TooManyBufferShards,
    IncorrectShardSize,
    TooFewShardsPresent,
    EmptyShard,
    InvalidShardFlags,
    InvalidIndex,
}

impl Error {
    fn to_string(&self) -> &str {
        match *self {
            Error::TooFewShards => "The number of provided shards is smaller than the one in codec",
            Error::TooManyShards => "The number of provided shards is greater than the one in codec",
            Error::TooFewDataShards => "The number of provided data shards is smaller than the one in codec",
            Error::TooManyDataShards => "The number of provided data shards is greater than the one in codec",
            Error::TooFewParityShards => "The number of provided parity shards is smaller than the one in codec",
            Error::TooManyParityShards => "The number of provided parity shards is greater than the one in codec",
            Error::TooFewBufferShards => "The number of provided buffer shards is smaller than the number of parity shards in codec",
            Error::TooManyBufferShards => "The number of provided buffer shards is greater than the number of parity shards in codec",
            Error::IncorrectShardSize => "At least one of the provided shards is not of the correct size",
            Error::TooFewShardsPresent => "The number of shards present is smaller than number of parity shards, cannot reconstruct missing shards",
            Error::EmptyShard => "The first shard provided is of zero length",
            Error::InvalidShardFlags => "The number of flags does not match the total number of shards",
            Error::InvalidIndex => "The data shard index provided is greater or equal to the number of data shards in codec",
        }
    }
}

impl core::fmt::Display for Error {
    fn fmt(&self, f: &mut Formatter) -> Result<(), core::fmt::Error> {
        write!(f, "{}", self.to_string())
    }
}

#[cfg(feature = "std")]
impl std::error::Error for Error {
    fn description(&self) -> &str {
        self.to_string()
    }
}

#[derive(PartialEq, Debug, Clone, Copy)]
pub enum SBSError {
    TooManyCalls,
    LeftoverShards,
    RSError(Error),
}

impl SBSError {
    fn to_string(&self) -> &str {
        match *self {
            SBSError::TooManyCalls => "Too many calls",
            SBSError::LeftoverShards => "Leftover shards",
            SBSError::RSError(ref e) => e.to_string(),
        }
    }
}

impl core::fmt::Display for SBSError {
    fn fmt(&self, f: &mut Formatter) -> Result<(), core::fmt::Error> {
        write!(f, "{}", self.to_string())
    }
}

#[cfg(feature = "std")]
impl std::error::Error for SBSError {
    fn description(&self) -> &str {
        self.to_string()
    }
}

#[cfg(test)]
mod tests {
    use crate::errors::Error;
    use crate::errors::SBSError;

    #[test]
    fn test_error_to_string_is_okay() {
        assert_eq!(
            Error::TooFewShards.to_string(),
            "The number of provided shards is smaller than the one in codec"
        );
        assert_eq!(
            Error::TooManyShards.to_string(),
            "The number of provided shards is greater than the one in codec"
        );
        assert_eq!(
            Error::TooFewDataShards.to_string(),
            "The number of provided data shards is smaller than the one in codec"
        );
        assert_eq!(
            Error::TooManyDataShards.to_string(),
            "The number of provided data shards is greater than the one in codec"
        );
        assert_eq!(
            Error::TooFewParityShards.to_string(),
            "The number of provided parity shards is smaller than the one in codec"
        );
        assert_eq!(
            Error::TooManyParityShards.to_string(),
            "The number of provided parity shards is greater than the one in codec"
        );
        assert_eq!(
            Error::TooFewBufferShards.to_string(),
            "The number of provided buffer shards is smaller than the number of parity shards in codec"
        );
        assert_eq!(
            Error::TooManyBufferShards.to_string(),
            "The number of provided buffer shards is greater than the number of parity shards in codec"
        );
        assert_eq!(
            Error::IncorrectShardSize.to_string(),
            "At least one of the provided shards is not of the correct size"
        );
        assert_eq!(Error::TooFewShardsPresent.to_string(), "The number of shards present is smaller than number of parity shards, cannot reconstruct missing shards");
        assert_eq!(
            Error::EmptyShard.to_string(),
            "The first shard provided is of zero length"
        );
        assert_eq!(
            Error::InvalidShardFlags.to_string(),
            "The number of flags does not match the total number of shards"
        );
        assert_eq!(
            Error::InvalidIndex.to_string(),
            "The data shard index provided is greater or equal to the number of data shards in codec"
        );
    }

    #[test]
    fn test_sbserror_to_string_is_okay() {
        assert_eq!(SBSError::TooManyCalls.to_string(), "Too many calls");
        assert_eq!(SBSError::LeftoverShards.to_string(), "Leftover shards");
    }

    #[cfg(feature = "std")]
    #[test]
    fn test_error_display_does_not_panic() {
        println!("{}", Error::TooFewShards);
    }

    #[cfg(feature = "std")]
    #[test]
    fn test_sbserror_display_does_not_panic() {
        println!("{}", SBSError::TooManyCalls);
    }
}
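One detail worth noting in `errors.rs`: the inherent `to_string(&self) -> &str` shadows the blanket `ToString::to_string`, so messages come back as `&'static str` without allocating, while the `Display` impl routes `format!` through the same match. A tiny self-contained replica of the pattern (with a hypothetical `DemoError`, not the vendored type):

```rust
use core::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
enum DemoError {
    TooFewShards,
    TooManyShards,
}

impl DemoError {
    // Inherent method: shadows `ToString::to_string` and returns a
    // static message without allocating.
    fn to_string(&self) -> &str {
        match *self {
            DemoError::TooFewShards => "too few shards",
            DemoError::TooManyShards => "too many shards",
        }
    }
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // `Display` reuses the inherent message; `format!` allocates once here.
        write!(f, "{}", self.to_string())
    }
}

fn main() {
    assert_eq!(DemoError::TooFewShards.to_string(), "too few shards");
    assert_eq!(format!("{}", DemoError::TooManyShards), "too many shards");
    println!("ok");
}
```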
seaweed-volume/vendor/reed-solomon-erasure/src/galois_16.rs (vendored, new file, 412 lines)
@@ -0,0 +1,412 @@
//! GF(2^16) implementation.
//!
//! More accurately, this is a `GF((2^8)^2)` implementation which builds an extension
//! field of `GF(2^8)`, as defined in the `galois_8` module.

use crate::galois_8;
use core::ops::{Add, Div, Mul, Sub};

// the irreducible polynomial used as a modulus for the field.
// print R.irreducible_element(2,algorithm="first_lexicographic" )
// x^2 + a*x + a^7
//
// hopefully it is a fast polynomial
const EXT_POLY: [u8; 3] = [1, 2, 128];

/// The field GF(2^16).
#[derive(Debug, Default, Copy, Clone, PartialEq, Eq)]
pub struct Field;

impl crate::Field for Field {
    const ORDER: usize = 65536;

    type Elem = [u8; 2];

    fn add(a: [u8; 2], b: [u8; 2]) -> [u8; 2] {
        (Element(a) + Element(b)).0
    }

    fn mul(a: [u8; 2], b: [u8; 2]) -> [u8; 2] {
        (Element(a) * Element(b)).0
    }

    fn div(a: [u8; 2], b: [u8; 2]) -> [u8; 2] {
        (Element(a) / Element(b)).0
    }

    fn exp(elem: [u8; 2], n: usize) -> [u8; 2] {
        Element(elem).exp(n).0
    }

    fn zero() -> [u8; 2] {
        [0; 2]
    }

    fn one() -> [u8; 2] {
        [0, 1]
    }

    fn nth_internal(n: usize) -> [u8; 2] {
        [(n >> 8) as u8, n as u8]
    }
}

/// Type alias of ReedSolomon over GF(2^16).
pub type ReedSolomon = crate::ReedSolomon<Field>;

/// Type alias of ShardByShard over GF(2^16).
pub type ShardByShard<'a> = crate::ShardByShard<'a, Field>;

/// An element of `GF(2^16)`.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
struct Element(pub [u8; 2]);

impl Element {
    // Create the zero element.
    fn zero() -> Self {
        Element([0, 0])
    }

    // A constant element evaluating to `n`.
    fn constant(n: u8) -> Element {
        Element([0, n])
    }

    // Whether this is the zero element.
    fn is_zero(&self) -> bool {
        self.0 == [0; 2]
    }

    fn exp(mut self, n: usize) -> Element {
        if n == 0 {
            Element::constant(1)
        } else if self == Element::zero() {
            Element::zero()
        } else {
            let x = self;
            for _ in 1..n {
                self = self * x;
            }

            self
        }
    }

    // reduces from some polynomial with degree <= 2.
    #[inline]
    fn reduce_from(mut x: [u8; 3]) -> Self {
        if x[0] != 0 {
            // divide x by EXT_POLY and use remainder.
            // i = 0 here.
            // c*x^(i+j) = a*x^i*b*x^j
            x[1] ^= galois_8::mul(EXT_POLY[1], x[0]);
            x[2] ^= galois_8::mul(EXT_POLY[2], x[0]);
        }

        Element([x[1], x[2]])
    }

    fn degree(&self) -> usize {
        if self.0[0] != 0 {
            1
        } else {
            0
        }
    }
}

impl From<[u8; 2]> for Element {
    fn from(c: [u8; 2]) -> Self {
        Element(c)
    }
}

impl Default for Element {
    fn default() -> Self {
        Element::zero()
    }
}

impl Add for Element {
    type Output = Element;

    fn add(self, other: Self) -> Element {
        Element([self.0[0] ^ other.0[0], self.0[1] ^ other.0[1]])
    }
}

impl Sub for Element {
    type Output = Element;

    fn sub(self, other: Self) -> Element {
        self.add(other)
    }
}

impl Mul for Element {
    type Output = Element;

    fn mul(self, rhs: Self) -> Element {
        // FOIL; our elements are linear at most, with two coefficients
        let out: [u8; 3] = [
            galois_8::mul(self.0[0], rhs.0[0]),
            galois_8::add(
                galois_8::mul(self.0[1], rhs.0[0]),
                galois_8::mul(self.0[0], rhs.0[1]),
            ),
            galois_8::mul(self.0[1], rhs.0[1]),
        ];

        Element::reduce_from(out)
    }
}

impl Mul<u8> for Element {
    type Output = Element;

    fn mul(self, rhs: u8) -> Element {
        Element([galois_8::mul(rhs, self.0[0]), galois_8::mul(rhs, self.0[1])])
    }
}

impl Div for Element {
    type Output = Element;

    fn div(self, rhs: Self) -> Element {
        self * rhs.inverse()
    }
}

// helpers for division.

#[derive(Debug)]
enum EgcdRhs {
    Element(Element),
    ExtPoly,
}

impl Element {
    // compute extended euclidean algorithm against an element of self,
    // where the GCD is known to be constant.
    fn const_egcd(self, rhs: EgcdRhs) -> (u8, Element, Element) {
        if self.is_zero() {
            let rhs = match rhs {
                EgcdRhs::Element(elem) => elem,
                EgcdRhs::ExtPoly => panic!("const_egcd invoked with divisible"),
            };
            (rhs.0[1], Element::constant(0), Element::constant(1))
        } else {
            let (cur_quotient, cur_remainder) = match rhs {
                EgcdRhs::Element(rhs) => rhs.polynom_div(self),
                EgcdRhs::ExtPoly => Element::div_ext_by(self),
            };

            // GCD is constant because EXT_POLY is irreducible
            let (g, x, y) = cur_remainder.const_egcd(EgcdRhs::Element(self));
            (g, y + (cur_quotient * x), x)
        }
    }

    // divide EXT_POLY by self.
    fn div_ext_by(rhs: Self) -> (Element, Element) {
        if rhs.degree() == 0 {
            // dividing by constant is the same as multiplying by another constant.
            // and all constant multiples of EXT_POLY are in the equivalence class
            // of 0.
            return (Element::zero(), Element::zero());
        }

        // divisor is ensured linear here.
        // now ensure divisor is monic.
        let leading_mul_inv = galois_8::div(1, rhs.0[0]);

        let monictized = rhs * leading_mul_inv;
        let mut poly = EXT_POLY;

        for i in 0..2 {
            let coef = poly[i];
            for j in 1..2 {
                if rhs.0[j] != 0 {
                    poly[i + j] ^= galois_8::mul(monictized.0[j], coef);
                }
            }
        }

        let remainder = Element::constant(poly[2]);
        let quotient = Element([poly[0], poly[1]]) * leading_mul_inv;

        (quotient, remainder)
    }

    fn polynom_div(self, rhs: Self) -> (Element, Element) {
        let divisor_degree = rhs.degree();
        if rhs.is_zero() {
            panic!("divide by 0");
        } else if self.degree() < divisor_degree {
            // If divisor's degree (len-1) is bigger, all dividend is a remainder
            (Element::zero(), self)
        } else if divisor_degree == 0 {
            // divide by constant.
            let invert = galois_8::div(1, rhs.0[1]);
            let quotient = Element([
                galois_8::mul(invert, self.0[0]),
                galois_8::mul(invert, self.0[1]),
            ]);

            (quotient, Element::zero())
        } else {
            // self degree is at least divisor degree, divisor degree not 0.
            // therefore both are 1.
            debug_assert_eq!(self.degree(), divisor_degree);
            debug_assert_eq!(self.degree(), 1);

            // ensure rhs is monic.
            let leading_mul_inv = galois_8::div(1, rhs.0[0]);
            let monic = Element([
                galois_8::mul(leading_mul_inv, rhs.0[0]),
                galois_8::mul(leading_mul_inv, rhs.0[1]),
            ]);

            let leading_coeff = self.0[0];
            let mut remainder = self.0[1];

            if monic.0[1] != 0 {
                remainder ^= galois_8::mul(monic.0[1], self.0[0]);
            }

            (
                Element::constant(galois_8::mul(leading_mul_inv, leading_coeff)),
                Element::constant(remainder),
            )
        }
    }

    /// Computes the inverse of this field element. Panics if zero.
    fn inverse(self) -> Element {
        if self.is_zero() {
            panic!("Cannot invert 0");
        }

        // first step of extended euclidean algorithm.
        // done here because EXT_POLY is outside the scope of `Element`.
        let (gcd, y) = {
            // self / EXT_POLY = (0, self)
            let remainder = self;

            // GCD is constant because EXT_POLY is irreducible
            let (g, x, _) = remainder.const_egcd(EgcdRhs::ExtPoly);

            (g, x)
        };

        // we still need to normalize it by dividing by the gcd
        if gcd != 0 {
            // EXT_POLY is irreducible so the GCD will always be constant.
            // EXT_POLY*x + self*y = gcd
            // self*y = gcd - EXT_POLY*x
            //
            // EXT_POLY*x is representative of the equivalence class of 0.
            let normalizer = galois_8::div(1, gcd);
            y * normalizer
        } else {
            // self is equivalent to zero.
            panic!("Cannot invert 0");
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use quickcheck::Arbitrary;

    impl Arbitrary for Element {
        fn arbitrary<G: quickcheck::Gen>(gen: &mut G) -> Self {
            let a = u8::arbitrary(gen);
            let b = u8::arbitrary(gen);

            Element([a, b])
        }
    }

    quickcheck! {
        fn qc_add_associativity(a: Element, b: Element, c: Element) -> bool {
            a + (b + c) == (a + b) + c
        }

        fn qc_mul_associativity(a: Element, b: Element, c: Element) -> bool {
            a * (b * c) == (a * b) * c
        }

        fn qc_additive_identity(a: Element) -> bool {
            let zero = Element::zero();
            a - (zero - a) == zero
        }

        fn qc_multiplicative_identity(a: Element) -> bool {
            a.is_zero() || {
                let one = Element([0, 1]);
                (one / a) * a == one
            }
        }

        fn qc_add_commutativity(a: Element, b: Element) -> bool {
            a + b == b + a
        }

        fn qc_mul_commutativity(a: Element, b: Element) -> bool {
            a * b == b * a
        }

        fn qc_add_distributivity(a: Element, b: Element, c: Element) -> bool {
            a * (b + c) == (a * b) + (a * c)
        }

        fn qc_inverse(a: Element) -> bool {
            a.is_zero() || {
                let inv = a.inverse();
                a * inv == Element::constant(1)
            }
        }

        fn qc_exponent_1(a: Element, n: u8) -> bool {
            a.is_zero() || n == 0 || {
                let mut b = a.exp(n as usize);
                for _ in 1..n {
                    b = b / a;
                }

                a == b
            }
        }

        fn qc_exponent_2(a: Element, n: u8) -> bool {
            a.is_zero() || {
                let mut res = true;
                let mut b = Element::constant(1);

                for i in 0..n {
                    res = res && b == a.exp(i as usize);
                    b = b * a;
                }

                res
            }
        }

        fn qc_exp_zero_is_one(a: Element) -> bool {
            a.exp(0) == Element::constant(1)
        }
    }

    #[test]
    #[should_panic]
    fn test_div_b_is_0() {
        let _ = Element([1, 0]) / Element::zero();
    }

    #[test]
    fn zero_to_zero_is_one() {
        assert_eq!(Element::zero().exp(0), Element::constant(1))
    }
}
|
||||
seaweed-volume/vendor/reed-solomon-erasure/src/galois_8.rs (vendored, new file, 621 lines)
@@ -0,0 +1,621 @@
//! Implementation of GF(2^8): the finite field with 2^8 elements.

include!(concat!(env!("OUT_DIR"), "/table.rs"));

/// The field GF(2^8).
#[derive(Debug, Default, Copy, Clone, PartialEq, Eq)]
pub struct Field;

impl crate::Field for Field {
    const ORDER: usize = 256;
    type Elem = u8;

    fn add(a: u8, b: u8) -> u8 {
        add(a, b)
    }

    fn mul(a: u8, b: u8) -> u8 {
        mul(a, b)
    }

    fn div(a: u8, b: u8) -> u8 {
        div(a, b)
    }

    fn exp(elem: u8, n: usize) -> u8 {
        exp(elem, n)
    }

    fn zero() -> u8 {
        0
    }

    fn one() -> u8 {
        1
    }

    fn nth_internal(n: usize) -> u8 {
        n as u8
    }

    fn mul_slice(c: u8, input: &[u8], out: &mut [u8]) {
        mul_slice(c, input, out)
    }

    fn mul_slice_add(c: u8, input: &[u8], out: &mut [u8]) {
        mul_slice_xor(c, input, out)
    }
}

/// Type alias of ReedSolomon over GF(2^8).
pub type ReedSolomon = crate::ReedSolomon<Field>;

/// Type alias of ShardByShard over GF(2^8).
pub type ShardByShard<'a> = crate::ShardByShard<'a, Field>;

/// Add two elements.
pub fn add(a: u8, b: u8) -> u8 {
    a ^ b
}

/// Subtract `b` from `a`.
#[cfg(test)]
pub fn sub(a: u8, b: u8) -> u8 {
    a ^ b
}

/// Multiply two elements.
pub fn mul(a: u8, b: u8) -> u8 {
    MUL_TABLE[a as usize][b as usize]
}

/// Divide one element by another. `b`, the divisor, may not be 0.
pub fn div(a: u8, b: u8) -> u8 {
    if a == 0 {
        0
    } else if b == 0 {
        panic!("Divisor is 0")
    } else {
        let log_a = LOG_TABLE[a as usize];
        let log_b = LOG_TABLE[b as usize];
        let mut log_result = log_a as isize - log_b as isize;
        if log_result < 0 {
            log_result += 255;
        }
        EXP_TABLE[log_result as usize]
    }
}

/// Compute a^n.
pub fn exp(a: u8, n: usize) -> u8 {
    if n == 0 {
        1
    } else if a == 0 {
        0
    } else {
        let log_a = LOG_TABLE[a as usize];
        let mut log_result = log_a as usize * n;
        while 255 <= log_result {
            log_result -= 255;
        }
        EXP_TABLE[log_result]
    }
}

const PURE_RUST_UNROLL: isize = 4;

macro_rules! return_if_empty {
    (
        $len:expr
    ) => {
        if $len == 0 {
            return;
        }
    };
}

#[cfg(not(all(
    feature = "simd-accel",
    any(target_arch = "x86_64", target_arch = "aarch64"),
    not(target_env = "msvc"),
    not(any(target_os = "android", target_os = "ios"))
)))]
pub fn mul_slice(c: u8, input: &[u8], out: &mut [u8]) {
    mul_slice_pure_rust(c, input, out);
}

#[cfg(not(all(
    feature = "simd-accel",
    any(target_arch = "x86_64", target_arch = "aarch64"),
    not(target_env = "msvc"),
    not(any(target_os = "android", target_os = "ios"))
)))]
pub fn mul_slice_xor(c: u8, input: &[u8], out: &mut [u8]) {
    mul_slice_xor_pure_rust(c, input, out);
}

fn mul_slice_pure_rust(c: u8, input: &[u8], out: &mut [u8]) {
    let mt = &MUL_TABLE[c as usize];
    let mt_ptr: *const u8 = &mt[0];

    assert_eq!(input.len(), out.len());

    let len: isize = input.len() as isize;
    return_if_empty!(len);

    let mut input_ptr: *const u8 = &input[0];
    let mut out_ptr: *mut u8 = &mut out[0];

    let mut n: isize = 0;
    unsafe {
        assert_eq!(4, PURE_RUST_UNROLL);
        if len > PURE_RUST_UNROLL {
            let len_minus_unroll = len - PURE_RUST_UNROLL;
            while n < len_minus_unroll {
                *out_ptr = *mt_ptr.offset(*input_ptr as isize);
                *out_ptr.offset(1) = *mt_ptr.offset(*input_ptr.offset(1) as isize);
                *out_ptr.offset(2) = *mt_ptr.offset(*input_ptr.offset(2) as isize);
                *out_ptr.offset(3) = *mt_ptr.offset(*input_ptr.offset(3) as isize);

                input_ptr = input_ptr.offset(PURE_RUST_UNROLL);
                out_ptr = out_ptr.offset(PURE_RUST_UNROLL);
                n += PURE_RUST_UNROLL;
            }
        }
        while n < len {
            *out_ptr = *mt_ptr.offset(*input_ptr as isize);

            input_ptr = input_ptr.offset(1);
            out_ptr = out_ptr.offset(1);
            n += 1;
        }
    }
    /* for n in 0..input.len() {
     *     out[n] = mt[input[n] as usize]
     * }
     */
}

fn mul_slice_xor_pure_rust(c: u8, input: &[u8], out: &mut [u8]) {
    let mt = &MUL_TABLE[c as usize];
    let mt_ptr: *const u8 = &mt[0];

    assert_eq!(input.len(), out.len());

    let len: isize = input.len() as isize;
    return_if_empty!(len);

    let mut input_ptr: *const u8 = &input[0];
    let mut out_ptr: *mut u8 = &mut out[0];

    let mut n: isize = 0;
    unsafe {
        assert_eq!(4, PURE_RUST_UNROLL);
        if len > PURE_RUST_UNROLL {
            let len_minus_unroll = len - PURE_RUST_UNROLL;
            while n < len_minus_unroll {
                *out_ptr ^= *mt_ptr.offset(*input_ptr as isize);
                *out_ptr.offset(1) ^= *mt_ptr.offset(*input_ptr.offset(1) as isize);
                *out_ptr.offset(2) ^= *mt_ptr.offset(*input_ptr.offset(2) as isize);
                *out_ptr.offset(3) ^= *mt_ptr.offset(*input_ptr.offset(3) as isize);

                input_ptr = input_ptr.offset(PURE_RUST_UNROLL);
                out_ptr = out_ptr.offset(PURE_RUST_UNROLL);
                n += PURE_RUST_UNROLL;
            }
        }
        while n < len {
            *out_ptr ^= *mt_ptr.offset(*input_ptr as isize);

            input_ptr = input_ptr.offset(1);
            out_ptr = out_ptr.offset(1);
            n += 1;
        }
    }
    /* for n in 0..input.len() {
     *     out[n] ^= mt[input[n] as usize];
     * }
     */
}

#[cfg(test)]
fn slice_xor(input: &[u8], out: &mut [u8]) {
    assert_eq!(input.len(), out.len());

    let len: isize = input.len() as isize;
    return_if_empty!(len);

    let mut input_ptr: *const u8 = &input[0];
    let mut out_ptr: *mut u8 = &mut out[0];

    let mut n: isize = 0;
    unsafe {
        assert_eq!(4, PURE_RUST_UNROLL);
        if len > PURE_RUST_UNROLL {
            let len_minus_unroll = len - PURE_RUST_UNROLL;
            while n < len_minus_unroll {
                *out_ptr ^= *input_ptr;
                *out_ptr.offset(1) ^= *input_ptr.offset(1);
                *out_ptr.offset(2) ^= *input_ptr.offset(2);
                *out_ptr.offset(3) ^= *input_ptr.offset(3);

                input_ptr = input_ptr.offset(PURE_RUST_UNROLL);
                out_ptr = out_ptr.offset(PURE_RUST_UNROLL);
                n += PURE_RUST_UNROLL;
            }
        }
        while n < len {
            *out_ptr ^= *input_ptr;

            input_ptr = input_ptr.offset(1);
            out_ptr = out_ptr.offset(1);
            n += 1;
        }
    }
    /* for n in 0..input.len() {
     *     out[n] ^= input[n]
     * }
     */
}

#[cfg(all(
    feature = "simd-accel",
    any(target_arch = "x86_64", target_arch = "aarch64"),
    not(target_env = "msvc"),
    not(any(target_os = "android", target_os = "ios"))
))]
extern "C" {
    fn reedsolomon_gal_mul(
        low: *const u8,
        high: *const u8,
        input: *const u8,
        out: *mut u8,
        len: libc::size_t,
    ) -> libc::size_t;

    fn reedsolomon_gal_mul_xor(
        low: *const u8,
        high: *const u8,
        input: *const u8,
        out: *mut u8,
        len: libc::size_t,
    ) -> libc::size_t;
}

#[cfg(all(
    feature = "simd-accel",
    any(target_arch = "x86_64", target_arch = "aarch64"),
    not(target_env = "msvc"),
    not(any(target_os = "android", target_os = "ios"))
))]
pub fn mul_slice(c: u8, input: &[u8], out: &mut [u8]) {
    let low: *const u8 = &MUL_TABLE_LOW[c as usize][0];
    let high: *const u8 = &MUL_TABLE_HIGH[c as usize][0];

    assert_eq!(input.len(), out.len());

    let input_ptr: *const u8 = &input[0];
    let out_ptr: *mut u8 = &mut out[0];
    let size: libc::size_t = input.len();

    let bytes_done: usize =
        unsafe { reedsolomon_gal_mul(low, high, input_ptr, out_ptr, size) as usize };

    mul_slice_pure_rust(c, &input[bytes_done..], &mut out[bytes_done..]);
}

#[cfg(all(
    feature = "simd-accel",
    any(target_arch = "x86_64", target_arch = "aarch64"),
    not(target_env = "msvc"),
    not(any(target_os = "android", target_os = "ios"))
))]
pub fn mul_slice_xor(c: u8, input: &[u8], out: &mut [u8]) {
    let low: *const u8 = &MUL_TABLE_LOW[c as usize][0];
    let high: *const u8 = &MUL_TABLE_HIGH[c as usize][0];

    assert_eq!(input.len(), out.len());

    let input_ptr: *const u8 = &input[0];
    let out_ptr: *mut u8 = &mut out[0];
    let size: libc::size_t = input.len();

    let bytes_done: usize =
        unsafe { reedsolomon_gal_mul_xor(low, high, input_ptr, out_ptr, size) as usize };

    mul_slice_xor_pure_rust(c, &input[bytes_done..], &mut out[bytes_done..]);
}

#[cfg(test)]
mod tests {
    extern crate alloc;

    use alloc::vec;

    use super::*;
    use crate::tests::fill_random;
    use rand;

    static BACKBLAZE_LOG_TABLE: [u8; 256] = [
        //-1, 0, 1, 25, 2, 50, 26, 198,
        // first value is changed from -1 to 0
        0, 0, 1, 25, 2, 50, 26, 198, 3, 223, 51, 238, 27, 104, 199, 75, 4, 100, 224, 14, 52, 141,
        239, 129, 28, 193, 105, 248, 200, 8, 76, 113, 5, 138, 101, 47, 225, 36, 15, 33, 53, 147,
        142, 218, 240, 18, 130, 69, 29, 181, 194, 125, 106, 39, 249, 185, 201, 154, 9, 120, 77,
        228, 114, 166, 6, 191, 139, 98, 102, 221, 48, 253, 226, 152, 37, 179, 16, 145, 34, 136, 54,
        208, 148, 206, 143, 150, 219, 189, 241, 210, 19, 92, 131, 56, 70, 64, 30, 66, 182, 163,
        195, 72, 126, 110, 107, 58, 40, 84, 250, 133, 186, 61, 202, 94, 155, 159, 10, 21, 121, 43,
        78, 212, 229, 172, 115, 243, 167, 87, 7, 112, 192, 247, 140, 128, 99, 13, 103, 74, 222,
        237, 49, 197, 254, 24, 227, 165, 153, 119, 38, 184, 180, 124, 17, 68, 146, 217, 35, 32,
        137, 46, 55, 63, 209, 91, 149, 188, 207, 205, 144, 135, 151, 178, 220, 252, 190, 97, 242,
        86, 211, 171, 20, 42, 93, 158, 132, 60, 57, 83, 71, 109, 65, 162, 31, 45, 67, 216, 183,
        123, 164, 118, 196, 23, 73, 236, 127, 12, 111, 246, 108, 161, 59, 82, 41, 157, 85, 170,
        251, 96, 134, 177, 187, 204, 62, 90, 203, 89, 95, 176, 156, 169, 160, 81, 11, 245, 22, 235,
        122, 117, 44, 215, 79, 174, 213, 233, 230, 231, 173, 232, 116, 214, 244, 234, 168, 80, 88,
        175,
    ];

    #[test]
    fn log_table_same_as_backblaze() {
        for i in 0..256 {
            assert_eq!(LOG_TABLE[i], BACKBLAZE_LOG_TABLE[i]);
        }
    }

    #[test]
    fn test_associativity() {
        for a in 0..256 {
            let a = a as u8;
            for b in 0..256 {
                let b = b as u8;
                for c in 0..256 {
                    let c = c as u8;
                    let x = add(a, add(b, c));
                    let y = add(add(a, b), c);
                    assert_eq!(x, y);
                    let x = mul(a, mul(b, c));
                    let y = mul(mul(a, b), c);
                    assert_eq!(x, y);
                }
            }
        }
    }

    quickcheck! {
        fn qc_add_associativity(a: u8, b: u8, c: u8) -> bool {
            add(a, add(b, c)) == add(add(a, b), c)
        }

        fn qc_mul_associativity(a: u8, b: u8, c: u8) -> bool {
            mul(a, mul(b, c)) == mul(mul(a, b), c)
        }
    }

    #[test]
    fn test_identity() {
        for a in 0..256 {
            let a = a as u8;
            let b = sub(0, a);
            let c = sub(a, b);
            assert_eq!(c, 0);
            if a != 0 {
                let b = div(1, a);
                let c = mul(a, b);
                assert_eq!(c, 1);
            }
        }
    }

    quickcheck! {
        fn qc_additive_identity(a: u8) -> bool {
            sub(a, sub(0, a)) == 0
        }

        fn qc_multiplicative_identity(a: u8) -> bool {
            if a == 0 { true }
            else { mul(a, div(1, a)) == 1 }
        }
    }

    #[test]
    fn test_commutativity() {
        for a in 0..256 {
            let a = a as u8;
            for b in 0..256 {
                let b = b as u8;
                let x = add(a, b);
                let y = add(b, a);
                assert_eq!(x, y);
                let x = mul(a, b);
                let y = mul(b, a);
                assert_eq!(x, y);
            }
        }
    }

    quickcheck! {
        fn qc_add_commutativity(a: u8, b: u8) -> bool {
            add(a, b) == add(b, a)
        }

        fn qc_mul_commutativity(a: u8, b: u8) -> bool {
            mul(a, b) == mul(b, a)
        }
    }

    #[test]
    fn test_distributivity() {
        for a in 0..256 {
            let a = a as u8;
            for b in 0..256 {
                let b = b as u8;
                for c in 0..256 {
                    let c = c as u8;
                    let x = mul(a, add(b, c));
                    let y = add(mul(a, b), mul(a, c));
                    assert_eq!(x, y);
                }
            }
        }
    }

    quickcheck! {
        fn qc_add_distributivity(a: u8, b: u8, c: u8) -> bool {
            mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
        }
    }

    #[test]
    fn test_exp() {
        for a in 0..256 {
            let a = a as u8;
            let mut power = 1u8;
            for j in 0..256 {
                let x = exp(a, j);
                assert_eq!(x, power);
                power = mul(power, a);
            }
        }
    }

    #[test]
    fn test_galois() {
        assert_eq!(mul(3, 4), 12);
        assert_eq!(mul(7, 7), 21);
        assert_eq!(mul(23, 45), 41);

        let input = [
            0, 1, 2, 3, 4, 5, 6, 10, 50, 100, 150, 174, 201, 255, 99, 32, 67, 85, 200, 199, 198,
            197, 196, 195, 194, 193, 192, 191, 190, 189, 188, 187, 186, 185,
        ];
        let mut output1 = vec![0; input.len()];
        let mut output2 = vec![0; input.len()];
        mul_slice(25, &input, &mut output1);
        let expect = [
            0x0, 0x19, 0x32, 0x2b, 0x64, 0x7d, 0x56, 0xfa, 0xb8, 0x6d, 0xc7, 0x85, 0xc3, 0x1f,
            0x22, 0x7, 0x25, 0xfe, 0xda, 0x5d, 0x44, 0x6f, 0x76, 0x39, 0x20, 0xb, 0x12, 0x11, 0x8,
            0x23, 0x3a, 0x75, 0x6c, 0x47,
        ];
        for i in 0..input.len() {
            assert_eq!(expect[i], output1[i]);
        }
        mul_slice(25, &input, &mut output2);
        for i in 0..input.len() {
            assert_eq!(expect[i], output2[i]);
        }

        let expect_xor = [
            0x0, 0x2d, 0x5a, 0x77, 0xb4, 0x99, 0xee, 0x2f, 0x79, 0xf2, 0x7, 0x51, 0xd4, 0x19, 0x31,
            0xc9, 0xf8, 0xfc, 0xf9, 0x4f, 0x62, 0x15, 0x38, 0xfb, 0xd6, 0xa1, 0x8c, 0x96, 0xbb,
            0xcc, 0xe1, 0x22, 0xf, 0x78,
        ];
        mul_slice_xor(52, &input, &mut output1);
        for i in 0..input.len() {
            assert_eq!(expect_xor[i], output1[i]);
        }
        mul_slice_xor(52, &input, &mut output2);
        for i in 0..input.len() {
            assert_eq!(expect_xor[i], output2[i]);
        }

        let expect = [
            0x0, 0xb1, 0x7f, 0xce, 0xfe, 0x4f, 0x81, 0x9e, 0x3, 0x6, 0xe8, 0x75, 0xbd, 0x40, 0x36,
            0xa3, 0x95, 0xcb, 0xc, 0xdd, 0x6c, 0xa2, 0x13, 0x23, 0x92, 0x5c, 0xed, 0x1b, 0xaa,
            0x64, 0xd5, 0xe5, 0x54, 0x9a,
        ];
        mul_slice(177, &input, &mut output1);
        for i in 0..input.len() {
            assert_eq!(expect[i], output1[i]);
        }
        mul_slice(177, &input, &mut output2);
        for i in 0..input.len() {
            assert_eq!(expect[i], output2[i]);
        }

        let expect_xor = [
            0x0, 0xc4, 0x95, 0x51, 0x37, 0xf3, 0xa2, 0xfb, 0xec, 0xc5, 0xd0, 0xc7, 0x53, 0x88,
            0xa3, 0xa5, 0x6, 0x78, 0x97, 0x9f, 0x5b, 0xa, 0xce, 0xa8, 0x6c, 0x3d, 0xf9, 0xdf, 0x1b,
            0x4a, 0x8e, 0xe8, 0x2c, 0x7d,
        ];
        mul_slice_xor(117, &input, &mut output1);
        for i in 0..input.len() {
            assert_eq!(expect_xor[i], output1[i]);
        }
        mul_slice_xor(117, &input, &mut output2);
        for i in 0..input.len() {
            assert_eq!(expect_xor[i], output2[i]);
        }

        assert_eq!(exp(2, 2), 4);
        assert_eq!(exp(5, 20), 235);
        assert_eq!(exp(13, 7), 43);
    }

    #[test]
    fn test_slice_add() {
        let length_list = [16, 32, 34];
        for len in length_list.iter() {
            let mut input = vec![0; *len];
            fill_random(&mut input);
            let mut output = vec![0; *len];
            fill_random(&mut output);
            let mut expect = vec![0; *len];
            for i in 0..expect.len() {
                expect[i] = input[i] ^ output[i];
            }
            slice_xor(&input, &mut output);
            for i in 0..expect.len() {
                assert_eq!(expect[i], output[i]);
            }
            fill_random(&mut output);
            for i in 0..expect.len() {
                expect[i] = input[i] ^ output[i];
            }
            slice_xor(&input, &mut output);
            for i in 0..expect.len() {
                assert_eq!(expect[i], output[i]);
            }
        }
    }

    #[test]
    fn test_div_a_is_0() {
        assert_eq!(0, div(0, 100));
    }

    #[test]
    #[should_panic]
    fn test_div_b_is_0() {
        div(1, 0);
    }

    #[test]
    fn test_same_as_maybe_ffi() {
        let len = 10_003;
        for _ in 0..100 {
            let c = rand::random::<u8>();
            let mut input = vec![0; len];
            fill_random(&mut input);
            {
                let mut output = vec![0; len];
                fill_random(&mut output);
                let mut output_copy = output.clone();

                mul_slice(c, &input, &mut output);
                mul_slice(c, &input, &mut output_copy);

                assert_eq!(output, output_copy);
            }
            {
                let mut output = vec![0; len];
                fill_random(&mut output);
                let mut output_copy = output.clone();

                mul_slice_xor(c, &input, &mut output);
                mul_slice_xor(c, &input, &mut output_copy);

                assert_eq!(output, output_copy);
            }
        }
    }
}
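The identities checked by the vendored `test_galois` above (e.g. `mul(3, 4) == 12`) follow from carry-less polynomial multiplication reduced by the field polynomial 0x11D, which the generated `MUL_TABLE`/`LOG_TABLE`/`EXP_TABLE` encode. A minimal table-free sketch for illustration; the `gf_mul` helper is hypothetical and not part of the crate:

```rust
/// Bitwise GF(2^8) multiply: carry-less multiplication reduced by
/// x^8 + x^4 + x^3 + x^2 + 1 (0x11D), the polynomial behind the
/// vendored lookup tables. Illustration only, not the crate's API.
fn gf_mul(a: u8, b: u8) -> u8 {
    let mut a = a as u16;
    let mut b = b;
    let mut r: u16 = 0;
    while b != 0 {
        if b & 1 != 0 {
            r ^= a; // "add" (XOR) the current shifted multiplicand
        }
        b >>= 1;
        a <<= 1;
        if a & 0x100 != 0 {
            a ^= 0x11d; // reduce modulo the field polynomial
        }
    }
    r as u8
}

fn main() {
    // Same identities the vendored test_galois test checks.
    assert_eq!(gf_mul(3, 4), 12);
    assert_eq!(gf_mul(7, 7), 21);
    assert_eq!(gf_mul(23, 45), 41);
}
```

The table-based `mul` in the file is just a memoized form of this loop; the log/exp tables additionally turn division into a subtraction of discrete logarithms modulo 255.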
seaweed-volume/vendor/reed-solomon-erasure/src/lib.rs (vendored, new file, 200 lines)
@@ -0,0 +1,200 @@
//! This crate provides an encoder/decoder for Reed-Solomon erasure code.
//!
//! Please note that erasure coding means errors are not directly detected or corrected,
//! but missing data pieces (shards) can be reconstructed given that
//! the configuration provides high enough redundancy.
//!
//! You will have to implement error detection separately (e.g. via checksums)
//! and simply leave out the corrupted shards when attempting to reconstruct
//! the missing data.
#![allow(dead_code)]
#![cfg_attr(not(feature = "std"), no_std)]

#[cfg(test)]
#[macro_use]
extern crate quickcheck;

#[cfg(test)]
extern crate rand;

extern crate smallvec;

#[cfg(feature = "simd-accel")]
extern crate libc;

use ::core::iter;
use ::core::iter::FromIterator;

#[macro_use]
mod macros;

mod core;
mod errors;
mod matrix;

#[cfg(test)]
mod tests;

pub mod galois_16;
pub mod galois_8;

pub use crate::errors::Error;
pub use crate::errors::SBSError;

pub use crate::core::ReedSolomon;
pub use crate::core::ShardByShard;

// TODO: Can be simplified once https://github.com/rust-lang/rfcs/issues/2505 is resolved
#[cfg(not(feature = "std"))]
use libm::log2f as log2;
#[cfg(feature = "std")]
fn log2(n: f32) -> f32 {
    n.log2()
}

/// A finite field to perform encoding over.
pub trait Field: Sized {
    /// The order of the field. This is a limit on the number of shards
    /// in an encoding.
    const ORDER: usize;

    /// The representational type of the field.
    type Elem: Default + Clone + Copy + PartialEq + ::core::fmt::Debug;

    /// Add two elements together.
    fn add(a: Self::Elem, b: Self::Elem) -> Self::Elem;

    /// Multiply two elements together.
    fn mul(a: Self::Elem, b: Self::Elem) -> Self::Elem;

    /// Divide a by b. Panics if b is zero.
    fn div(a: Self::Elem, b: Self::Elem) -> Self::Elem;

    /// Raise `a` to the n'th power.
    fn exp(a: Self::Elem, n: usize) -> Self::Elem;

    /// The "zero" element or additive identity.
    fn zero() -> Self::Elem;

    /// The "one" element or multiplicative identity.
    fn one() -> Self::Elem;

    fn nth_internal(n: usize) -> Self::Elem;

    /// Yield the nth element of the field. Panics if n >= ORDER.
    /// Assignment is arbitrary but must be unique to `n`.
    fn nth(n: usize) -> Self::Elem {
        if n >= Self::ORDER {
            let pow = log2(Self::ORDER as f32) as usize;
            panic!("{} out of bounds for GF(2^{}) member", n, pow)
        }

        Self::nth_internal(n)
    }

    /// Multiply a slice of elements by another. Writes into the output slice.
    ///
    /// # Panics
    /// Panics if the output slice does not have equal length to the input.
    fn mul_slice(elem: Self::Elem, input: &[Self::Elem], out: &mut [Self::Elem]) {
        assert_eq!(input.len(), out.len());

        for (i, o) in input.iter().zip(out) {
            *o = Self::mul(elem.clone(), i.clone())
        }
    }

    /// Multiply a slice of elements by another, adding each result to the corresponding value in
    /// `out`.
    ///
    /// # Panics
    /// Panics if the output slice does not have equal length to the input.
    fn mul_slice_add(elem: Self::Elem, input: &[Self::Elem], out: &mut [Self::Elem]) {
        assert_eq!(input.len(), out.len());

        for (i, o) in input.iter().zip(out) {
            *o = Self::add(o.clone(), Self::mul(elem.clone(), i.clone()))
        }
    }
}

/// Something which might hold a shard.
///
/// This trait is used in reconstruction, where some of the shards
/// may be unknown.
pub trait ReconstructShard<F: Field> {
    /// The size of the shard data; `None` if empty.
    fn len(&self) -> Option<usize>;

    /// Get a mutable reference to the shard data, returning `None` if uninitialized.
    fn get(&mut self) -> Option<&mut [F::Elem]>;

    /// Get a mutable reference to the shard data, initializing it to the
    /// given length if it was `None`. Returns an error if initialization fails.
    fn get_or_initialize(
        &mut self,
        len: usize,
    ) -> Result<&mut [F::Elem], Result<&mut [F::Elem], Error>>;
}

impl<F: Field, T: AsRef<[F::Elem]> + AsMut<[F::Elem]> + FromIterator<F::Elem>> ReconstructShard<F>
    for Option<T>
{
    fn len(&self) -> Option<usize> {
        self.as_ref().map(|x| x.as_ref().len())
    }

    fn get(&mut self) -> Option<&mut [F::Elem]> {
        self.as_mut().map(|x| x.as_mut())
    }

    fn get_or_initialize(
        &mut self,
        len: usize,
    ) -> Result<&mut [F::Elem], Result<&mut [F::Elem], Error>> {
        let is_some = self.is_some();
        let x = self
            .get_or_insert_with(|| iter::repeat(F::zero()).take(len).collect())
            .as_mut();

        if is_some {
            Ok(x)
        } else {
            Err(Ok(x))
        }
    }
}

impl<F: Field, T: AsRef<[F::Elem]> + AsMut<[F::Elem]>> ReconstructShard<F> for (T, bool) {
    fn len(&self) -> Option<usize> {
        if !self.1 {
            None
        } else {
            Some(self.0.as_ref().len())
        }
    }

    fn get(&mut self) -> Option<&mut [F::Elem]> {
        if !self.1 {
            None
        } else {
            Some(self.0.as_mut())
        }
    }

    fn get_or_initialize(
        &mut self,
        len: usize,
    ) -> Result<&mut [F::Elem], Result<&mut [F::Elem], Error>> {
        let x = self.0.as_mut();
        if x.len() == len {
            if self.1 {
                Ok(x)
            } else {
                Err(Ok(x))
            }
        } else {
            Err(Err(Error::IncorrectShardSize))
        }
    }
}
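The nested `Result<&mut [F::Elem], Result<&mut [F::Elem], Error>>` returned by `get_or_initialize` distinguishes "shard was already present" (`Ok`) from "shard was just zero-initialized" (`Err(Ok(..))`). A simplified standalone sketch of the `Option<T>` case, with the nesting flattened to a plain `Result` for readability (the free function here is hypothetical, not the crate's API):

```rust
// Simplified sketch: Ok(slice) if the shard already existed,
// Err(slice) if it was just created and zero-filled.
fn get_or_initialize(shard: &mut Option<Vec<u8>>, len: usize) -> Result<&mut [u8], &mut [u8]> {
    let was_some = shard.is_some();
    // Lazily allocate a zeroed buffer only when the shard is missing,
    // mirroring the iter::repeat(F::zero()).take(len).collect() above.
    let x = shard.get_or_insert_with(|| vec![0u8; len]).as_mut_slice();
    if was_some { Ok(x) } else { Err(x) }
}

fn main() {
    let mut missing: Option<Vec<u8>> = None;
    // A missing shard comes back zero-initialized through the Err arm.
    assert!(matches!(get_or_initialize(&mut missing, 4), Err(s) if s.len() == 4));
    assert_eq!(missing, Some(vec![0u8; 4]));

    let mut present = Some(vec![7u8; 4]);
    // An existing shard is handed back untouched through the Ok arm.
    assert!(matches!(get_or_initialize(&mut present, 4), Ok(s) if s[0] == 7));
}
```

Reconstruction code can then write recovered data through either arm while still knowing which shards were originally supplied by the caller.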
seaweed-volume/vendor/reed-solomon-erasure/src/macros.rs (vendored, new file, 245 lines)
@@ -0,0 +1,245 @@
|
||||
/// Constructs vector of shards.
|
||||
///
|
||||
/// # Example
|
||||
/// ```rust
|
||||
/// # #[macro_use] extern crate reed_solomon_erasure;
|
||||
/// # use reed_solomon_erasure::*;
|
||||
/// # fn main () {
|
||||
/// let shards: Vec<Vec<u8>> = shards!([1, 2, 3],
|
||||
/// [4, 5, 6]);
|
||||
/// # }
|
||||
/// ```
|
||||
#[macro_export]
|
||||
macro_rules! shards {
|
||||
(
|
||||
$( [ $( $x:expr ),* ] ),*
|
||||
) => {{
|
||||
vec![ $( vec![ $( $x ),* ] ),* ]
|
||||
}}
|
||||
}
|
||||
|
||||
/// Makes it easier to work with 2D slices, arrays, etc.
|
||||
///
|
||||
/// # Examples
|
||||
/// ## Byte arrays on stack to `Vec<&[u8]>`
|
||||
/// ```rust
|
||||
/// # #[macro_use] extern crate reed_solomon_erasure;
|
||||
/// # fn main () {
|
||||
/// let array: [[u8; 3]; 2] = [[1, 2, 3],
|
||||
/// [4, 5, 6]];
|
||||
///
|
||||
/// let refs: Vec<&[u8]> =
|
||||
/// convert_2D_slices!(array =>to_vec &[u8]);
|
||||
/// # }
|
||||
/// ```
|
||||
/// ## Byte arrays on stack to `Vec<&mut [u8]>` (borrow mutably)
|
||||
/// ```rust
|
||||
/// # #[macro_use] extern crate reed_solomon_erasure;
|
||||
/// # fn main () {
|
||||
/// let mut array: [[u8; 3]; 2] = [[1, 2, 3],
|
||||
/// [4, 5, 6]];
|
||||
///
|
||||
/// let refs: Vec<&mut [u8]> =
|
||||
/// convert_2D_slices!(array =>to_mut_vec &mut [u8]);
|
||||
/// # }
|
||||
/// ```
|
||||
/// ## Byte arrays on stack to `SmallVec<[&mut [u8]; 32]>` (borrow mutably)
|
||||
/// ```rust
|
||||
/// # #[macro_use] extern crate reed_solomon_erasure;
|
||||
/// # extern crate smallvec;
|
||||
/// # use smallvec::SmallVec;
|
||||
/// # fn main () {
|
||||
/// let mut array: [[u8; 3]; 2] = [[1, 2, 3],
|
||||
/// [4, 5, 6]];
|
||||
///
|
||||
/// let refs: SmallVec<[&mut [u8]; 32]> =
|
||||
/// convert_2D_slices!(array =>to_mut SmallVec<[&mut [u8]; 32]>,
|
||||
/// SmallVec::with_capacity);
|
||||
/// # }
|
||||
/// ```
|
||||
/// ## Shard array to `SmallVec<[&mut [u8]; 32]>` (borrow mutably)
|
||||
/// ```rust
|
||||
/// # #[macro_use] extern crate reed_solomon_erasure;
|
||||
/// # extern crate smallvec;
|
||||
/// # use smallvec::SmallVec;
|
||||
/// # fn main () {
|
||||
/// let mut shards = shards!([1, 2, 3],
|
||||
/// [4, 5, 6]);
|
||||
///
|
||||
/// let refs: SmallVec<[&mut [u8]; 32]> =
|
||||
/// convert_2D_slices!(shards =>to_mut SmallVec<[&mut [u8]; 32]>,
|
||||
/// SmallVec::with_capacity);
|
||||
/// # }
|
||||
/// ```
|
||||
/// ## Shard array to `Vec<&mut [u8]>` (borrow mutably) into `SmallVec<[&mut [u8]; 32]>` (move)
|
||||
/// ```rust
|
||||
/// # #[macro_use] extern crate reed_solomon_erasure;
|
||||
/// # extern crate smallvec;
|
||||
/// # use smallvec::SmallVec;
|
||||
/// # fn main () {
|
||||
/// let mut shards = shards!([1, 2, 3],
|
||||
/// [4, 5, 6]);
|
||||
///
|
||||
/// let refs1 = convert_2D_slices!(shards =>to_mut_vec &mut [u8]);
|
||||
///
|
||||
/// let refs2: SmallVec<[&mut [u8]; 32]> =
|
||||
/// convert_2D_slices!(refs1 =>into SmallVec<[&mut [u8]; 32]>,
|
||||
/// SmallVec::with_capacity);
|
||||
/// # }
|
||||
/// ```
|
||||
#[macro_export]
|
||||
macro_rules! convert_2D_slices {
|
||||
(
|
||||
$slice:expr =>into_vec $dst_type:ty
|
||||
) => {
|
||||
convert_2D_slices!($slice =>into Vec<$dst_type>,
|
||||
Vec::with_capacity)
|
||||
};
|
||||
(
|
||||
$slice:expr =>to_vec $dst_type:ty
|
||||
) => {
|
||||
convert_2D_slices!($slice =>to Vec<$dst_type>,
|
||||
Vec::with_capacity)
|
||||
};
|
||||
(
|
||||
$slice:expr =>to_mut_vec $dst_type:ty
|
||||
) => {
|
||||
convert_2D_slices!($slice =>to_mut Vec<$dst_type>,
|
||||
Vec::with_capacity)
|
||||
};
|
||||
(
|
||||
$slice:expr =>into $dst_type:ty, $with_capacity:path
|
||||
) => {{
|
||||
let mut result: $dst_type =
|
||||
$with_capacity($slice.len());
|
||||
for i in $slice.into_iter() {
|
||||
result.push(i);
|
||||
}
|
||||
result
|
||||
}};
|
||||
(
|
||||
$slice:expr =>to $dst_type:ty, $with_capacity:path
|
||||
) => {{
|
||||
let mut result: $dst_type =
|
||||
$with_capacity($slice.len());
|
||||
for i in $slice.iter() {
|
||||
result.push(i);
|
||||
}
|
||||
result
|
||||
}};
|
||||
(
|
||||
$slice:expr =>to_mut $dst_type:ty, $with_capacity:path
|
||||
) => {{
|
||||
let mut result: $dst_type =
|
||||
$with_capacity($slice.len());
|
||||
for i in $slice.iter_mut() {
|
||||
result.push(i);
|
||||
}
|
||||
result
|
||||
}}
|
||||
}
|
||||
|
||||
macro_rules! check_slices {
|
||||
(
|
||||
multi => $slices:expr
|
||||
) => {{
|
||||
let size = $slices[0].as_ref().len();
|
||||
        if size == 0 {
            return Err(Error::EmptyShard);
        }
        for slice in $slices.iter() {
            if slice.as_ref().len() != size {
                return Err(Error::IncorrectShardSize);
            }
        }
    }};
    (
        single => $slice_left:expr, single => $slice_right:expr
    ) => {{
        if $slice_left.as_ref().len() != $slice_right.as_ref().len() {
            return Err(Error::IncorrectShardSize);
        }
    }};
    (
        multi => $slices:expr, single => $single:expr
    ) => {{
        check_slices!(multi => $slices);

        check_slices!(single => $slices[0], single => $single);
    }};
    (
        multi => $slices_left:expr, multi => $slices_right:expr
    ) => {{
        check_slices!(multi => $slices_left);
        check_slices!(multi => $slices_right);

        check_slices!(single => $slices_left[0], single => $slices_right[0]);
    }}
}

macro_rules! check_slice_index {
    (
        all => $codec:expr, $index:expr
    ) => {{
        if $index >= $codec.total_shard_count {
            return Err(Error::InvalidIndex);
        }
    }};
    (
        data => $codec:expr, $index:expr
    ) => {{
        if $index >= $codec.data_shard_count {
            return Err(Error::InvalidIndex);
        }
    }};
    (
        parity => $codec:expr, $index:expr
    ) => {{
        if $index >= $codec.parity_shard_count {
            return Err(Error::InvalidIndex);
        }
    }};
}

macro_rules! check_piece_count {
    (
        all => $codec:expr, $pieces:expr
    ) => {{
        if $pieces.as_ref().len() < $codec.total_shard_count {
            return Err(Error::TooFewShards);
        }
        if $pieces.as_ref().len() > $codec.total_shard_count {
            return Err(Error::TooManyShards);
        }
    }};
    (
        data => $codec:expr, $pieces:expr
    ) => {{
        if $pieces.as_ref().len() < $codec.data_shard_count {
            return Err(Error::TooFewDataShards);
        }
        if $pieces.as_ref().len() > $codec.data_shard_count {
            return Err(Error::TooManyDataShards);
        }
    }};
    (
        parity => $codec:expr, $pieces:expr
    ) => {{
        if $pieces.as_ref().len() < $codec.parity_shard_count {
            return Err(Error::TooFewParityShards);
        }
        if $pieces.as_ref().len() > $codec.parity_shard_count {
            return Err(Error::TooManyParityShards);
        }
    }};
    (
        parity_buf => $codec:expr, $pieces:expr
    ) => {{
        if $pieces.as_ref().len() < $codec.parity_shard_count {
            return Err(Error::TooFewBufferShards);
        }
        if $pieces.as_ref().len() > $codec.parity_shard_count {
            return Err(Error::TooManyBufferShards);
        }
    }};
}

425 seaweed-volume/vendor/reed-solomon-erasure/src/matrix.rs vendored Normal file
@@ -0,0 +1,425 @@
#![allow(dead_code)]
extern crate alloc;

use alloc::vec;
use alloc::vec::Vec;

use crate::Field;
use smallvec::SmallVec;

#[derive(Debug)]
pub enum Error {
    SingularMatrix,
}

macro_rules! acc {
    (
        $m:ident, $r:expr, $c:expr
    ) => {
        $m.data[$r * $m.col_count + $c]
    };
}

pub fn flatten<T>(m: Vec<Vec<T>>) -> Vec<T> {
    let mut result: Vec<T> = Vec::with_capacity(m.len() * m[0].len());
    for row in m {
        for v in row {
            result.push(v);
        }
    }
    result
}

#[derive(PartialEq, Debug, Clone)]
pub struct Matrix<F: Field> {
    row_count: usize,
    col_count: usize,
    data: SmallVec<[F::Elem; 1024]>, // store in flattened structure
    // the smallvec can hold a matrix of size up to 32x32 in stack
}

fn calc_matrix_row_start_end(col_count: usize, row: usize) -> (usize, usize) {
    let start = row * col_count;
    let end = start + col_count;

    (start, end)
}

impl<F: Field> Matrix<F> {
    fn calc_row_start_end(&self, row: usize) -> (usize, usize) {
        calc_matrix_row_start_end(self.col_count, row)
    }

    pub fn new(rows: usize, cols: usize) -> Matrix<F> {
        let data = SmallVec::from_vec(vec![F::zero(); rows * cols]);

        Matrix {
            row_count: rows,
            col_count: cols,
            data,
        }
    }

    pub fn new_with_data(init_data: Vec<Vec<F::Elem>>) -> Matrix<F> {
        let rows = init_data.len();
        let cols = init_data[0].len();

        for r in init_data.iter() {
            if r.len() != cols {
                panic!("Inconsistent row sizes")
            }
        }

        let data = SmallVec::from_vec(flatten(init_data));

        Matrix {
            row_count: rows,
            col_count: cols,
            data,
        }
    }

    #[cfg(test)]
    pub fn make_random(size: usize) -> Matrix<F>
    where
        rand::distributions::Standard: rand::distributions::Distribution<F::Elem>,
    {
        let mut vec: Vec<Vec<F::Elem>> = vec![vec![Default::default(); size]; size];
        for v in vec.iter_mut() {
            crate::tests::fill_random(v);
        }

        Matrix::new_with_data(vec)
    }

    pub fn identity(size: usize) -> Matrix<F> {
        let mut result = Self::new(size, size);
        for i in 0..size {
            acc!(result, i, i) = F::one();
        }
        result
    }

    pub fn col_count(&self) -> usize {
        self.col_count
    }

    pub fn row_count(&self) -> usize {
        self.row_count
    }

    pub fn get(&self, r: usize, c: usize) -> F::Elem {
        acc!(self, r, c).clone()
    }

    pub fn set(&mut self, r: usize, c: usize, val: F::Elem) {
        acc!(self, r, c) = val;
    }

    pub fn multiply(&self, rhs: &Matrix<F>) -> Matrix<F> {
        if self.col_count != rhs.row_count {
            panic!(
                "Column count on left is different from row count on right, lhs: {}, rhs: {}",
                self.col_count, rhs.row_count
            )
        }
        let mut result = Self::new(self.row_count, rhs.col_count);
        for r in 0..self.row_count {
            for c in 0..rhs.col_count {
                let mut val = F::zero();
                for i in 0..self.col_count {
                    let mul = F::mul(acc!(self, r, i).clone(), acc!(rhs, i, c).clone());

                    val = F::add(val, mul);
                }
                acc!(result, r, c) = val;
            }
        }
        result
    }

    pub fn augment(&self, rhs: &Matrix<F>) -> Matrix<F> {
        if self.row_count != rhs.row_count {
            panic!(
                "Matrices do not have the same row count, lhs: {}, rhs: {}",
                self.row_count, rhs.row_count
            )
        }
        let mut result = Self::new(self.row_count, self.col_count + rhs.col_count);
        for r in 0..self.row_count {
            for c in 0..self.col_count {
                acc!(result, r, c) = acc!(self, r, c).clone();
            }
            let self_column_count = self.col_count;
            for c in 0..rhs.col_count {
                acc!(result, r, self_column_count + c) = acc!(rhs, r, c).clone();
            }
        }

        result
    }

    pub fn sub_matrix(&self, rmin: usize, cmin: usize, rmax: usize, cmax: usize) -> Matrix<F> {
        let mut result = Self::new(rmax - rmin, cmax - cmin);
        for r in rmin..rmax {
            for c in cmin..cmax {
                acc!(result, r - rmin, c - cmin) = acc!(self, r, c).clone();
            }
        }
        result
    }

    pub fn get_row(&self, row: usize) -> &[F::Elem] {
        let (start, end) = self.calc_row_start_end(row);

        &self.data[start..end]
    }

    pub fn swap_rows(&mut self, r1: usize, r2: usize) {
        let (r1_s, _) = self.calc_row_start_end(r1);
        let (r2_s, _) = self.calc_row_start_end(r2);

        if r1 == r2 {
            return;
        } else {
            for i in 0..self.col_count {
                self.data.swap(r1_s + i, r2_s + i);
            }
        }
    }

    pub fn is_square(&self) -> bool {
        self.row_count == self.col_count
    }

    pub fn gaussian_elim(&mut self) -> Result<(), Error> {
        for r in 0..self.row_count {
            if acc!(self, r, r) == F::zero() {
                for r_below in r + 1..self.row_count {
                    if acc!(self, r_below, r) != F::zero() {
                        self.swap_rows(r, r_below);
                        break;
                    }
                }
            }
            // If we couldn't find one, the matrix is singular.
            if acc!(self, r, r) == F::zero() {
                return Err(Error::SingularMatrix);
            }
            // Scale to 1.
            if acc!(self, r, r) != F::one() {
                let scale = F::div(F::one(), acc!(self, r, r).clone());
                for c in 0..self.col_count {
                    acc!(self, r, c) = F::mul(scale, acc!(self, r, c).clone());
                }
            }
            // Make everything below the 1 be a 0 by subtracting
            // a multiple of it. (Subtraction and addition are
            // both exclusive or in the Galois field.)
            for r_below in r + 1..self.row_count {
                if acc!(self, r_below, r) != F::zero() {
                    let scale = acc!(self, r_below, r).clone();
                    for c in 0..self.col_count {
                        acc!(self, r_below, c) = F::add(
                            acc!(self, r_below, c).clone(),
                            F::mul(scale, acc!(self, r, c).clone()),
                        );
                    }
                }
            }
        }

        // Now clear the part above the main diagonal.
        for d in 0..self.row_count {
            for r_above in 0..d {
                if acc!(self, r_above, d) != F::zero() {
                    let scale = acc!(self, r_above, d).clone();
                    for c in 0..self.col_count {
                        acc!(self, r_above, c) = F::add(
                            acc!(self, r_above, c).clone(),
                            F::mul(scale, acc!(self, d, c).clone()),
                        );
                    }
                }
            }
        }
        Ok(())
    }

    pub fn invert(&self) -> Result<Matrix<F>, Error> {
        if !self.is_square() {
            panic!("Trying to invert a non-square matrix")
        }

        let row_count = self.row_count;
        let col_count = self.col_count;

        let mut work = self.augment(&Self::identity(row_count));
        work.gaussian_elim()?;

        Ok(work.sub_matrix(0, row_count, col_count, col_count * 2))
    }

    pub fn vandermonde(rows: usize, cols: usize) -> Matrix<F> {
        let mut result = Self::new(rows, cols);

        for r in 0..rows {
            // doesn't matter what `r_a` is as long as it's unique.
            // then the vandermonde matrix is invertible.
            let r_a = F::nth(r);
            for c in 0..cols {
                acc!(result, r, c) = F::exp(r_a, c);
            }
        }

        result
    }
}

#[cfg(test)]
mod tests {
    extern crate alloc;

    use alloc::vec;

    use super::Matrix;
    use crate::galois_8;

    macro_rules! matrix {
        (
            $(
                [ $( $x:expr ),+ ]
            ),*
        ) => (
            Matrix::<galois_8::Field>::new_with_data(vec![ $( vec![$( $x ),*] ),* ])
        );
        ($rows:expr, $cols:expr) => (Matrix::new($rows, $cols));
    }

    #[test]
    fn test_matrix_col_count() {
        let m1 = matrix!([1, 0, 0]);
        let m2 = matrix!([0, 0, 0], [0, 0, 0]);
        let m3: Matrix<galois_8::Field> = Matrix::new(1, 4);

        assert_eq!(3, m1.col_count());
        assert_eq!(3, m2.col_count());
        assert_eq!(4, m3.col_count());
    }

    #[test]
    fn test_matrix_row_count() {
        let m1 = matrix!([1, 0, 0]);
        let m2 = matrix!([0, 0, 0], [0, 0, 0]);
        let m3: Matrix<galois_8::Field> = Matrix::new(1, 4);

        assert_eq!(1, m1.row_count());
        assert_eq!(2, m2.row_count());
        assert_eq!(1, m3.row_count());
    }

    #[test]
    fn test_matrix_swap_rows() {
        {
            let mut m1 = matrix!([1, 2, 3], [4, 5, 6], [7, 8, 9]);
            let expect = matrix!([7, 8, 9], [4, 5, 6], [1, 2, 3]);
            m1.swap_rows(0, 2);
            assert_eq!(expect, m1);
        }
        {
            let mut m1 = matrix!([1, 2, 3], [4, 5, 6], [7, 8, 9]);
            let expect = m1.clone();
            m1.swap_rows(0, 0);
            assert_eq!(expect, m1);
            m1.swap_rows(1, 1);
            assert_eq!(expect, m1);
            m1.swap_rows(2, 2);
            assert_eq!(expect, m1);
        }
    }

    #[test]
    #[should_panic]
    fn test_inconsistent_row_sizes() {
        matrix!([1, 0, 0], [0, 1], [0, 0, 1]);
    }

    #[test]
    #[should_panic]
    fn test_incompatible_multiply() {
        let m1 = matrix!([0, 1], [0, 1], [0, 1]);
        let m2 = matrix!([0, 1, 2]);

        m1.multiply(&m2);
    }

    #[test]
    #[should_panic]
    fn test_incompatible_augment() {
        let m1 = matrix!([0, 1]);
        let m2 = matrix!([0, 1], [2, 3]);

        m1.augment(&m2);
    }

    #[test]
    fn test_matrix_identity() {
        let m1 = Matrix::identity(3);
        let m2 = matrix!([1, 0, 0], [0, 1, 0], [0, 0, 1]);
        assert_eq!(m1, m2);
    }

    #[test]
    fn test_matrix_multiply() {
        let m1 = matrix!([1, 2], [3, 4]);
        let m2 = matrix!([5, 6], [7, 8]);
        let actual = m1.multiply(&m2);
        let expect = matrix!([11, 22], [19, 42]);
        assert_eq!(actual, expect);
    }

    #[test]
    fn test_matrix_inverse_pass_cases() {
        {
            // Test case validating inverse of the input Matrix.
            let m = matrix!([56, 23, 98], [3, 100, 200], [45, 201, 123])
                .invert()
                .unwrap();
            let expect = matrix!([175, 133, 33], [130, 13, 245], [112, 35, 126]);
            assert_eq!(m, expect);
        }
        {
            // Test case validating inverse of the input Matrix.
            let m = matrix!(
                [1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0],
                [0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1],
                [7, 7, 6, 6, 1]
            )
            .invert()
            .unwrap();
            let expect = matrix!(
                [1, 0, 0, 0, 0],
                [0, 1, 0, 0, 0],
                [123, 123, 1, 122, 122],
                [0, 0, 1, 0, 0],
                [0, 0, 0, 1, 0]
            );
            assert_eq!(m, expect);
        }
    }

    #[test]
    #[should_panic]
    fn test_matrix_inverse_non_square() {
        // Test case with a non-square matrix.
        matrix!([56, 23], [3, 100], [45, 201]).invert().unwrap();
    }

    #[test]
    #[should_panic]
    fn test_matrix_inverse_singular() {
        matrix!([4, 2], [12, 6]).invert().unwrap();
    }
}

489 seaweed-volume/vendor/reed-solomon-erasure/src/tests/galois_16.rs vendored Normal file
@@ -0,0 +1,489 @@
extern crate alloc;

use alloc::vec;
use alloc::vec::Vec;

use super::{fill_random, option_shards_into_shards, shards_into_option_shards};
use crate::galois_16::ReedSolomon;

macro_rules! make_random_shards {
    ($per_shard:expr, $size:expr) => {{
        let mut shards = Vec::with_capacity(20);
        for _ in 0..$size {
            shards.push(vec![[0; 2]; $per_shard]);
        }

        for s in shards.iter_mut() {
            fill_random(s);
        }

        shards
    }};
}

#[test]
fn correct_field_order_restriction() {
    const ORDER: usize = 1 << 16;

    assert!(ReedSolomon::new(ORDER, 1).is_err());
    assert!(ReedSolomon::new(1, ORDER).is_err());

    // way too slow, because it needs to build a 65536*65536 vandermonde matrix
    // assert!(ReedSolomon::new(ORDER - 1, 1).is_ok());
    assert!(ReedSolomon::new(1, ORDER - 1).is_ok());
}

quickcheck! {
    fn qc_encode_verify_reconstruct_verify(data: usize,
                                           parity: usize,
                                           corrupt: usize,
                                           size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let corrupt = corrupt % (parity + 1);

        let mut corrupt_pos_s = Vec::with_capacity(corrupt);
        for _ in 0..corrupt {
            let mut pos = rand::random::<usize>() % (data + parity);

            while let Some(_) = corrupt_pos_s.iter().find(|&&x| x == pos) {
                pos = rand::random::<usize>() % (data + parity);
            }

            corrupt_pos_s.push(pos);
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        {
            let mut refs =
                convert_2D_slices!(expect =>to_mut_vec &mut [[u8; 2]]);

            r.encode(&mut refs).unwrap();
        }

        let expect = expect;

        let mut shards = expect.clone();

        // corrupt shards
        for &p in corrupt_pos_s.iter() {
            fill_random(&mut shards[p]);
        }
        let mut slice_present = vec![true; data + parity];
        for &p in corrupt_pos_s.iter() {
            slice_present[p] = false;
        }

        // reconstruct
        {
            let mut refs: Vec<_> = shards.iter_mut()
                .map(|i| &mut i[..])
                .zip(slice_present.iter().cloned())
                .collect();

            r.reconstruct(&mut refs[..]).unwrap();
        }

        ({
            let refs =
                convert_2D_slices!(expect =>to_vec &[[u8; 2]]);

            r.verify(&refs).unwrap()
        })
        &&
        expect == shards
        &&
        ({
            let refs =
                convert_2D_slices!(shards =>to_vec &[[u8; 2]]);

            r.verify(&refs).unwrap()
        })
    }

    fn qc_encode_verify_reconstruct_verify_shards(data: usize,
                                                  parity: usize,
                                                  corrupt: usize,
                                                  size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let corrupt = corrupt % (parity + 1);

        let mut corrupt_pos_s = Vec::with_capacity(corrupt);
        for _ in 0..corrupt {
            let mut pos = rand::random::<usize>() % (data + parity);

            while let Some(_) = corrupt_pos_s.iter().find(|&&x| x == pos) {
                pos = rand::random::<usize>() % (data + parity);
            }

            corrupt_pos_s.push(pos);
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        r.encode(&mut expect).unwrap();

        let expect = expect;

        let mut shards = shards_into_option_shards(expect.clone());

        // corrupt shards
        for &p in corrupt_pos_s.iter() {
            shards[p] = None;
        }

        // reconstruct
        r.reconstruct(&mut shards).unwrap();

        let shards = option_shards_into_shards(shards);

        r.verify(&expect).unwrap()
            && expect == shards
            && r.verify(&shards).unwrap()
    }

    fn qc_verify(data: usize,
                 parity: usize,
                 corrupt: usize,
                 size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let corrupt = corrupt % (parity + 1);

        let mut corrupt_pos_s = Vec::with_capacity(corrupt);
        for _ in 0..corrupt {
            let mut pos = rand::random::<usize>() % (data + parity);

            while let Some(_) = corrupt_pos_s.iter().find(|&&x| x == pos) {
                pos = rand::random::<usize>() % (data + parity);
            }

            corrupt_pos_s.push(pos);
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        {
            let mut refs =
                convert_2D_slices!(expect =>to_mut_vec &mut [[u8; 2]]);

            r.encode(&mut refs).unwrap();
        }

        let expect = expect;

        let mut shards = expect.clone();

        // corrupt shards
        for &p in corrupt_pos_s.iter() {
            fill_random(&mut shards[p]);
        }

        ({
            let refs =
                convert_2D_slices!(expect =>to_vec &[[u8; 2]]);

            r.verify(&refs).unwrap()
        })
        &&
        ((corrupt > 0 && expect != shards)
            || (corrupt == 0 && expect == shards))
        &&
        ({
            let refs =
                convert_2D_slices!(shards =>to_vec &[[u8; 2]]);

            (corrupt > 0 && !r.verify(&refs).unwrap())
                || (corrupt == 0 && r.verify(&refs).unwrap())
        })
    }

    fn qc_verify_shards(data: usize,
                        parity: usize,
                        corrupt: usize,
                        size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let corrupt = corrupt % (parity + 1);

        let mut corrupt_pos_s = Vec::with_capacity(corrupt);
        for _ in 0..corrupt {
            let mut pos = rand::random::<usize>() % (data + parity);

            while let Some(_) = corrupt_pos_s.iter().find(|&&x| x == pos) {
                pos = rand::random::<usize>() % (data + parity);
            }

            corrupt_pos_s.push(pos);
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        r.encode(&mut expect).unwrap();

        let expect = expect;

        let mut shards = expect.clone();

        // corrupt shards
        for &p in corrupt_pos_s.iter() {
            fill_random(&mut shards[p]);
        }

        r.verify(&expect).unwrap()
        &&
        ((corrupt > 0 && expect != shards)
            || (corrupt == 0 && expect == shards))
        &&
        ((corrupt > 0 && !r.verify(&shards).unwrap())
            || (corrupt == 0 && r.verify(&shards).unwrap()))
    }

    fn qc_encode_sep_same_as_encode(data: usize,
                                    parity: usize,
                                    size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        let mut shards = expect.clone();

        {
            let mut refs =
                convert_2D_slices!(expect =>to_mut_vec &mut [[u8; 2]]);

            r.encode(&mut refs).unwrap();
        }

        let expect = expect;

        {
            let (data, parity) = shards.split_at_mut(data);

            let data_refs =
                convert_2D_slices!(data =>to_mut_vec &[[u8; 2]]);

            let mut parity_refs =
                convert_2D_slices!(parity =>to_mut_vec &mut [[u8; 2]]);

            r.encode_sep(&data_refs, &mut parity_refs).unwrap();
        }

        let shards = shards;

        expect == shards
    }

    fn qc_encode_sep_same_as_encode_shards(data: usize,
                                           parity: usize,
                                           size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        let mut shards = expect.clone();

        r.encode(&mut expect).unwrap();

        let expect = expect;

        {
            let (data, parity) = shards.split_at_mut(data);

            r.encode_sep(data, parity).unwrap();
        }

        let shards = shards;

        expect == shards
    }

    fn qc_encode_single_same_as_encode(data: usize,
                                       parity: usize,
                                       size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        let mut shards = expect.clone();

        {
            let mut refs =
                convert_2D_slices!(expect =>to_mut_vec &mut [[u8; 2]]);

            r.encode(&mut refs).unwrap();
        }

        let expect = expect;

        {
            let mut refs =
                convert_2D_slices!(shards =>to_mut_vec &mut [[u8; 2]]);

            for i in 0..data {
                r.encode_single(i, &mut refs).unwrap();
            }
        }

        let shards = shards;

        expect == shards
    }

    fn qc_encode_single_same_as_encode_shards(data: usize,
                                              parity: usize,
                                              size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        let mut shards = expect.clone();

        r.encode(&mut expect).unwrap();

        let expect = expect;

        for i in 0..data {
            r.encode_single(i, &mut shards).unwrap();
        }

        let shards = shards;

        expect == shards
    }

    fn qc_encode_single_sep_same_as_encode(data: usize,
                                           parity: usize,
                                           size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        let mut shards = expect.clone();

        {
            let mut refs =
                convert_2D_slices!(expect =>to_mut_vec &mut [[u8; 2]]);

            r.encode(&mut refs).unwrap();
        }

        let expect = expect;

        {
            let (data_shards, parity_shards) = shards.split_at_mut(data);

            let data_refs =
                convert_2D_slices!(data_shards =>to_mut_vec &[[u8; 2]]);

            let mut parity_refs =
                convert_2D_slices!(parity_shards =>to_mut_vec &mut [[u8; 2]]);

            for i in 0..data {
                r.encode_single_sep(i, data_refs[i], &mut parity_refs).unwrap();
            }
        }

        let shards = shards;

        expect == shards
    }

    fn qc_encode_single_sep_same_as_encode_shards(data: usize,
                                                  parity: usize,
                                                  size: usize) -> bool {
        let data = 1 + data % 255;
        let mut parity = 1 + parity % 255;
        if data + parity > 256 {
            parity -= data + parity - 256;
        }

        let size = 1 + size % 1_000_000;

        let r = ReedSolomon::new(data, parity).unwrap();

        let mut expect = make_random_shards!(size, data + parity);
        let mut shards = expect.clone();

        r.encode(&mut expect).unwrap();

        let expect = expect;

        {
            let (data_shards, parity_shards) = shards.split_at_mut(data);

            for i in 0..data {
                r.encode_single_sep(i, &data_shards[i], parity_shards).unwrap();
            }
        }

        let shards = shards;

        expect == shards
    }
}
2619 seaweed-volume/vendor/reed-solomon-erasure/src/tests/mod.rs vendored Normal file
File diff suppressed because it is too large
@@ -10,6 +10,7 @@ import (
	"net"
	"net/http"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"sync"
@@ -24,6 +25,7 @@ import (
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	"github.com/seaweedfs/seaweedfs/test/volume_server/framework"
	"github.com/seaweedfs/seaweedfs/weed/command"
	"github.com/seaweedfs/seaweedfs/weed/glog"
	flag "github.com/seaweedfs/seaweedfs/weed/util/fla9"
@@ -37,18 +39,19 @@ const (

// TestCluster manages the weed mini instance for integration testing
type TestCluster struct {
	dataDir       string
	ctx           context.Context
	cancel        context.CancelFunc
	s3Client      *s3.S3
	isRunning     bool
	startOnce     sync.Once
	wg            sync.WaitGroup
	masterPort    int
	volumePort    int
	filerPort     int
	s3Port        int
	s3Endpoint    string
	rustVolumeCmd *exec.Cmd
}

// TestS3Integration demonstrates basic S3 operations against a running weed mini instance
@@ -236,6 +239,14 @@ func startMiniCluster(t *testing.T, extraArgs ...string) (*TestCluster, error) {
		return nil, fmt.Errorf("S3 service failed to start: %v", err)
	}

	// If VOLUME_SERVER_IMPL=rust, start a Rust volume server alongside weed mini
	if os.Getenv("VOLUME_SERVER_IMPL") == "rust" {
		if err := cluster.startRustVolumeServer(t); err != nil {
			cancel()
			return nil, fmt.Errorf("failed to start Rust volume server: %v", err)
		}
	}

	cluster.isRunning = true

	// Create S3 client
@@ -257,8 +268,82 @@ func startMiniCluster(t *testing.T, extraArgs ...string) (*TestCluster, error) {
	return cluster, nil
}

// startRustVolumeServer starts a Rust volume server that registers with the same master.
func (c *TestCluster) startRustVolumeServer(t *testing.T) error {
	t.Helper()

	rustBinary, err := framework.FindOrBuildRustBinary()
	if err != nil {
		return fmt.Errorf("resolve rust volume binary: %v", err)
	}

	rustVolumePort, err := findAvailablePort()
	if err != nil {
		return fmt.Errorf("find rust volume port: %v", err)
	}
	rustVolumeGrpcPort, err := findAvailablePort()
	if err != nil {
		return fmt.Errorf("find rust volume grpc port: %v", err)
	}

	rustVolumeDir := filepath.Join(c.dataDir, "rust-volume")
	if err := os.MkdirAll(rustVolumeDir, 0o755); err != nil {
		return fmt.Errorf("create rust volume dir: %v", err)
	}

	securityToml := filepath.Join(c.dataDir, "security.toml")

	args := []string{
		"--port", strconv.Itoa(rustVolumePort),
		"--port.grpc", strconv.Itoa(rustVolumeGrpcPort),
		"--port.public", strconv.Itoa(rustVolumePort),
		"--ip", "127.0.0.1",
		"--ip.bind", "127.0.0.1",
		"--dir", rustVolumeDir,
		"--max", "16",
		"--master", "127.0.0.1:" + strconv.Itoa(c.masterPort),
		"--securityFile", securityToml,
		"--preStopSeconds", "0",
	}

	logFile, err := os.Create(filepath.Join(c.dataDir, "rust-volume.log"))
	if err != nil {
		return fmt.Errorf("create rust volume log: %v", err)
	}

	c.rustVolumeCmd = exec.Command(rustBinary, args...)
	c.rustVolumeCmd.Dir = c.dataDir
	c.rustVolumeCmd.Stdout = logFile
	c.rustVolumeCmd.Stderr = logFile
	if err := c.rustVolumeCmd.Start(); err != nil {
		logFile.Close()
		return fmt.Errorf("start rust volume: %v", err)
	}

	// Wait for the Rust volume server to be ready
	rustEndpoint := fmt.Sprintf("http://127.0.0.1:%d/healthz", rustVolumePort)
	deadline := time.Now().Add(15 * time.Second)
	client := &http.Client{Timeout: 1 * time.Second}
	for time.Now().Before(deadline) {
		resp, err := client.Get(rustEndpoint)
		if err == nil {
			resp.Body.Close()
			t.Logf("Rust volume server ready on port %d (grpc %d)", rustVolumePort, rustVolumeGrpcPort)
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("rust volume server not ready after 15s (port %d)", rustVolumePort)
}

// Stop stops the test cluster
func (c *TestCluster) Stop() {
	// Stop Rust volume server first
	if c.rustVolumeCmd != nil && c.rustVolumeCmd.Process != nil {
		c.rustVolumeCmd.Process.Kill()
		c.rustVolumeCmd.Wait()
	}

	if c.cancel != nil {
		c.cancel()
	}
|
||||
|
||||
@@ -21,6 +21,7 @@ import (
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/iam"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/seaweedfs/seaweedfs/test/volume_server/framework"
	"github.com/seaweedfs/seaweedfs/weed/command"
	"github.com/seaweedfs/seaweedfs/weed/glog"
	"github.com/seaweedfs/seaweedfs/weed/pb"

@@ -42,6 +43,7 @@ type TestCluster struct {
	filerGrpcPort int
	s3Port        int
	s3Endpoint    string
	rustVolumeCmd *exec.Cmd
}

func TestS3PolicyShellRevised(t *testing.T) {

@@ -822,6 +824,15 @@ enabled = true
		cancel()
		return nil, err
	}

	// If VOLUME_SERVER_IMPL=rust, start a Rust volume server alongside weed mini
	if os.Getenv("VOLUME_SERVER_IMPL") == "rust" {
		if err := cluster.startRustVolumeServer(t); err != nil {
			cancel()
			return nil, fmt.Errorf("failed to start Rust volume server: %v", err)
		}
	}

	cluster.isRunning = true
	return cluster, nil
}

@@ -840,7 +851,80 @@ func waitForS3Ready(endpoint string, timeout time.Duration) error {
	return fmt.Errorf("timeout waiting for S3")
}

// startRustVolumeServer starts a Rust volume server that registers with the same master.
func (c *TestCluster) startRustVolumeServer(t *testing.T) error {
	t.Helper()

	rustBinary, err := framework.FindOrBuildRustBinary()
	if err != nil {
		return fmt.Errorf("resolve rust volume binary: %v", err)
	}

	rustVolumePort, err := findAvailablePort()
	if err != nil {
		return fmt.Errorf("find rust volume port: %v", err)
	}
	rustVolumeGrpcPort, err := findAvailablePort()
	if err != nil {
		return fmt.Errorf("find rust volume grpc port: %v", err)
	}

	rustVolumeDir := filepath.Join(c.dataDir, "rust-volume")
	if err := os.MkdirAll(rustVolumeDir, 0o755); err != nil {
		return fmt.Errorf("create rust volume dir: %v", err)
	}

	securityToml := filepath.Join(c.dataDir, "security.toml")

	args := []string{
		"--port", strconv.Itoa(rustVolumePort),
		"--port.grpc", strconv.Itoa(rustVolumeGrpcPort),
		"--port.public", strconv.Itoa(rustVolumePort),
		"--ip", "127.0.0.1",
		"--ip.bind", "127.0.0.1",
		"--dir", rustVolumeDir,
		"--max", "16",
		"--master", "127.0.0.1:" + strconv.Itoa(c.masterPort),
		"--securityFile", securityToml,
		"--preStopSeconds", "0",
	}

	logFile, err := os.Create(filepath.Join(c.dataDir, "rust-volume.log"))
	if err != nil {
		return fmt.Errorf("create rust volume log: %v", err)
	}

	c.rustVolumeCmd = exec.Command(rustBinary, args...)
	c.rustVolumeCmd.Dir = c.dataDir
	c.rustVolumeCmd.Stdout = logFile
	c.rustVolumeCmd.Stderr = logFile
	if err := c.rustVolumeCmd.Start(); err != nil {
		logFile.Close()
		return fmt.Errorf("start rust volume: %v", err)
	}

	rustEndpoint := fmt.Sprintf("http://127.0.0.1:%d/healthz", rustVolumePort)
	deadline := time.Now().Add(15 * time.Second)
	client := &http.Client{Timeout: 1 * time.Second}
	for time.Now().Before(deadline) {
		resp, err := client.Get(rustEndpoint)
		if err == nil {
			resp.Body.Close()
			t.Logf("Rust volume server ready on port %d (grpc %d)", rustVolumePort, rustVolumeGrpcPort)
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("rust volume server not ready after 15s (port %d)", rustVolumePort)
}

func (c *TestCluster) Stop() {
	// Stop Rust volume server first
	if c.rustVolumeCmd != nil && c.rustVolumeCmd.Process != nil {
		c.rustVolumeCmd.Process.Kill()
		c.rustVolumeCmd.Wait()
	}

	if c.cancel != nil {
		c.cancel()
	}

@@ -27,6 +27,12 @@ const (
	testVolumeSizeLimitMB = 32
)

var (
	weedBinaryOnce sync.Once
	weedBinaryPath string
	weedBinaryErr  error
)

// Cluster is a lightweight SeaweedFS master + one volume server test harness.
type Cluster struct {
	testingTB testing.TB

@@ -326,6 +332,13 @@ func writeSecurityConfig(configDir string, profile matrix.Profile) error {
		b.WriteString("\"\n")
		b.WriteString("expires_after_seconds = 60\n")
	}
	if profile.EnableUIAccess {
		if b.Len() > 0 {
			b.WriteString("\n")
		}
		b.WriteString("[access]\n")
		b.WriteString("ui = true\n")
	}
	if b.Len() == 0 {
		b.WriteString("# optional security config generated for integration tests\n")
	}
@@ -341,40 +354,43 @@ func FindOrBuildWeedBinary() (string, error) {
		return "", fmt.Errorf("WEED_BINARY is set but not executable: %s", fromEnv)
	}

-	repoRoot := ""
-	if _, file, _, ok := runtime.Caller(0); ok {
-		repoRoot = filepath.Clean(filepath.Join(filepath.Dir(file), "..", "..", ".."))
-		candidate := filepath.Join(repoRoot, "weed", "weed")
-		if isExecutableFile(candidate) {
-			return candidate, nil
-		}
-	}
-
-	if repoRoot == "" {
-		return "", errors.New("unable to detect repository root")
-	}
-
-	binDir := filepath.Join(os.TempDir(), "seaweedfs_volume_server_it_bin")
-	if err := os.MkdirAll(binDir, 0o755); err != nil {
-		return "", fmt.Errorf("create binary directory %s: %w", binDir, err)
-	}
-	binPath := filepath.Join(binDir, "weed")
-	if isExecutableFile(binPath) {
-		return binPath, nil
-	}
-	cmd := exec.Command("go", "build", "-o", binPath, ".")
-	cmd.Dir = filepath.Join(repoRoot, "weed")
-	var out bytes.Buffer
-	cmd.Stdout = &out
-	cmd.Stderr = &out
-	if err := cmd.Run(); err != nil {
-		return "", fmt.Errorf("build weed binary: %w\n%s", err, out.String())
-	}
-	if !isExecutableFile(binPath) {
-		return "", fmt.Errorf("built weed binary is not executable: %s", binPath)
-	}
-	return binPath, nil
+	weedBinaryOnce.Do(func() {
+		repoRoot := ""
+		if _, file, _, ok := runtime.Caller(0); ok {
+			repoRoot = filepath.Clean(filepath.Join(filepath.Dir(file), "..", "..", ".."))
+		}
+		if repoRoot == "" {
+			weedBinaryErr = errors.New("unable to detect repository root")
+			return
+		}
+		binDir := filepath.Join(os.TempDir(), "seaweedfs_volume_server_it_bin")
+		if err := os.MkdirAll(binDir, 0o755); err != nil {
+			weedBinaryErr = fmt.Errorf("create binary directory %s: %w", binDir, err)
+			return
+		}
+		binPath := filepath.Join(binDir, "weed")
+		cmd := exec.Command("go", "build", "-o", binPath, ".")
+		cmd.Dir = filepath.Join(repoRoot, "weed")
+		var out bytes.Buffer
+		cmd.Stdout = &out
+		cmd.Stderr = &out
+		if err := cmd.Run(); err != nil {
+			weedBinaryErr = fmt.Errorf("build weed binary: %w\n%s", err, out.String())
+			return
+		}
+		if !isExecutableFile(binPath) {
+			weedBinaryErr = fmt.Errorf("built weed binary is not executable: %s", binPath)
+			return
+		}
+		weedBinaryPath = binPath
+	})
+
+	if weedBinaryErr != nil {
+		return "", weedBinaryErr
+	}
+	return weedBinaryPath, nil
}

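The refactor above caches the build result in package variables guarded by `sync.Once`, so parallel tests trigger at most one `go build` and all later callers see the same cached path or the same cached error. A standalone sketch of that memoization shape (`findOrBuild` and its injected `build` closure are illustrative stand-ins, not the harness API):

```go
package main

import (
	"fmt"
	"sync"
)

var (
	buildOnce sync.Once
	buildPath string
	buildErr  error
)

// findOrBuild runs the expensive step exactly once per process; every
// later call (from any goroutine) returns the cached path or error.
func findOrBuild(build func() (string, error)) (string, error) {
	buildOnce.Do(func() {
		buildPath, buildErr = build()
	})
	if buildErr != nil {
		return "", buildErr
	}
	return buildPath, nil
}

func main() {
	calls := 0
	build := func() (string, error) {
		calls++ // counts how often the "compile" actually runs
		return "/tmp/weed", nil
	}
	for i := 0; i < 3; i++ {
		p, err := findOrBuild(build)
		if err != nil {
			panic(err)
		}
		fmt.Println(p)
	}
	fmt.Println("build calls:", calls) // build calls: 1
}
```

Note that errors are cached too: a failed build is reported to every caller rather than retried, which keeps test output deterministic.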
func isExecutableFile(path string) bool {

@@ -11,7 +11,7 @@ import (
type DualVolumeCluster = MultiVolumeCluster

// StartDualVolumeCluster starts a cluster with 2 volume servers.
-// Deprecated: Use StartMultiVolumeCluster(t, profile, 2) directly.
-func StartDualVolumeCluster(t testing.TB, profile matrix.Profile) *DualVolumeCluster {
-	return StartMultiVolumeCluster(t, profile, 2)
+// Deprecated: Use StartMultiVolumeClusterAuto(t, profile, 2) directly.
+func StartDualVolumeCluster(t testing.TB, profile matrix.Profile) MultiCluster {
+	return StartMultiVolumeClusterAuto(t, profile, 2)
}

test/volume_server/framework/cluster_interface.go (new file, 63 lines)
@@ -0,0 +1,63 @@
package framework

import (
	"os"
	"testing"

	"github.com/seaweedfs/seaweedfs/test/volume_server/matrix"
)

// TestCluster is the common interface for single-volume cluster harnesses.
// Both *Cluster (Go volume) and *RustCluster (Rust volume) satisfy it.
type TestCluster interface {
	MasterAddress() string
	VolumeAdminAddress() string
	VolumePublicAddress() string
	VolumeGRPCAddress() string
	VolumeServerAddress() string
	MasterURL() string
	VolumeAdminURL() string
	VolumePublicURL() string
	BaseDir() string
	Stop()
}

func useRustVolumeServer() bool {
	return os.Getenv("VOLUME_SERVER_IMPL") == "rust"
}

// StartVolumeCluster starts a single-volume cluster using either the Go or
// Rust volume server, depending on the VOLUME_SERVER_IMPL environment variable.
// Set VOLUME_SERVER_IMPL=rust to use the Rust volume server.
func StartVolumeCluster(t testing.TB, profile matrix.Profile) TestCluster {
	t.Helper()
	if useRustVolumeServer() {
		return StartRustVolumeCluster(t, profile)
	}
	return StartSingleVolumeCluster(t, profile)
}

// MultiCluster is the common interface for multi-volume cluster harnesses.
// Both *MultiVolumeCluster (Go) and *RustMultiVolumeCluster (Rust) satisfy it.
type MultiCluster interface {
	MasterAddress() string
	MasterURL() string
	BaseDir() string
	VolumeAdminAddress(index int) string
	VolumeAdminURL(index int) string
	VolumePublicAddress(index int) string
	VolumePublicURL(index int) string
	VolumeGRPCAddress(index int) string
	Stop()
}

// StartMultiVolumeClusterAuto starts a multi-volume cluster using either Go or
// Rust volume servers, depending on the VOLUME_SERVER_IMPL environment variable.
// Set VOLUME_SERVER_IMPL=rust to use Rust volume servers.
func StartMultiVolumeClusterAuto(t testing.TB, profile matrix.Profile, count int) MultiCluster {
	t.Helper()
	if useRustVolumeServer() {
		return StartRustMultiVolumeCluster(t, profile, count)
	}
	return StartMultiVolumeCluster(t, profile, count)
}
test/volume_server/framework/cluster_interface_test.go (new file, 20 lines)
@@ -0,0 +1,20 @@
package framework

import "testing"

func TestUseRustVolumeServer(t *testing.T) {
	t.Setenv("VOLUME_SERVER_IMPL", "rust")
	if !useRustVolumeServer() {
		t.Fatalf("expected rust selection when VOLUME_SERVER_IMPL=rust")
	}

	t.Setenv("VOLUME_SERVER_IMPL", "go")
	if useRustVolumeServer() {
		t.Fatalf("expected go selection when VOLUME_SERVER_IMPL=go")
	}

	t.Setenv("VOLUME_SERVER_IMPL", "")
	if useRustVolumeServer() {
		t.Fatalf("expected go selection when VOLUME_SERVER_IMPL is unset")
	}
}
test/volume_server/framework/cluster_multi_rust.go (new file, 289 lines)
@@ -0,0 +1,289 @@
package framework

import (
	"fmt"
	"net"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"sync"
	"testing"

	"github.com/seaweedfs/seaweedfs/test/volume_server/matrix"
)

// RustMultiVolumeCluster wraps a Go master + multiple Rust volume servers
// for integration testing. It mirrors MultiVolumeCluster but uses the Rust
// volume binary instead of the Go weed binary for volume servers.
type RustMultiVolumeCluster struct {
	testingTB testing.TB
	profile   matrix.Profile

	weedBinary       string // Go weed binary (for the master)
	rustVolumeBinary string // Rust volume binary

	baseDir           string
	configDir         string
	logsDir           string
	keepLogs          bool
	volumeServerCount int

	masterPort     int
	masterGrpcPort int

	volumePorts     []int
	volumeGrpcPorts []int
	volumePubPorts  []int

	masterCmd  *exec.Cmd
	volumeCmds []*exec.Cmd

	cleanupOnce sync.Once
}

// StartRustMultiVolumeCluster starts a cluster with a Go master and the
// specified number of Rust volume servers.
func StartRustMultiVolumeCluster(t testing.TB, profile matrix.Profile, serverCount int) *RustMultiVolumeCluster {
	t.Helper()

	if serverCount < 1 {
		t.Fatalf("serverCount must be at least 1, got %d", serverCount)
	}

	weedBinary, err := FindOrBuildWeedBinary()
	if err != nil {
		t.Fatalf("resolve weed binary: %v", err)
	}

	rustBinary, err := FindOrBuildRustBinary()
	if err != nil {
		t.Fatalf("resolve rust volume binary: %v", err)
	}

	baseDir, keepLogs, err := newWorkDir()
	if err != nil {
		t.Fatalf("create temp test directory: %v", err)
	}

	configDir := filepath.Join(baseDir, "config")
	logsDir := filepath.Join(baseDir, "logs")
	masterDataDir := filepath.Join(baseDir, "master")

	// Create directories for master and all volume servers
	dirs := []string{configDir, logsDir, masterDataDir}
	for i := 0; i < serverCount; i++ {
		dirs = append(dirs, filepath.Join(baseDir, fmt.Sprintf("volume%d", i)))
	}
	for _, dir := range dirs {
		if mkErr := os.MkdirAll(dir, 0o755); mkErr != nil {
			t.Fatalf("create %s: %v", dir, mkErr)
		}
	}

	if err = writeSecurityConfig(configDir, profile); err != nil {
		t.Fatalf("write security config: %v", err)
	}

	masterPort, masterGrpcPort, err := allocateMasterPortPair()
	if err != nil {
		t.Fatalf("allocate master port pair: %v", err)
	}

	// Allocate ports for all volume servers (3 ports per server: admin, grpc, public)
	// If SplitPublicPort is true, we need an additional port per server
	portsPerServer := 3
	if profile.SplitPublicPort {
		portsPerServer = 4
	}
	totalPorts := serverCount * portsPerServer
	ports, err := allocatePorts(totalPorts)
	if err != nil {
		t.Fatalf("allocate volume ports: %v", err)
	}

	c := &RustMultiVolumeCluster{
		testingTB:         t,
		profile:           profile,
		weedBinary:        weedBinary,
		rustVolumeBinary:  rustBinary,
		baseDir:           baseDir,
		configDir:         configDir,
		logsDir:           logsDir,
		keepLogs:          keepLogs,
		volumeServerCount: serverCount,
		masterPort:        masterPort,
		masterGrpcPort:    masterGrpcPort,
		volumePorts:       make([]int, serverCount),
		volumeGrpcPorts:   make([]int, serverCount),
		volumePubPorts:    make([]int, serverCount),
		volumeCmds:        make([]*exec.Cmd, serverCount),
	}

	// Assign ports to each volume server
	for i := 0; i < serverCount; i++ {
		baseIdx := i * portsPerServer
		c.volumePorts[i] = ports[baseIdx]
		c.volumeGrpcPorts[i] = ports[baseIdx+1]

		// Assign public port, using baseIdx+3 if SplitPublicPort, else baseIdx+2
		pubPortIdx := baseIdx + 2
		if profile.SplitPublicPort {
			pubPortIdx = baseIdx + 3
		}
		c.volumePubPorts[i] = ports[pubPortIdx]
	}

	// Start master (Go)
	if err = c.startMaster(masterDataDir); err != nil {
		c.Stop()
		t.Fatalf("start master: %v", err)
	}
	helper := &Cluster{logsDir: logsDir}
	if err = helper.waitForHTTP(c.MasterURL() + "/dir/status"); err != nil {
		masterLog := helper.tailLog("master.log")
		c.Stop()
		t.Fatalf("wait for master readiness: %v\nmaster log tail:\n%s", err, masterLog)
	}

	// Start all Rust volume servers
	for i := 0; i < serverCount; i++ {
		volumeDataDir := filepath.Join(baseDir, fmt.Sprintf("volume%d", i))
		if err = c.startRustVolume(i, volumeDataDir); err != nil {
			volumeLog := fmt.Sprintf("volume%d.log", i)
			c.Stop()
			t.Fatalf("start rust volume server %d: %v\nvolume log tail:\n%s", i, err, helper.tailLog(volumeLog))
		}
		if err = helper.waitForHTTP(c.VolumeAdminURL(i) + "/healthz"); err != nil {
			volumeLog := fmt.Sprintf("volume%d.log", i)
			c.Stop()
			t.Fatalf("wait for rust volume server %d readiness: %v\nvolume log tail:\n%s", i, err, helper.tailLog(volumeLog))
		}
		if err = helper.waitForTCP(c.VolumeGRPCAddress(i)); err != nil {
			volumeLog := fmt.Sprintf("volume%d.log", i)
			c.Stop()
			t.Fatalf("wait for rust volume server %d grpc readiness: %v\nvolume log tail:\n%s", i, err, helper.tailLog(volumeLog))
		}
	}

	t.Cleanup(func() {
		c.Stop()
	})

	return c
}

func (c *RustMultiVolumeCluster) Stop() {
	if c == nil {
		return
	}
	c.cleanupOnce.Do(func() {
		// Stop volume servers in reverse order
		for i := len(c.volumeCmds) - 1; i >= 0; i-- {
			stopProcess(c.volumeCmds[i])
		}
		stopProcess(c.masterCmd)
		if !c.keepLogs && !c.testingTB.Failed() {
			_ = os.RemoveAll(c.baseDir)
		} else if c.baseDir != "" {
			c.testingTB.Logf("rust multi volume server integration logs kept at %s", c.baseDir)
		}
	})
}

func (c *RustMultiVolumeCluster) startMaster(dataDir string) error {
	logFile, err := os.Create(filepath.Join(c.logsDir, "master.log"))
	if err != nil {
		return err
	}

	args := []string{
		"-config_dir=" + c.configDir,
		"master",
		"-ip=127.0.0.1",
		"-port=" + strconv.Itoa(c.masterPort),
		"-port.grpc=" + strconv.Itoa(c.masterGrpcPort),
		"-mdir=" + dataDir,
		"-peers=none",
		"-volumeSizeLimitMB=" + strconv.Itoa(testVolumeSizeLimitMB),
		"-defaultReplication=000",
	}

	c.masterCmd = exec.Command(c.weedBinary, args...)
	c.masterCmd.Dir = c.baseDir
	c.masterCmd.Stdout = logFile
	c.masterCmd.Stderr = logFile
	return c.masterCmd.Start()
}

func (c *RustMultiVolumeCluster) startRustVolume(index int, dataDir string) error {
	logName := fmt.Sprintf("volume%d.log", index)
	logFile, err := os.Create(filepath.Join(c.logsDir, logName))
	if err != nil {
		return err
	}

	args := rustVolumeArgs(
		c.profile,
		c.configDir,
		c.masterPort,
		c.volumePorts[index],
		c.volumeGrpcPorts[index],
		c.volumePubPorts[index],
		dataDir,
	)

	cmd := exec.Command(c.rustVolumeBinary, args...)
	cmd.Dir = c.baseDir
	cmd.Stdout = logFile
	cmd.Stderr = logFile

	if err = cmd.Start(); err != nil {
		return err
	}
	c.volumeCmds[index] = cmd
	return nil
}

// --- accessor methods (mirror MultiVolumeCluster) ---

func (c *RustMultiVolumeCluster) MasterAddress() string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(c.masterPort))
}

func (c *RustMultiVolumeCluster) MasterURL() string {
	return "http://" + c.MasterAddress()
}

func (c *RustMultiVolumeCluster) VolumeAdminAddress(index int) string {
	if index < 0 || index >= len(c.volumePorts) {
		return ""
	}
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(c.volumePorts[index]))
}

func (c *RustMultiVolumeCluster) VolumePublicAddress(index int) string {
	if index < 0 || index >= len(c.volumePubPorts) {
		return ""
	}
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(c.volumePubPorts[index]))
}

func (c *RustMultiVolumeCluster) VolumeGRPCAddress(index int) string {
	if index < 0 || index >= len(c.volumeGrpcPorts) {
		return ""
	}
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(c.volumeGrpcPorts[index]))
}

func (c *RustMultiVolumeCluster) VolumeAdminURL(index int) string {
	return "http://" + c.VolumeAdminAddress(index)
}

func (c *RustMultiVolumeCluster) VolumePublicURL(index int) string {
	return "http://" + c.VolumePublicAddress(index)
}

func (c *RustMultiVolumeCluster) BaseDir() string {
	return c.baseDir
}
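The port assignment above slices one flat allocation into a fixed-size stride per server (3 ports, or 4 when the public port is split from the admin port). A standalone sketch of just that index arithmetic (the `portsFor` helper is hypothetical; the stride rule follows `StartRustMultiVolumeCluster`):

```go
package main

import "fmt"

// portsFor returns (admin, grpc, public) for server i from a flat port
// list. Without a split public port the admin and public slots share a
// 3-port stride; with a split, each server consumes 4 ports and the
// public port is the fourth.
func portsFor(ports []int, i int, splitPublic bool) (int, int, int) {
	stride := 3
	if splitPublic {
		stride = 4
	}
	base := i * stride
	pubIdx := base + 2
	if splitPublic {
		pubIdx = base + 3
	}
	return ports[base], ports[base+1], ports[pubIdx]
}

func main() {
	// Two servers, 3 ports each (no split public port).
	ports := []int{8080, 18080, 8081, 8090, 18090, 8091}
	admin, grpc, pub := portsFor(ports, 1, false)
	fmt.Println(admin, grpc, pub) // 8090 18090 8091
}
```

Allocating all ports up front, before any server starts, avoids races where a later allocation grabs a port an earlier server is about to bind.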
test/volume_server/framework/cluster_rust.go (new file, 342 lines)
@@ -0,0 +1,342 @@
package framework

import (
	"bytes"
	"fmt"
	"net"
	"os"
	"os/exec"
	"path/filepath"
	"runtime"
	"strconv"
	"sync"
	"testing"

	"github.com/seaweedfs/seaweedfs/test/volume_server/matrix"
)

// RustCluster wraps a Go master + Rust volume server for integration testing.
type RustCluster struct {
	testingTB testing.TB
	profile   matrix.Profile

	weedBinary       string // Go weed binary (for the master)
	rustVolumeBinary string // Rust volume binary

	baseDir   string
	configDir string
	logsDir   string
	keepLogs  bool

	masterPort     int
	masterGrpcPort int
	volumePort     int
	volumeGrpcPort int
	volumePubPort  int

	masterCmd *exec.Cmd
	volumeCmd *exec.Cmd

	cleanupOnce sync.Once
}

var (
	rustBinaryOnce sync.Once
	rustBinaryPath string
	rustBinaryErr  error
)

// StartRustVolumeCluster starts a Go master + Rust volume server.
func StartRustVolumeCluster(t testing.TB, profile matrix.Profile) *RustCluster {
	t.Helper()

	weedBinary, err := FindOrBuildWeedBinary()
	if err != nil {
		t.Fatalf("resolve weed binary: %v", err)
	}

	rustBinary, err := FindOrBuildRustBinary()
	if err != nil {
		t.Fatalf("resolve rust volume binary: %v", err)
	}

	baseDir, keepLogs, err := newWorkDir()
	if err != nil {
		t.Fatalf("create temp test directory: %v", err)
	}

	configDir := filepath.Join(baseDir, "config")
	logsDir := filepath.Join(baseDir, "logs")
	masterDataDir := filepath.Join(baseDir, "master")
	volumeDataDir := filepath.Join(baseDir, "volume")
	for _, dir := range []string{configDir, logsDir, masterDataDir, volumeDataDir} {
		if mkErr := os.MkdirAll(dir, 0o755); mkErr != nil {
			t.Fatalf("create %s: %v", dir, mkErr)
		}
	}

	if err = writeSecurityConfig(configDir, profile); err != nil {
		t.Fatalf("write security config: %v", err)
	}

	masterPort, masterGrpcPort, err := allocateMasterPortPair()
	if err != nil {
		t.Fatalf("allocate master port pair: %v", err)
	}

	ports, err := allocatePorts(3)
	if err != nil {
		t.Fatalf("allocate ports: %v", err)
	}

	rc := &RustCluster{
		testingTB:        t,
		profile:          profile,
		weedBinary:       weedBinary,
		rustVolumeBinary: rustBinary,
		baseDir:          baseDir,
		configDir:        configDir,
		logsDir:          logsDir,
		keepLogs:         keepLogs,
		masterPort:       masterPort,
		masterGrpcPort:   masterGrpcPort,
		volumePort:       ports[0],
		volumeGrpcPort:   ports[1],
		volumePubPort:    ports[0],
	}
	if profile.SplitPublicPort {
		rc.volumePubPort = ports[2]
	}

	if err = rc.startMaster(masterDataDir); err != nil {
		rc.Stop()
		t.Fatalf("start master: %v", err)
	}
	// Reuse the same HTTP readiness helper via an unexported Cluster shim.
	helper := &Cluster{logsDir: logsDir}
	if err = helper.waitForHTTP(rc.MasterURL() + "/dir/status"); err != nil {
		masterLog := helper.tailLog("master.log")
		rc.Stop()
		t.Fatalf("wait for master readiness: %v\nmaster log tail:\n%s", err, masterLog)
	}

	if err = rc.startRustVolume(volumeDataDir); err != nil {
		masterLog := helper.tailLog("master.log")
		rc.Stop()
		t.Fatalf("start rust volume: %v\nmaster log tail:\n%s", err, masterLog)
	}
	if err = helper.waitForHTTP(rc.VolumeAdminURL() + "/healthz"); err != nil {
		volumeLog := helper.tailLog("volume.log")
		rc.Stop()
		t.Fatalf("wait for rust volume readiness: %v\nvolume log tail:\n%s", err, volumeLog)
	}
	if err = helper.waitForTCP(rc.VolumeGRPCAddress()); err != nil {
		volumeLog := helper.tailLog("volume.log")
		rc.Stop()
		t.Fatalf("wait for rust volume grpc readiness: %v\nvolume log tail:\n%s", err, volumeLog)
	}

	t.Cleanup(func() {
		rc.Stop()
	})

	return rc
}

// Stop terminates all processes and cleans temporary files.
func (rc *RustCluster) Stop() {
	if rc == nil {
		return
	}
	rc.cleanupOnce.Do(func() {
		stopProcess(rc.volumeCmd)
		stopProcess(rc.masterCmd)
		if !rc.keepLogs && !rc.testingTB.Failed() {
			_ = os.RemoveAll(rc.baseDir)
		} else if rc.baseDir != "" {
			rc.testingTB.Logf("rust volume server integration logs kept at %s", rc.baseDir)
		}
	})
}

func (rc *RustCluster) startMaster(dataDir string) error {
	logFile, err := os.Create(filepath.Join(rc.logsDir, "master.log"))
	if err != nil {
		return err
	}

	args := []string{
		"-config_dir=" + rc.configDir,
		"master",
		"-ip=127.0.0.1",
		"-port=" + strconv.Itoa(rc.masterPort),
		"-port.grpc=" + strconv.Itoa(rc.masterGrpcPort),
		"-mdir=" + dataDir,
		"-peers=none",
		"-volumeSizeLimitMB=" + strconv.Itoa(testVolumeSizeLimitMB),
		"-defaultReplication=000",
	}

	rc.masterCmd = exec.Command(rc.weedBinary, args...)
	rc.masterCmd.Dir = rc.baseDir
	rc.masterCmd.Stdout = logFile
	rc.masterCmd.Stderr = logFile
	return rc.masterCmd.Start()
}

func rustVolumeArgs(
	profile matrix.Profile,
	configDir string,
	masterPort int,
	volumePort int,
	volumeGrpcPort int,
	volumePubPort int,
	dataDir string,
) []string {
	args := []string{
		"--port", strconv.Itoa(volumePort),
		"--port.grpc", strconv.Itoa(volumeGrpcPort),
		"--port.public", strconv.Itoa(volumePubPort),
		"--ip", "127.0.0.1",
		"--ip.bind", "127.0.0.1",
		"--dir", dataDir,
		"--max", "16",
		"--master", "127.0.0.1:" + strconv.Itoa(masterPort),
		"--securityFile", filepath.Join(configDir, "security.toml"),
		"--readMode", profile.ReadMode,
		"--concurrentUploadLimitMB", strconv.Itoa(profile.ConcurrentUploadLimitMB),
		"--concurrentDownloadLimitMB", strconv.Itoa(profile.ConcurrentDownloadLimitMB),
		"--preStopSeconds", "0",
	}
	if profile.InflightUploadTimeout > 0 {
		args = append(args, "--inflightUploadDataTimeout", profile.InflightUploadTimeout.String())
	}
	if profile.InflightDownloadTimeout > 0 {
		args = append(args, "--inflightDownloadDataTimeout", profile.InflightDownloadTimeout.String())
	}
	return args
}

func (rc *RustCluster) startRustVolume(dataDir string) error {
	logFile, err := os.Create(filepath.Join(rc.logsDir, "volume.log"))
	if err != nil {
		return err
	}

	args := rustVolumeArgs(
		rc.profile,
		rc.configDir,
		rc.masterPort,
		rc.volumePort,
		rc.volumeGrpcPort,
		rc.volumePubPort,
		dataDir,
	)

	rc.volumeCmd = exec.Command(rc.rustVolumeBinary, args...)
	rc.volumeCmd.Dir = rc.baseDir
	rc.volumeCmd.Stdout = logFile
	rc.volumeCmd.Stderr = logFile
	return rc.volumeCmd.Start()
}

// FindOrBuildRustBinary returns an executable Rust volume binary, building one when needed.
func FindOrBuildRustBinary() (string, error) {
	if fromEnv := os.Getenv("RUST_VOLUME_BINARY"); fromEnv != "" {
		if isExecutableFile(fromEnv) {
			return fromEnv, nil
		}
		return "", fmt.Errorf("RUST_VOLUME_BINARY is set but not executable: %s", fromEnv)
	}

	rustBinaryOnce.Do(func() {
		// Derive the Rust volume crate directory from this source file's location.
		rustCrateDir := ""
		if _, file, _, ok := runtime.Caller(0); ok {
			repoRoot := filepath.Clean(filepath.Join(filepath.Dir(file), "..", "..", ".."))
			for _, candidate := range []string{"seaweed-volume", "weed-volume"} {
				dir := filepath.Join(repoRoot, candidate)
				if isDir(dir) && isFile(filepath.Join(dir, "Cargo.toml")) {
					rustCrateDir = dir
					break
				}
			}
		}
		if rustCrateDir == "" {
			rustBinaryErr = fmt.Errorf("unable to detect Rust volume crate directory")
			return
		}

		releaseBin := filepath.Join(rustCrateDir, "target", "release", "weed-volume")

		// Always rebuild once per test process so the harness uses current source and features.
		cmd := exec.Command("cargo", "build", "--release")
		cmd.Dir = rustCrateDir
		var out bytes.Buffer
		cmd.Stdout = &out
		cmd.Stderr = &out
		if err := cmd.Run(); err != nil {
			rustBinaryErr = fmt.Errorf("build rust volume binary: %w\n%s", err, out.String())
			return
		}
		if !isExecutableFile(releaseBin) {
			rustBinaryErr = fmt.Errorf("built rust volume binary is not executable: %s", releaseBin)
			return
		}
		rustBinaryPath = releaseBin
	})

	if rustBinaryErr != nil {
		return "", rustBinaryErr
	}
	return rustBinaryPath, nil
}

func isDir(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

func isFile(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}

// --- accessor methods (mirror Cluster) ---

func (rc *RustCluster) MasterAddress() string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(rc.masterPort))
}

func (rc *RustCluster) VolumeAdminAddress() string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(rc.volumePort))
}

func (rc *RustCluster) VolumePublicAddress() string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(rc.volumePubPort))
}

func (rc *RustCluster) VolumeGRPCAddress() string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(rc.volumeGrpcPort))
}

// VolumeServerAddress returns SeaweedFS server address format: ip:httpPort.grpcPort
func (rc *RustCluster) VolumeServerAddress() string {
	return fmt.Sprintf("%s.%d", rc.VolumeAdminAddress(), rc.volumeGrpcPort)
}

func (rc *RustCluster) MasterURL() string {
	return "http://" + rc.MasterAddress()
}

func (rc *RustCluster) VolumeAdminURL() string {
	return "http://" + rc.VolumeAdminAddress()
}

func (rc *RustCluster) VolumePublicURL() string {
	return "http://" + rc.VolumePublicAddress()
}

func (rc *RustCluster) BaseDir() string {
	return rc.baseDir
}
test/volume_server/framework/cluster_rust_test.go (new file, 38 lines)
@@ -0,0 +1,38 @@
+package framework
+
+import (
+	"testing"
+	"time"
+
+	"github.com/seaweedfs/seaweedfs/test/volume_server/matrix"
+)
+
+func TestRustVolumeArgsIncludeReadMode(t *testing.T) {
+	profile := matrix.P1()
+	profile.ReadMode = "redirect"
+	profile.ConcurrentUploadLimitMB = 7
+	profile.ConcurrentDownloadLimitMB = 9
+	profile.InflightUploadTimeout = 3 * time.Second
+	profile.InflightDownloadTimeout = 4 * time.Second
+
+	args := rustVolumeArgs(profile, "/tmp/config", 9333, 18080, 28080, 38080, "/tmp/data")
+
+	assertArgPair(t, args, "--readMode", "redirect")
+	assertArgPair(t, args, "--concurrentUploadLimitMB", "7")
+	assertArgPair(t, args, "--concurrentDownloadLimitMB", "9")
+	assertArgPair(t, args, "--inflightUploadDataTimeout", "3s")
+	assertArgPair(t, args, "--inflightDownloadDataTimeout", "4s")
+}
+
+func assertArgPair(t *testing.T, args []string, flag string, want string) {
+	t.Helper()
+	for i := 0; i+1 < len(args); i += 2 {
+		if args[i] == flag {
+			if args[i+1] != want {
+				t.Fatalf("%s value mismatch: got %q want %q", flag, args[i+1], want)
+			}
+			return
+		}
+	}
+	t.Fatalf("missing %s in args: %v", flag, args)
+}
@@ -2,6 +2,7 @@ package volume_server_grpc_test
 
 import (
 	"context"
+	"io"
 	"net/http"
 	"strings"
 	"testing"
@@ -18,7 +19,7 @@ func TestVolumeNeedleStatusForUploadedFile(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -57,12 +58,57 @@ func TestVolumeNeedleStatusForUploadedFile(t *testing.T) {
 	}
 }
 
+func TestVolumeNeedleStatusIncludesTtlAndLastModified(t *testing.T) {
+	if testing.Short() {
+		t.Skip("skipping integration test in short mode")
+	}
+
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
+	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
+	defer conn.Close()
+
+	const volumeID = uint32(27)
+	const needleID = uint64(778901)
+	const cookie = uint32(0xA1B2C3D6)
+	framework.AllocateVolume(t, grpcClient, volumeID, "")
+
+	fid := framework.NewFileID(volumeID, needleID, cookie)
+	client := framework.NewHTTPClient()
+	uploadReq := mustNewRequest(t, http.MethodPost, clusterHarness.VolumeAdminURL()+"/"+fid+"?ttl=7d&ts=1700000000")
+	uploadReq.Body = io.NopCloser(strings.NewReader("needle-status-ttl-payload"))
+	uploadReq.ContentLength = int64(len("needle-status-ttl-payload"))
+	uploadReq.Header.Set("Content-Type", "application/octet-stream")
+	uploadResp := framework.DoRequest(t, client, uploadReq)
+	_ = framework.ReadAllAndClose(t, uploadResp)
+	if uploadResp.StatusCode != http.StatusCreated {
+		t.Fatalf("upload status: expected 201, got %d", uploadResp.StatusCode)
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	defer cancel()
+
+	statusResp, err := grpcClient.VolumeNeedleStatus(ctx, &volume_server_pb.VolumeNeedleStatusRequest{
+		VolumeId: volumeID,
+		NeedleId: needleID,
+	})
+	if err != nil {
+		t.Fatalf("VolumeNeedleStatus with ttl failed: %v", err)
+	}
+	// Go's ReadTTL normalizes via fitTtlCount: 7d → 1w (7 days = 1 week)
+	if statusResp.GetTtl() != "1w" {
+		t.Fatalf("ttl mismatch: got %q want %q", statusResp.GetTtl(), "1w")
+	}
+	if statusResp.GetLastModified() != 1700000000 {
+		t.Fatalf("last modified mismatch: got %d want %d", statusResp.GetLastModified(), 1700000000)
+	}
+}
 
 func TestVolumeNeedleStatusViaEcShardsWhenNormalVolumeUnmounted(t *testing.T) {
 	if testing.Short() {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -138,7 +184,7 @@ func TestVolumeNeedleStatusMissingVolumeAndNeedle(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -185,7 +231,7 @@ func TestVolumeConfigureInvalidReplication(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -215,7 +261,7 @@ func TestVolumeConfigureSuccessAndMissingRollbackPath(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -268,7 +314,7 @@ func TestPingVolumeTargetAndLeaveAffectsHealthz(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -311,7 +357,7 @@ func TestVolumeServerLeaveIsIdempotent(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -346,7 +392,7 @@ func TestPingUnknownAndUnreachableTargetPaths(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -367,6 +413,17 @@ func TestPingUnknownAndUnreachableTargetPaths(t *testing.T) {
 		t.Fatalf("Ping unknown target type expected stop_time_ns >= start_time_ns")
 	}
 
+	emptyTargetResp, err := grpcClient.Ping(ctx, &volume_server_pb.PingRequest{})
+	if err != nil {
+		t.Fatalf("Ping empty target should not return grpc error, got: %v", err)
+	}
+	if emptyTargetResp.GetRemoteTimeNs() != 0 {
+		t.Fatalf("Ping empty target expected remote_time_ns=0, got %d", emptyTargetResp.GetRemoteTimeNs())
+	}
+	if emptyTargetResp.GetStopTimeNs() < emptyTargetResp.GetStartTimeNs() {
+		t.Fatalf("Ping empty target expected stop_time_ns >= start_time_ns")
+	}
+
 	_, err = grpcClient.Ping(ctx, &volume_server_pb.PingRequest{
 		TargetType: cluster.MasterType,
 		Target:     "127.0.0.1:1",
@@ -395,7 +452,7 @@ func TestPingMasterTargetSuccess(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -19,7 +19,7 @@ func TestVolumeAdminLifecycleRPCs(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -62,7 +62,7 @@ func TestVolumeDeleteOnlyEmptyVariants(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -101,7 +101,7 @@ func TestMaintenanceModeRejectsAllocateVolume(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -133,7 +133,7 @@ func TestAllocateDuplicateAndMountUnmountMissingVariants(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -184,7 +184,7 @@ func TestMaintenanceModeRejectsVolumeDelete(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -16,7 +16,7 @@ func TestVolumeMarkReadonlyAndWritableLifecycle(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -61,7 +61,7 @@ func TestVolumeMarkReadonlyPersistTrue(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -93,7 +93,7 @@ func TestVolumeMarkReadonlyWritableErrorPaths(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -146,7 +146,7 @@ func TestDeleteCollectionRemovesVolumeAndIsIdempotent(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	clusterHarness := framework.StartSingleVolumeCluster(t, matrix.P1())
+	clusterHarness := framework.StartVolumeCluster(t, matrix.P1())
 	conn, grpcClient := framework.DialVolumeServer(t, clusterHarness.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -18,7 +18,7 @@ func TestBatchDeleteInvalidFidAndMaintenanceMode(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -61,7 +61,7 @@ func TestBatchDeleteCookieMismatchAndSkipCheck(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -124,7 +124,7 @@ func TestBatchDeleteMixedStatusesAndMismatchStopsProcessing(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
 
@@ -219,7 +219,7 @@ func TestBatchDeleteRejectsChunkManifestNeedles(t *testing.T) {
 		t.Skip("skipping integration test in short mode")
 	}
 
-	cluster := framework.StartSingleVolumeCluster(t, matrix.P1())
+	cluster := framework.StartVolumeCluster(t, matrix.P1())
 	conn, client := framework.DialVolumeServer(t, cluster.VolumeGRPCAddress())
 	defer conn.Close()
Some files were not shown because too many files have changed in this diff.