* Match Go gRPC client transport defaults
* Honor Go HTTP idle timeout
* Honor maintenanceMBps during volume copy
* Honor images.fix.orientation on uploads
* Honor cpuprofile when pprof is disabled
* Match Go memory status payloads
* Propagate request IDs across gRPC calls
* Format pending Rust source updates
* Match Go stats endpoint payloads
* Serve Go volume server UI assets
* Enforce Go HTTP whitelist guards
* Align Rust metrics admin-port test with Go behavior
* Format pending Rust server updates
* Honor access.ui without per-request JWT checks
* Honor keepLocalDatFile in tier upload shortcut
* Honor Go remote volume write mode
* Load tier backends from master config
* Check master config before loading volumes
* Remove vif files on volume destroy
* Delete remote tier data on volume destroy
* Honor vif version defaults and overrides
* Reject mismatched vif bytes offsets
* Load remote-only tiered volumes
* Report Go tail offsets in sync status
* Stream remote dat in incremental copy
* Honor collection vif for EC shard config
* Persist EC expireAtSec in vif metadata
* Stream remote volume reads through HTTP
* Serve HTTP ranges from backend source
* Match Go ReadAllNeedles scan order
* Match Go CopyFile zero-stop metadata
* Delete EC volumes with collection cleanup
* Drop deleted collection metrics
* Match Go tombstone ReadNeedleMeta
* Match Go TTL parsing: all-digit values default to minutes, two-pass fit algorithm
* Match Go needle ID/cookie formatting and name size computation
* Match Go image ext checks: webp resize only, no crop; empty healthz body
* Match Go Prometheus metric names and add missing handler counter constants
* Match Go ReplicaPlacement short string parsing with zero-padding
* Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS
* Add walk_ecx_stats for accurate EC volume file counts and size
* Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision
* Match Go needle map: unconditional delete counter, fix redb idx walk offset
* Add CompactMapSegment overflow panic guard matching Go
* Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback
* Match Go 304 Not Modified: return bare status with no headers
* Match Go JWT error message: use "wrong jwt" instead of detailed error
* Match Go read handler bare 400, delete error prefix, download throttle timeout
* Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix
* Match Go heartbeat: keep is_heartbeating on error, add EC shard identification
* Match Go needle ReadBytes V2: tolerate EOF on truncated body
* Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard
* Match Go DeleteCollection: propagate destroy errors
* Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code
* Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding
* Match Go Guard isWriteActive: combine whitelist and signing key check
* Match Go DeleteCollectionMetrics: use partial label matching
* Match Go heartbeat: send state-only delta on volume state changes
* Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking
* Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs
* Add read_ec_shard_needle for full needle reconstruction from local EC shards
* Make heartbeat master config helpers pub for VolumeCopy preallocation
* Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording
* Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error
* Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling
* Add rebuild_ecx_file for EC index reconstruction from data shards
* Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation
* Add EC volume read and delete support in HTTP handlers
* Add per-shard EC mount/unmount, location predicate search, idx directory for EC
* Add CheckVolumeDataIntegrity on volume load matching Go
* Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type
* Match Go heartbeat metric accounting
* Match Go duplicate UUID heartbeat retries
* Delete expired EC volumes during heartbeat
* Match Go volume heartbeat pruning
* Honor master preallocate in volume max
* Report remote storage info in heartbeats
* Emit EC heartbeat deltas on shard changes
* Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space
* Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns
* Match Go VolumeUnmount: idempotent success when volume not found
* Match Go TTL Display: return empty string when unit is Empty
Go checks `t.Unit == Empty` separately and returns "" for TTLs
with nonzero count but Empty unit. Rust only checked is_empty()
(count==0 && unit==0), so count>0 with unit=0 would format as
"5 " instead of "".
* Match Go error behavior for truncated needle data in read_body_v2
Go's readNeedleDataVersion2 returns "index out of range %d" errors
(indices 1-7) when needle body or metadata fields are truncated.
Rust was silently tolerating truncation and returning Ok. Now returns
NeedleError::IndexOutOfRange with the matching index for each field.
* Match Go download throttle: return JSON error instead of plain text
* Match Go crop params: default x1/y1 to 0 when not provided
* Match Go ScrubEcVolume: accumulate total_files from EC shards
* Match Go ScrubVolume: count total_files even on scrub error
* Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif
* Match Go VolumeTailSender: send needle_header on every chunk
* Match Go read_super_block: apply replication override from .vif
* Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption
* Match Go WriteNeedleBlob: dedup check before writing during replication
* handlers: use meta-only reads for HEAD
* handlers: align range parsing and responses with Go
* handlers: align upload parsing with Go
* deps: enable webp support
* Make 5bytes the default feature for idx entry compatibility
* Match Go TTL: preserve original unit when count fits in byte
* Fix EC locate_needle: use get_actual_size for full needle size
* Fix raw body POST: only parse multipart when Content-Type contains form-data
* Match Go ReceiveFile: return protocol errors in response body, not gRPC status
* add docs
* Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating
* Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing
* Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection
* Match Go privateStoreHandler: use proper JSON error for unsupported methods
* Match Go Destroy: add only_empty parameter to reject non-empty volume deletion
* Fix compilation: set_read_only_persist and set_writable return ()
These methods fire-and-forget save_vif internally, so gRPC callers
should not try to chain .map_err() on the unit return type.
* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif
* Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth
The gRPC VolumeDelete handler had a pre-check for only_empty but then
passed false to store.delete_volume(), bypassing the store-level check.
Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same
for defense in depth against TOCTOU races (though the store write lock
makes this unlikely).
* Match Go ProcessRangeRequest: return full content for empty/oversized ranges
Go returns nil from ProcessRangeRequest when ranges are empty or total
range size exceeds content length, causing the caller to serve the full
content as a normal 200 response. Rust was returning an empty 200 body.
* Match Go Query: quote JSON keys in output records
Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}.
Rust was producing invalid JSON with unquoted keys like {name:"Alice"}.
* Match Go VolumeCopy: reject when no suitable disk location exists
Go returns ErrVolumeNoSpaceLeft when no location matches the disk type
and has sufficient space. Rust had an unsafe fallback that silently
picked the first location regardless of type or available space.
* Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete
Go checks v.noWriteOrDelete before proceeding with needle deletion,
returning "volume is read only" if true. Rust was skipping this check.
* Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors
Two fixes: (1) Go prefers HardDriveType disk location for EC volumes,
falling back to first location. Returns "no storage location available"
when no locations exist. (2) Write failures are now response-level
errors (in response body) instead of gRPC status errors, matching Go.
* Match Go CopyFile: sync EC volume journal to disk before copying
Go calls ecVolume.Sync() before copying EC volume files to ensure the
.ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and
call it in the CopyFile EC branch.
* Match Go readSuperBlock: propagate replication parse errors
Go returns an error when parsing the replication string from the .vif
file fails. Rust was silently ignoring the parse failure and using the
super block's replication as-is.
* Match Go TTL expiry: remove append_at_ns > 0 guard
Go computes TTL expiry from AppendAtNs without guarding against zero.
When append_at_ns is 0, the expiry is epoch + TTL which is in the past,
correctly returning NotFound. Rust's extra guard skipped the check,
incorrectly returning success for such needles.
* Match Go delete_collection: skip volumes with compaction in progress
Go checks !v.isCompactionInProgress.Load() before destroying a volume
during collection deletion, skipping compacting volumes. Also changed
destroy errors to be logged instead of aborting the entire collection delete.
* Match Go MarkReadonly/MarkWritable: always notify master even on local error
Go always notifies the master regardless of whether the local
set_read_only_persist or set_writable step fails. The Rust code was
using `?` which short-circuited on error, skipping the final master
notification. Save the result and defer the `?` until after the
notify call.
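A minimal sketch of the "save the result, notify, then apply `?`" pattern described above. The function name `mark_readonly` and both closures are hypothetical stand-ins, not the actual handler code, and returning the notification error when the local step succeeded is an assumption.

```rust
/// Hypothetical handler shape: `local_step` stands in for
/// set_read_only_persist / set_writable, `notify_master` for the
/// master notification. Neither is the real server API.
fn mark_readonly<L, N>(local_step: L, notify_master: N) -> Result<(), String>
where
    L: FnOnce() -> Result<(), String>,
    N: FnOnce() -> Result<(), String>,
{
    // Run the local step but do not apply `?` yet: an early return here
    // would skip the master notification below.
    let local_result = local_step();

    // Always notify the master, whether or not the local step succeeded.
    let notify_result = notify_master();

    // Only now surface the local error (assumption: it takes priority
    // over a notification error).
    local_result?;
    notify_result
}
```

With this shape, a failing local step still produces exactly one notification before the error is returned to the caller.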
* Match Go PostHandler: return 500 for all write errors
Go returns 500 (InternalServerError) for all write failures. Rust was
returning 404 for volume-not-found and 403 for read-only volumes.
* Match Go makeupDiff: validate .cpd compaction revision is old + 1
Go reads the new .cpd file's super block and verifies the compaction
revision is exactly old + 1. Rust only validated the old revision.
* Match Go VolumeStatus: check data backend before returning status
Go checks v.DataBackend != nil before building the status response,
returning an error if missing. Rust was silently returning size 0.
* Match Go PostHandler: always include mime field in upload response JSON
Go always serializes the mime field even when empty ("mime":""). Rust was
omitting it when empty due to Option<String> with skip_serializing_if.
* Match Go FindFreeLocation: account for EC shards in free slot calculation
Go subtracts EC shard equivalents when computing available volume slots.
Rust was only comparing volume count, potentially over-counting free
slots on locations with many EC shards.
* Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods
Go records the method as INVALID in metrics for unsupported HTTP methods.
Rust was using the actual method name.
* Match Go volume: add commit_compact guard and scrub data size validation
Two fixes: (1) commit_compact now checks/sets is_compacting flag to
prevent concurrent commits, matching Go's CompareAndSwap guard.
(2) scrub now validates total needle sizes against .dat file size.
* Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check
Three fixes: (1) VolumeTailSender now propagates binary search errors
instead of silently falling back to start. (2) VolumeEcShardsInfo
returns entries for all shard slots including unmounted. (3)
VolumeEcShardRead checks .ecx index for deletions instead of .ecj.
* Match Go metrics: add BuildInfo gauge and connection tracking functions
Go exposes a BuildInfo Prometheus metric with version labels, and tracks
open connections via stats.ConnectionOpen/Close. Added both to Rust.
* Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid()
Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles
with size==0 (live but anomalous) can still be deleted. The Rust code
was using is_valid() which returns false for size==0, preventing
deletion of such needles.
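The difference between the two predicates can be sketched as follows. The `Size` newtype and the predicate definitions are assumptions modeled on the description above (tombstone stored as -1), not a copy of the real types.

```rust
/// Assumed representation: a signed needle size where -1 marks a tombstone.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Size(i32);

const TOMBSTONE_FILE_SIZE: Size = Size(-1);

impl Size {
    /// Deleted: negative size or the tombstone marker.
    fn is_deleted(self) -> bool {
        self.0 < 0 || self == TOMBSTONE_FILE_SIZE
    }

    /// Valid: strictly positive size and not the tombstone marker.
    fn is_valid(self) -> bool {
        self.0 > 0 && self != TOMBSTONE_FILE_SIZE
    }
}

/// Old guard: rejects size == 0 because is_valid() is false there.
fn deletable_old(size: Size) -> bool {
    size.is_valid()
}

/// New guard: only rejects entries that are already deleted.
fn deletable_new(size: Size) -> bool {
    !size.is_deleted()
}
```

The two guards agree for positive sizes and for tombstones; they differ only for size 0, which is the case this commit addresses.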
* Match Go fitTtlCount: always normalize TTL to coarsest unit
Go's fitTtlCount always converts to seconds first, then finds the
coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an
early return for count<=255 that skipped normalization, producing
different binary encodings for the same duration.
* Match Go BuildInfo metric: correct name and add missing labels
Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build,
Name=info) with labels [version, commit, sizelimit, goos, goarch].
Rust had SeaweedFS_volumeServer_buildInfo with only [version].
* Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag
- UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty)
- UploadResult.contentMd5: only include when request provided Content-MD5 header
- Content-MD5 response header: only set when request provided it
- DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType)
to match Go's protobuf JSON marshaling
- Chunk manifest: preserve needle ETag in expanded response headers
* Match Go volume: fix version(), integrity check, scrub, and commit_compact
- version(): use self.version() instead of self.super_block.version in
read_all_needles, check_volume_data_integrity, scan_raw_needles_from
to respect volumeInfo.version override
- check_volume_data_integrity: initialize healthy_index_size to idx_size
(matching Go) and continue on EOF instead of returning error
- scrub(): count deleted needles in total_read since they still occupy
space in the .dat file (matches Go's totalRead += actualSize for deleted)
- commit_compact: clean up .cpd/.cpx files on makeup_diff failure
(matches Go's error path cleanup)
* Match Go write queue: add 4MB batch byte limit
Go's startWorker breaks the batch at either 128 requests or 4MB of
accumulated write data. Rust only had the 128-request limit, so a batch
of large writes could grow without a byte bound and add latency.
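A rough sketch of a batch cut with both limits. The helper `next_batch_len` and the constant names are hypothetical; whether the request that crosses the 4MB mark is included in the batch is an assumption (here it is, and the batch closes right after it).

```rust
const MAX_BATCH_REQUESTS: usize = 128;
const MAX_BATCH_BYTES: usize = 4 * 1024 * 1024;

/// Returns how many requests from the front of `pending_sizes` (payload
/// sizes in bytes) would go into the next batch.
fn next_batch_len(pending_sizes: &[usize]) -> usize {
    let mut bytes = 0usize;
    let mut count = 0usize;
    for &size in pending_sizes {
        // Stop once either limit has been reached by the batch so far.
        if count >= MAX_BATCH_REQUESTS || bytes >= MAX_BATCH_BYTES {
            break;
        }
        bytes += size;
        count += 1;
    }
    count
}
```

Under these assumptions, many small writes are cut at 128 requests, while a run of 1MB writes is cut after four of them.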
* Add TTL normalization tests for Go parity verification
Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even
when count fits in a byte, matching Go's fitTtlCount behavior.
* Match Go FindFreeLocation: account for EC shards in free slot calculation
Go's free volume count subtracts both regular volumes and EC volumes
from max_volume_count. Rust was only counting regular volumes, which
could over-report available slots when EC shards are mounted.
* Match Go EC volume: mark deletions in .ecx and replay .ecj at startup
Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index
in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition
to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj
entries into .ecx on startup, then removes the .ecj file.
Rust was only appending to .ecj without marking .ecx, which meant
deleted EC needles remained readable via .ecx binary search. This
fix:
- Opens .ecx in read/write mode (was read-only)
- Adds mark_needle_deleted_in_ecx: binary search + in-place write
- Calls it from journal_delete before appending to .ecj
- Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup
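The binary search plus in-place write can be illustrated on an in-memory index. This is a sketch only: the 16-byte entry layout (8-byte big-endian needle ID, 4-byte offset, 4-byte big-endian size), the tombstone value -1, and the helper names are assumptions, and the real code operates on the .ecx file rather than a byte slice.

```rust
const ENTRY_SIZE: usize = 16; // assumed: 8-byte id + 4-byte offset + 4-byte size
const SIZE_FIELD_OFFSET: usize = 12;
const TOMBSTONE_FILE_SIZE: i32 = -1; // assumed tombstone marker

/// Reads the needle ID of the i-th entry.
fn entry_id(index: &[u8], i: usize) -> u64 {
    let start = i * ENTRY_SIZE;
    let mut id = [0u8; 8];
    id.copy_from_slice(&index[start..start + 8]);
    u64::from_be_bytes(id)
}

/// Binary-searches the sorted index for `needle_id` and overwrites its
/// size field with the tombstone value. Returns false if not found.
fn mark_needle_deleted(index: &mut [u8], needle_id: u64) -> bool {
    let (mut lo, mut hi) = (0usize, index.len() / ENTRY_SIZE);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let id = entry_id(index, mid);
        if id == needle_id {
            let at = mid * ENTRY_SIZE + SIZE_FIELD_OFFSET;
            index[at..at + 4].copy_from_slice(&TOMBSTONE_FILE_SIZE.to_be_bytes());
            return true;
        } else if id < needle_id {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    false
}

/// Builds one index entry; used here only to construct sample data.
fn make_entry(id: u64, offset: u32, size: i32) -> [u8; ENTRY_SIZE] {
    let mut e = [0u8; ENTRY_SIZE];
    e[..8].copy_from_slice(&id.to_be_bytes());
    e[8..12].copy_from_slice(&offset.to_be_bytes());
    e[12..].copy_from_slice(&size.to_be_bytes());
    e
}
```

Because only the size field is rewritten, the entry count and sort order stay unchanged, so later binary searches still work and see the needle as deleted.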
* Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14
Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by
default, but custom EC configs via .vif can have more shards (up to
MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files
are checked regardless of EC configuration.
* Match Go EC locate: subtract 1 from shard size and use datFileSize override
Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to
LocateData (shards are padded, -1 avoids overcounting large block
rows). When datFileSize is known, Go uses datFileSize/DataShards
instead. Rust was passing the raw shard file size without adjustment.
* Fix TTL parsing and DiskStatus field names to match Go exactly
TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d,
not 1w) and errors on count > 255. The previous normalization change
was incorrect — Go only normalizes internally via fitTtlCount, not
during string parsing.
DiskStatus: Go uses encoding/json on protobuf structs, which reads
the json struct tags (snake_case: percent_free, percent_used,
disk_type), not the protobuf JSON names (camelCase). Revert to
snake_case to match Go's actual output.
* Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first
Match Go's volume_grpc_client_to_master.go behavior:
1. Only trigger leader redirect when the leader address differs from the
current master (prevents unnecessary reconnect loops when master confirms
its own address).
2. Process duplicated_uuids before leader redirect check, matching Go's
ordering where duplicate UUID detection takes priority.
* Remove SetState version check to match Go behavior
Go's SetState unconditionally applies the state without any version
mismatch check. The Rust version had an extra optimistic concurrency
check that would reject valid requests from Go clients that don't
track versions.
* Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL
Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes
to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w,
24h->1d). The Rust version was preserving the original unit, producing
different binary encodings on disk and in heartbeat messages.
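The normalization examples above (120m->2h, 7d->1w, 24h->1d) can be reproduced with a simplified sketch. The enum, the 30-day month and 365-day year, and returning `None` when no unit divides the duration evenly are all assumptions; this is not a port of Go's fitTtlCount.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TtlUnit {
    Empty,
    Minute,
    Hour,
    Day,
    Week,
    Month,
    Year,
}

/// Seconds per unit (month and year lengths are assumed values).
fn unit_seconds(unit: TtlUnit) -> u64 {
    match unit {
        TtlUnit::Empty => 0,
        TtlUnit::Minute => 60,
        TtlUnit::Hour => 3_600,
        TtlUnit::Day => 86_400,
        TtlUnit::Week => 7 * 86_400,
        TtlUnit::Month => 30 * 86_400,
        TtlUnit::Year => 365 * 86_400,
    }
}

const COARSE_TO_FINE: [TtlUnit; 6] = [
    TtlUnit::Year,
    TtlUnit::Month,
    TtlUnit::Week,
    TtlUnit::Day,
    TtlUnit::Hour,
    TtlUnit::Minute,
];

/// Converts to seconds, then picks the coarsest unit that divides the
/// duration evenly and whose count fits in one byte.
fn fit_ttl_count(count: u32, unit: TtlUnit) -> Option<(u8, TtlUnit)> {
    let seconds = u64::from(count) * unit_seconds(unit);
    if seconds == 0 {
        return Some((0, TtlUnit::Empty));
    }
    for &candidate in COARSE_TO_FINE.iter() {
        let per_unit = unit_seconds(candidate);
        if seconds % per_unit == 0 && seconds / per_unit <= 255 {
            return Some(((seconds / per_unit) as u8, candidate));
        }
    }
    None // no exact fit; handling of this case is outside the sketch
}
```

Going through seconds first is what makes equal durations encode identically regardless of the unit they were written in.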
* Always return Content-MD5 header and JSON field on successful writes
Go always sets Content-MD5 in the response regardless of whether the
request included it. The Rust version was conditionally including it
only when the request provided Content-MD5.
* Include name and size in UploadResult JSON even when empty/zero
Go's encoding/json always includes empty strings and zero values in
the upload response. The Rust version was using skip_serializing_if
to omit them, causing JSON structure differences.
* Include deleted needles in scan_raw_needles_from to match Go
Go's ScanVolumeFileFrom visits ALL needles including deleted ones.
Skipping deleted entries during incremental copy would cause tombstones
to not be propagated, making deleted files reappear on the receiving side.
* Match Go NeedleMap.Delete: always write tombstone to idx file
Go's NeedleMap.Delete unconditionally writes a tombstone entry to the
idx file and updates metrics, even if the needle doesn't exist or is
already deleted. This is important for replication where every delete
operation must produce an idx write. The Rust version was skipping the
tombstone write for non-existent or already-deleted needles.
* Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest
* Title-case Seaweed-* pair keys to match Go HTTP header canonicalization
* Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType
* Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw()
* Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount
* Include disk-space-low in Volume.is_read_only() matching Go
* Log error on CIDR parse failure in whitelist matching Go's glog.Errorf
* Log cookie mismatch in gRPC Query matching Go's V(0).Infof
* Fix is_expired volume_size comparison to use < matching Go
Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than),
but Rust used `<=`. This meant Rust would fail to expire a volume that
is exactly SUPER_BLOCK_SIZE bytes.
* Apply Go's JWT expiry defaults: 10s write, 60s read
Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and
v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust
defaulted to 0 for both, which meant tokens would never expire when
security.toml has a signing key but omits expires_after_seconds.
* Stop [grpc.volume].ca from overriding [grpc].ca matching Go
Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e.
the [grpc] section. The [grpc.volume] section only provides cert and
key. Rust was also reading ca from [grpc.volume] which would silently
override the [grpc].ca value when both were present.
* Fix free_volume_count to use EC shard count matching Go
Rust was counting EC volumes instead of EC shards, which underestimated
EC space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1.
Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount.
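The formula above, written out as a small function. `DATA_SHARDS_COUNT = 10` is assumed here (consistent with 14 shards costing ~1.4 slots); the function and parameter names are illustrative.

```rust
/// Assumed default number of data shards per EC volume.
const DATA_SHARDS_COUNT: i64 = 10;

/// Free volume slots, charging each EC shard 1/DATA_SHARDS_COUNT of a slot.
/// Integer division truncates toward zero.
fn free_volume_count(max_volume_count: i64, volume_count: i64, ec_shard_count: i64) -> i64 {
    ((max_volume_count - volume_count) * DATA_SHARDS_COUNT - ec_shard_count) / DATA_SHARDS_COUNT
}
```

For example, with a maximum of 10 volumes, 2 regular volumes, and 14 EC shards, the result is (8 * 10 - 14) / 10 = 6 free slots.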
* Include preallocate in compaction space check matching Go
Go uses max(preallocate, estimatedCompactSize) for the free space check.
Rust was only using the estimated volume size, which could start a
compaction that fails mid-way if preallocate exceeds the volume size.
* Check gzip magic bytes before setting Content-Encoding matching Go
Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent
(data starts with 0x1f 0x8b) before setting Content-Encoding: gzip.
Rust only checked Accept-Encoding, which could incorrectly declare
gzip encoding for data that is not gzip-compressed.
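A minimal sketch of the combined check. The function names are illustrative, and the plain substring match on Accept-Encoding mirrors the description above rather than full header parsing.

```rust
/// True when the payload starts with the gzip magic bytes 0x1f 0x8b.
fn is_gzipped_content(data: &[u8]) -> bool {
    data.starts_with(&[0x1f, 0x8b])
}

/// Content-Encoding: gzip is only declared when the client accepts gzip
/// AND the stored bytes actually look like gzip.
fn should_declare_gzip(accept_encoding: &str, data: &[u8]) -> bool {
    accept_encoding.contains("gzip") && is_gzipped_content(data)
}
```

A client that sends `Accept-Encoding: gzip` but requests uncompressed bytes then receives no Content-Encoding header.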
* Only set upload response name when needle HasName matching Go
Go checks reqNeedle.HasName() before setting ret.Name. Rust always set
the name from the filename variable, which could return the fid portion
of the path as the name for raw PUT requests without a filename.
* Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation
Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount
is 0, treating it as unlimited. Rust was computing effective_free as
<= 0 for max==0, rejecting the location. This could fail volume
creation during early startup before the first heartbeat adjusts max.
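The early return can be sketched as below. This is a simplified stand-in: the parameter names are hypothetical, and any other conditions the real predicate applies (such as disk space checks) are left out.

```rust
/// Simplified free-slot predicate: max_volume_count == 0 means "unlimited".
fn has_free_disk_location(max_volume_count: i64, used_slots: i64) -> bool {
    if max_volume_count == 0 {
        // Treat 0 as unlimited instead of computing a non-positive
        // effective free count and rejecting the location.
        return true;
    }
    max_volume_count - used_slots > 0
}
```

Without the early return, `max_volume_count == 0` would always yield a free count of zero or less, which is the rejection this commit removes.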
* Read lastAppendAtNs from deleted V3 entries in integrity check
Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries
(verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity).
Rust was skipping deleted entries, which could result in a stale
last_append_at_ns if the last index entry is a deletion.
* Return empty body for empty/oversized range requests matching Go
Go's ProcessRangeRequest returns nil (empty body, 200 OK) when
parsed ranges are empty or combined range size exceeds total content
size. The Rust buffered path incorrectly returned the full file data
for both cases. The streaming path already handled this correctly.
* Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL
Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex()
(ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls
vs.store.ScrubEcVolume(). Rust was ignoring the mode and always
running verify_ec_shards. Now INDEX mode checks ecx index integrity
(sorted overlap detection + file size validation) without shard I/O,
while LOCAL/FULL modes run the existing shard verification.
* Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount
Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit
that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week}
which displays as "1w". Both Go and Rust normalize identically.
* Add version mismatch check to SetState matching Go's State.Update
Go's State.Update compares the incoming version with the stored
version and returns "version mismatch" error if they differ. This
provides optimistic concurrency control. The Rust implementation
was accepting any version unconditionally.
* Use unquoted keys in Query JSON output matching Go's json.ToJson
Go's json.ToJson produces records with unquoted keys like
{score:12} not {"score":12}. This is a custom format used
internally by SeaweedFS for query results.
* Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w
Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount
which converts 7 days to 1 week.
* Include ETag header in 304 Not Modified responses matching Go behavior
Go sets ETag on the response writer (via SetEtag) before the
If-Modified-Since and If-None-Match conditional checks, so both
304 response paths include the ETag header. The Rust implementation
was only adding ETag to 200 responses.
* Remove needle-name fallback in chunk manifest filename resolution
Go's tryHandleChunkedFile only falls back from URL filename to
manifest name. Rust had an extra fallback to needle.name that
Go does not perform, which could produce different
Content-Disposition filenames for chunk manifests.
* Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5
Go's jwt.ParseWithClaims validates the nbf claim when present,
rejecting tokens whose nbf is in the future. The Rust jsonwebtoken
crate defaults validate_nbf to false, so tokens with future nbf
were incorrectly accepted.
* Set isHeartbeating to true at startup matching Go's VolumeServer init
Go unconditionally sets isHeartbeating: true in the VolumeServer
struct literal. Rust was starting with false when masters are
configured, causing /healthz to return 503 until the first
heartbeat succeeds.
* Call store.close() on shutdown matching Go's Shutdown()
Go's Shutdown() calls vs.store.Close() which closes all volumes
and flushes file handles. The Rust server was relying on process
exit for cleanup, which could leave data unflushed.
* Include server ID in maintenance mode error matching Go's format
Go returns "volume server %s is in maintenance mode" with the
store ID. Rust was returning a generic "maintenance mode" message.
* Fix DiskType test: use HardDrive variant matching Go's HddType=""
Go maps both "" and "hdd" to HardDriveType (empty string). The
Rust enum variant is HardDrive, not Hdd. The test referenced a
nonexistent Hdd variant causing compilation failure.
* Do not include ETag in 304 responses matching Go's GetOrHeadHandler
Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match
304 return paths, so Go's 304 responses do not include the ETag header.
The Rust code was incorrectly including ETag in both 304 response paths.
* Return 400 on malformed query strings in PostHandler matching Go's ParseForm
Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when
the query string is malformed. Rust was silently falling back to empty
query params via unwrap_or_default().
* Load EC volume version from .vif matching Go's NewEcVolume
Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif
file. Rust was always using Version::current() (V3), which would produce
wrong needle actual size calculations for volumes created with V1 or V2.
* Sync .ecx file before close matching Go's EcVolume.Close
Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion
marks are flushed to disk. Without this, deletion marks written via
MarkNeedleDeleted could be lost on crash.
* Validate SuperBlock extra data size matching Go's Bytes() guard
Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent
corrupt super block headers. Rust was silently truncating via u16 cast,
which would write an incorrect extra_size field.
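A sketch of the guard, assuming the limit 256*256-2 from the description. Returning an `Err` (rather than aborting, as Go's glog.Fatalf does), the big-endian encoding, and the function name are assumptions for illustration.

```rust
/// Largest extra-data length accepted by the guard (256*256-2 = 65534).
const MAX_EXTRA_SIZE: usize = 256 * 256 - 2;

/// Encodes the extra-data length as two bytes, refusing lengths that a
/// bare `as u16` cast would silently truncate.
fn encode_extra_size(extra_len: usize) -> Result<[u8; 2], String> {
    if extra_len > MAX_EXTRA_SIZE {
        return Err(format!(
            "super block extra data too large: {} > {}",
            extra_len, MAX_EXTRA_SIZE
        ));
    }
    // Safe: extra_len <= 65534 fits in u16. Byte order is an assumption.
    Ok((extra_len as u16).to_be_bytes())
}
```

For comparison, `70000usize as u16` wraps to 4464, which is the kind of incorrect extra_size value the unchecked cast would have written.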
* Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98
Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable
to unauthenticated remote DoS via panic in QUIC transport parameter
parsing.
* Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server
Go's r.FormValue() cannot read multipart text fields after
r.MultipartReader() consumes the body, so ts/ttl sent as multipart
form fields only work with the Rust volume server. Skip this test
when VOLUME_SERVER_IMPL != "rust" to fix CI failure.
* Flush .ecx in EC volume sync_to_disk matching Go's Sync()
Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index
to disk. The Rust version only flushed .ecj, leaving in-place deletion
marks in .ecx unpersisted until close(). This could cause data
inconsistency if the server crashes after marking a needle deleted in
.ecx but before close().
* Remove .vif file in EC volume destroy matching Go's Destroy()
Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust
version only removed .ecx and .ecj, leaving orphaned .vif files on
disk after EC volume destruction (e.g., after TTL expiry).
* Fix is_expired to use <= for SuperBlockSize check matching Go
Go checks contentSize <= SuperBlockSize to detect empty volumes (no
needles). Rust used < which would incorrectly allow a volume with
exactly SuperBlockSize bytes (header only, no data) to proceed to
the TTL expiry check and potentially be marked as expired.
* Fix read_append_at_ns to read timestamps from tombstone entries
Go reads the full needle body for all entries including tombstones
(deleted needles with size=0) to extract the actual AppendAtNs
timestamp. The Rust version returned 0 early for size <= 0 entries,
which would cause the binary search in incremental copy to produce
incorrect results for positions containing deleted needles.
Now uses get_actual_size to compute the on-disk size (which handles
tombstones correctly) and only returns 0 when the actual size is 0.
* Add X-Request-Id response header matching Go's requestIDMiddleware
Go sets both X-Request-Id and x-amz-request-id response headers.
The Rust server only set x-amz-request-id, missing X-Request-Id.
* Add skip_serializing_if for UploadResult name and size fields
Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty",
omitting these fields from JSON when they are zero values (empty
string / 0). The Rust struct always serialized them, producing
"name":"" and "size":0 where Go would omit them.
* Support JSONP/pretty-print for write success responses
Go's writeJsonQuiet checks for callback (JSONP) and pretty query
parameters on all JSON responses including write success. The Rust
write success path used axum::Json directly, bypassing JSONP and
pretty-print support. Now uses json_result_with_query to match Go.
* Include actual limit in file size limit error message
Go returns "file over the limited %d bytes" with the actual limit
value included. Rust returned a generic "file size limit exceeded"
without the limit value, making it harder to debug.
* Extract extension from 2-segment URL paths for image operations
Go's parseURLPath extracts the file extension from all URL formats
including 2-segment paths like /vid,fid.jpg. The Rust version only
handled 3-segment paths (/vid/fid/filename.ext), so extensions in
2-segment paths were lost. This caused image resize/crop operations
requested via query params to be silently skipped for those paths.
* Add size_hint to TrackedBody so throttled downloads get Content-Length
TrackedBody (used for download throttling) did not implement
size_hint(), causing HTTP/1.1 to fall back to chunked transfer
encoding instead of setting Content-Length. Go always sets
Content-Length explicitly for non-range responses.
* Add Last-Modified, pairs, and S3 headers to chunk manifest responses
Go sets Last-Modified, needle pairs, and S3 pass-through headers on
the response writer BEFORE calling tryHandleChunkedFile. Since the
Rust chunk manifest handler created fresh response headers and
returned early, these headers were missing from chunk manifest
responses. Now passes last_modified_str into the chunk manifest
handler and applies pairs and S3 pass-through query params
(response-cache-control, response-content-encoding, etc.) to the
chunk manifest response headers.
* Fix multipart fallback to use first part data when no filename
Go reads the first part's data unconditionally, then looks for a
part with a filename. If none found, Go uses the first part's data
(with empty filename). Rust only captured parts with filenames, so
when no part had a filename it fell back to the raw multipart body
bytes (including boundary delimiters), producing corrupt needle data.
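The fallback described above can be sketched as follows. This is an illustrative model only, not the Go or Rust implementation: `pick_upload_part` and the `(filename, data)` tuple representation are hypothetical names introduced for the example.

```python
def pick_upload_part(parts):
    """Choose which multipart part becomes the needle data.

    `parts` is a list of (filename, data) tuples in arrival order; this
    representation is an assumption made for the sketch.
    """
    if not parts:
        return None
    # Prefer the first part that carries a filename.
    for filename, data in parts:
        if filename:
            return filename, data
    # No part has a filename: fall back to the first part's data with an
    # empty filename, never to the raw multipart body.
    return "", parts[0][1]
```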
* Set HasName and HasMime flags for empty values matching Go
Go's CreateNeedleFromRequest sets HasName and HasMime flags even when
the filename or MIME type is empty (len < 256 is true for len 0).
Rust skipped empty values, causing the on-disk needle format to
differ: Go-written needles include extra bytes for the empty name/mime
size fields, changing the serialized needle size in the idx entry.
This ensures binary format compatibility between Go and Rust servers.
* Add is_stopping guard to vacuum_volume_commit matching Go
Go's CommitCompactVolume (store_vacuum.go L53-54) checks
s.isStopping before committing compaction to prevent file
swaps during shutdown. The Rust handler was missing this
check, which could allow compaction commits while the
server is stopping.
* Remove disk_type from required status fields since Go omits it
Go's default DiskType is "" (HardDriveType), and protobuf's omitempty
tag causes empty strings to be dropped from JSON output.
* test: honor rust env in dual volume harness
* grpc: notify master after volume lifecycle changes
* http: proxy to replicas before download-limit timeout
* test: pass readMode to rust volume harnesses
* fix store free-location predicate selection
* fix volume copy disk placement and heartbeat notification
* fix chunk manifest delete replication
* fix write replication to survive client disconnects
* fix download limit proxy and wait flow
* fix crop gating for streamed reads
* fix upload limit wait counter behavior
* fix chunk manifest image transforms
* fix has_resize_ops to check width/height > 0 instead of is_some()
Go's shouldResizeImages condition is `width > 0 || height > 0`, so
`?width=0` correctly evaluates to false. Rust was using `is_some()`
which made `?width=0` evaluate to true, unnecessarily disabling
streaming reads for those requests.
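A minimal sketch of the corrected predicate, written in Python for illustration. The function name mirrors the commit title; treating a missing or non-numeric value as 0 is an assumption of this sketch, not something stated above.

```python
from urllib.parse import parse_qs


def has_resize_ops(query: str) -> bool:
    """Return True only when width or height is a positive integer."""
    params = parse_qs(query)

    def dimension(name: str) -> int:
        # Assumption: missing or unparsable values count as 0.
        try:
            return int(params.get(name, ["0"])[0])
        except ValueError:
            return 0

    # Mirrors the condition `width > 0 || height > 0`; mere presence of
    # the parameter (the old is_some() behavior) is not enough.
    return dimension("width") > 0 or dimension("height") > 0
```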
* fix Content-MD5 to only compute and return when provided by client
Go only computes the MD5 of uncompressed data when a Content-MD5
header or multipart field is provided. Rust was always computing and
returning it. Also fix the mismatch error message to include size,
matching Go's format.
* fix save_vif to compute ExpireAtSec from TTL
Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds
when the volume has a TTL. The save_vif path (used by set_read_only
and set_writable) was missing this computation, causing .vif files
to be written without the correct expiration timestamp for TTL volumes.
* fix set_writable to not modify no_write_can_delete
Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists.
Rust was additionally setting no_write_can_delete=has_remote_file,
which could incorrectly change the write mode for remote-file volumes
when the master explicitly asks to make the volume writable.
* fix write_needle_blob_and_index to error on too-small V3 blob
Go returns an error when the needle blob is too small for timestamp
patching. Rust was silently skipping the patch and writing the blob
with a stale/zero timestamp, which could cause data integrity issues
during incremental replication that relies on AppendAtNs ordering.
* fix VolumeEcShardsToVolume to validate dataShards range
Go validates that dataShards is > 0 and <= MaxShardCount before
proceeding with EC-to-volume reconstruction. Without this check,
a zero or excessively large data_shards value could cause confusing
downstream failures.
* fix destroy to use VolumeError::NotEmpty instead of generic Io error
The dedicated NotEmpty variant exists in the enum but was not being
used. This makes error matching consistent with Go's ErrVolumeNotEmpty.
* fix SetState to persist state to disk with rollback on failure
Go's State.Update saves VolumeServerState to a state.pb file after
each SetState call, and rolls back the in-memory state if persistence
fails. Rust was only updating in-memory atomics, so maintenance mode
would be lost on server restart. Now saves protobuf-encoded state.pb
and loads it on startup.
* fix VolumeTierMoveDatToRemote to close local dat backend after upload
Go calls v.LoadRemoteFile() after saving volume info, which closes
the local DataBackend before transitioning to remote storage. Without
this, the volume holds a stale file handle to the deleted local .dat
file, causing reads to fail until server restart.
* fix VolumeTierMoveDatFromRemote to close remote dat backend after download
Go calls v.DataBackend.Close() and sets DataBackend=nil after removing
the remote file reference. Without this, the stale remote backend
state lingers and reads may not discover the newly downloaded local
.dat file until server restart.
* fix redirect to use internal url instead of public_url
Go's proxyReqToTargetServer builds the redirect Location header from
loc.Url (the internal URL), not publicUrl. Using public_url could
cause redirect failures when internal and external URLs differ.
* fix redirect test and add state_file_path to integration test
Update redirect unit test to expect internal url (matching the
previous fix). Add missing state_file_path field to the integration
test VolumeServerState constructor.
* fix FetchAndWriteNeedle to await all writes before checking errors
Go uses a WaitGroup to await all writes (local + replicas) before
checking errors. Rust was short-circuiting on local write failure,
which could leave replica writes in-flight without waiting for
completion.
* fix shutdown to send deregister heartbeat before pre_stop delay
Go's StopHeartbeat() closes stopChan immediately on interrupt, causing
the heartbeat goroutine to send the deregister heartbeat right away,
before the preStopSeconds delay. Rust was only setting is_stopping=true
without waking the heartbeat loop, so the deregister was delayed until
after the pre_stop sleep. Now we call volume_state_notify.notify_one()
to wake the heartbeat immediately.
* fix heartbeat response ordering to check duplicate UUIDs first
Go processes heartbeat responses in this order: DuplicatedUuids first,
then volume options (prealloc/size limit), then leader redirect. Rust
was applying volume options before checking for duplicate UUIDs, which
meant volume option changes would take effect even when the response
contained a duplicate UUID error that should cause an immediate return.
* the test thread was blocked
* fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories
Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which
fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode
bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check
logic error).
* fix: match Go Content-MD5 mismatch error message format
Go uses "Content-MD5 did not match md5 of file data expected [X]
received [Y] size Z" while Rust had a shorter format. Match the
exact Go error string so clients see identical messages.
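As a rough illustration of the check and message shape, here is a Python sketch. It assumes the client value is the base64-encoded MD5 digest, that X is the client-supplied value and Y the computed one, and that the check is skipped when no value is provided; `check_content_md5` is a hypothetical helper name.

```python
import base64
import hashlib
from typing import Optional


def check_content_md5(data: bytes, client_md5: str) -> Optional[str]:
    """Return an error message on mismatch, or None when the check passes."""
    if not client_md5:
        # No Content-MD5 supplied: nothing is computed or compared.
        return None
    computed = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    if computed == client_md5:
        return None
    # Assumed mapping: expected = client value, received = computed value.
    return (
        "Content-MD5 did not match md5 of file data "
        f"expected [{client_md5}] received [{computed}] size {len(data)}"
    )
```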
* fix: match Go Bearer token length check (> 7, not >= 7)
Go requires len(bearer) > 7 ensuring at least one char after
"Bearer ". Rust used >= 7 which would accept an empty token.
* fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4
aws-sdk-s3's default "rustls" feature enables tls-rustls in
aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21
→ rustls-webpki 0.101.7, moderate CRL advisory). Replace with
explicit default-https-client which uses only rustls 0.23 /
rustls-webpki 0.103.9.
* fix: use uploaded filename for auto-compression extension detection
Go extracts the file extension from pu.FileName (the uploaded
filename) for auto-compression decisions. Rust was using the URL
path, which typically has no extension for SeaweedFS file IDs.
* fix: add CRC legacy Value() backward-compat check on needle read
Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) !=
crc.Value(). The Value() path is a deprecated transform for compat
with seaweed versions prior to commit 056c480eb. Rust had the
legacy_value() method but wasn't using it in validation.
* fix: remove /stats/* endpoints to match Go (commented out since L130)
Go's volume_server.go has the /stats/counter, /stats/memory, and
/stats/disk endpoints commented out (lines 130-134). Remove them
from the Rust router along with the now-unused whitelist_guard
middleware.
* fix: filter application/octet-stream MIME for chunk manifests
Go's tryHandleChunkedFile (L334) filters out application/octet-stream
from chunk manifest MIME types, falling back to extension-based
detection. Rust was returning the stored MIME as-is for manifests.
* fix: VolumeMarkWritable returns error before notifying master
Go returns early at L200 if MarkVolumeWritable fails, before
reaching the master notification at L206. Rust was notifying master
even on failure, creating inconsistent state where master thinks
the volume is writable but local marking failed.
* fix: check volume existence before maintenance in MarkReadonly/Writable
Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255)
look up the volume first, then call makeVolumeReadonly/Writable which
checks maintenance. Rust was checking maintenance first, returning
"maintenance mode" instead of "not found" for missing volumes.
* feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360)
Add the mark_broken_volumes_readonly flag from PR #8360:
- Sync proto field (tag 3) to local volume_server.proto
- After scrubbing, if flag is set, call makeVolumeReadonly on each
broken volume (notify master, mark local readonly, notify again)
- Collect errors via joined error semantics matching Go's errors.Join
- Factor out make_volume_readonly helper reused by both
VolumeMarkReadonly and ScrubVolume
Also refactors VolumeMarkReadonly to use the shared helper.
* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4)
CRL Distribution Point matching logic fix for moderate severity
advisory about CRLs not considered authoritative.
* test: update integration tests for removed /stats/* endpoints
Replace tests that expected /stats/* routes to return 200/401 with
tests confirming they now fall through to the store handler (400),
matching Go's commented-out stats endpoints.
* docs: fix misleading comment about default offset feature
The comment said "4-byte offsets unless explicitly built with 5-byte
support" but the default feature enables 5bytes. This is intentional
for production parity with Go -tags 5BytesOffset builds. Fix the
comment to match reality.
SeaweedFS
Sponsor SeaweedFS via Patreon
SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome backers. If you'd like to grow SeaweedFS even stronger, please consider joining our sponsors on Patreon.
Your support will be really appreciated by me and other supporters!
- Download Binaries for different platforms
- SeaweedFS on Slack
- SeaweedFS on Twitter
- SeaweedFS on Telegram
- SeaweedFS on Reddit
- SeaweedFS Mailing List
- Wiki Documentation
- SeaweedFS White Paper
- SeaweedFS Introduction Slides 2025.5
- SeaweedFS Introduction Slides 2021.5
- SeaweedFS Introduction Slides 2019.3
Table of Contents
- Quick Start
- Introduction
- Features
- Example: Using Seaweed Object Store
- Architecture
- Compared to Other File Systems
- Dev Plan
- Installation Guide
- Disk Related Topics
- Benchmark
- Enterprise
- License
Quick Start
Quick Start with weed mini
The easiest way to get started with SeaweedFS for development and testing:
- Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip a single binary file, weed or weed.exe.
Example:
# remove quarantine on macOS
# xattr -d com.apple.quarantine ./weed
./weed mini -dir=/data
This single command starts a complete SeaweedFS setup with:
- Master UI: http://localhost:9333
- Volume Server: http://localhost:9340
- Filer UI: http://localhost:8888
- S3 Endpoint: http://localhost:8333
- WebDAV: http://localhost:7333
- Admin UI: http://localhost:23646
Perfect for development, testing, learning SeaweedFS, and single node deployments!
Quick Start for S3 API on Docker
docker run -p 8333:8333 chrislusf/seaweedfs server -s3
Quick Start with Single Binary
- Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip a single binary file, weed or weed.exe. Or run go install github.com/seaweedfs/seaweedfs/weed@latest.
- export AWS_ACCESS_KEY_ID=admin ; export AWS_SECRET_ACCESS_KEY=key as the admin credentials to access the object store.
- Run weed server -dir=/some/data/dir -s3 to start one master, one volume server, one filer, and one S3 gateway. The difference from weed mini is that weed mini can auto-configure based on the single host environment, while weed server requires manual configuration and is designed for production use.
Also, to increase capacity, just add more volume servers by running weed volume -dir="/some/data/dir2" -master="<master_host>:9333" -port=8081 locally, or on a different machine, or on thousands of machines. That is it!
Introduction
SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:
- to store billions of files!
- to serve the files fast!
SeaweedFS started as a blob store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages volumes on volume servers, and these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (O(1), usually just one disk read operation).
There is only 40 bytes of disk storage overhead for each file's metadata. It is so simple with O(1) disk reads that you are welcome to challenge the performance with your actual use cases.
SeaweedFS started by implementing Facebook's Haystack design paper. SeaweedFS also implements erasure coding with ideas from f4: Facebook’s Warm BLOB Storage System, and has a lot of similarities with Facebook’s Tectonic Filesystem and Google's Colossus File System.
On top of the blob store, optional Filer can support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, Sqlite, MemSql, TiDB, Etcd, CockroachDB, YDB, etc.
SeaweedFS can transparently integrate with the cloud. With hot data on local cluster, and warm data on the cloud with O(1) access time, SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. What's more, the cloud storage access API cost is minimized. Faster and cheaper than direct cloud storage!
Features
Additional Blob Store Features
- Support different replication levels, with rack and data center awareness.
- Automatic master servers failover - no single point of failure (SPOF).
- Automatic compression depending on file MIME type.
- Automatic compaction to reclaim disk space after deletion or update.
- Automatic entry TTL expiration.
- Flexible Capacity Expansion: Any server with some disk space can add to the total storage space.
- Adding/Removing servers does not cause any data re-balancing unless triggered by admin commands.
- Optional picture resizing.
- Support ETag, Accept-Range, Last-Modified, etc.
- Support in-memory/leveldb/readonly mode tuning for memory/performance balance.
- Support rebalancing the writable and readonly volumes.
- Customizable Multiple Storage Tiers: Customizable storage disk types to balance performance and cost.
- Transparent cloud integration: unlimited capacity via tiered cloud storage for warm data.
- Erasure Coding for warm storage: Rack-aware 10.4 erasure coding reduces storage cost and increases availability. The Enterprise version can customize the EC ratio.
Filer Features
- Filer server provides "normal" directories and files via HTTP.
- File TTL automatically expires file metadata and actual file data.
- Mount filer reads and writes files directly as a local directory via FUSE.
- Filer Store Replication enables HA for filer meta data stores.
- Active-Active Replication enables asynchronous one-way or two-way cross cluster continuous replication.
- Amazon S3 compatible API accesses files with S3 tooling.
- Hadoop Compatible File System accesses files from Hadoop/Spark/Flink/etc or even runs HBase.
- Async Replication To Cloud has extremely fast local access and backups to Amazon S3, Google Cloud Storage, Azure, BackBlaze.
- WebDAV accesses as a mapped drive on Mac and Windows, or from mobile devices.
- AES256-GCM Encrypted Storage safely stores the encrypted data.
- Super Large Files stores large or super large files in tens of TB.
- Cloud Drive mounts cloud storage to local cluster, cached for fast read and write with asynchronous write back.
- Gateway to Remote Object Store mirrors bucket operations to remote object storage, in addition to Cloud Drive.
Kubernetes
- Kubernetes CSI Driver A Container Storage Interface (CSI) Driver.
- SeaweedFS Operator
Example: Using Seaweed Blob Store
By default, the master node runs on port 9333, and the volume nodes run on port 8080. Let's start one master node, and two volume nodes on port 8080 and 8081. Ideally, they should be started from different machines. We'll use localhost as an example.
SeaweedFS uses HTTP REST operations to read, write, and delete. The responses are in JSON or JSONP format.
Start Master Server
> ./weed master
Start Volume Servers
> weed volume -dir="/tmp/data1" -max=5 -master="localhost:9333" -port=8080 &
> weed volume -dir="/tmp/data2" -max=10 -master="localhost:9333" -port=8081 &
Write A Blob
A blob, also referred to as a needle, a chunk, or mistakenly as a file, is just a byte array. It can have attributes, such as name, MIME type, create or update time, etc. But basically it is just a byte array of a relatively small size, such as 2 MB ~ 64 MB. The size is not fixed.
To upload a blob: first, send an HTTP POST, PUT, or GET request to /dir/assign to get an fid and a volume server URL:
> curl http://localhost:9333/dir/assign
{"count":1,"fid":"3,01637037d6","url":"127.0.0.1:8080","publicUrl":"localhost:8080"}
Second, to store the blob content, send an HTTP multi-part POST request to url + '/' + fid from the response:
> curl -F file=@/home/chris/myphoto.jpg http://127.0.0.1:8080/3,01637037d6
{"name":"myphoto.jpg","size":43234,"eTag":"1cc0118e"}
To update, send another POST request with updated blob content.
For deletion, send an HTTP DELETE request to the same url + '/' + fid URL:
> curl -X DELETE http://127.0.0.1:8080/3,01637037d6
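The two-step write can be sketched in Python as below. This only shows how the upload (and delete) URL is derived from the /dir/assign reply; `upload_url` is a hypothetical helper, and the plain `http://` scheme is an assumption taken from the curl examples above.

```python
import json


def upload_url(assign_response: str) -> str:
    """Build url + '/' + fid from the JSON body returned by /dir/assign."""
    reply = json.loads(assign_response)
    # Assumption: the volume server is reached over plain HTTP.
    return "http://" + reply["url"] + "/" + reply["fid"]


sample = (
    '{"count":1,"fid":"3,01637037d6",'
    '"url":"127.0.0.1:8080","publicUrl":"localhost:8080"}'
)
print(upload_url(sample))
```

The same URL serves for POST (write or update) and DELETE.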
Save Blob Id
Now, you can save the fid, 3,01637037d6 in this case, to a database field.
The number 3 at the start represents a volume id. After the comma, it's one file key, 01, and a file cookie, 637037d6.
The volume id is an unsigned 32-bit integer. The file key is an unsigned 64-bit integer. The file cookie is an unsigned 32-bit integer, used to prevent URL guessing.
The file key and file cookie are both coded in hex. You can store the <volume id, file key, file cookie> tuple in your own format, or simply store the fid as a string.
If stored as a string, in theory, you would need 8+1+16+8=33 bytes. A char(33) would be enough, if not more than enough, since most uses will not need 2^32 volumes.
If space is really a concern, you can store the file id in the binary format. You would need one 4-byte integer for volume id, 8-byte long number for file key, and a 4-byte integer for the file cookie. So 16 bytes are more than enough.
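As a sketch of both storage options, the following Python splits an fid into its three parts and packs them into 16 bytes. The helper names are hypothetical; reading the volume id as decimal, taking the last 8 hex digits as the cookie, and using big-endian byte order for the binary form are assumptions of this example.

```python
import struct


def parse_fid(fid: str):
    """Split an fid such as '3,01637037d6' into (volume id, key, cookie)."""
    volume_part, key_cookie = fid.split(",", 1)
    if len(key_cookie) <= 8:
        raise ValueError("fid needs a file key followed by an 8-digit cookie")
    # Assumption: the cookie is always the trailing 8 hex digits (32 bits).
    cookie = int(key_cookie[-8:], 16)
    key = int(key_cookie[:-8], 16)
    return int(volume_part), key, cookie


def pack_fid(fid: str) -> bytes:
    """Pack as 4-byte volume id + 8-byte key + 4-byte cookie = 16 bytes."""
    # Assumption: big-endian layout; any fixed layout works for your own DB.
    return struct.pack(">IQI", *parse_fid(fid))
```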
Read a Blob
Here is an example of how to render the URL.
First look up the volume server's URLs by the file's volumeId:
> curl http://localhost:9333/dir/lookup?volumeId=3
{"volumeId":"3","locations":[{"publicUrl":"localhost:8080","url":"localhost:8080"}]}
Since (usually) there are not too many volume servers, and volumes don't move often, you can cache the results most of the time. Depending on the replication type, one volume can have multiple replica locations. Just randomly pick one location to read.
Now you can take the public URL, render the URL or directly read from the volume server via URL:
http://localhost:8080/3,01637037d6.jpg
Notice we add a file extension ".jpg" here. It's optional and just one way for the client to specify the file content type.
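The lookup-then-read step might look like this in Python. `read_url` is a hypothetical helper that takes the /dir/lookup JSON body; the `http://` scheme and the injectable `rng` argument are assumptions added for the sketch.

```python
import json
import random


def read_url(lookup_response: str, fid: str, extension: str = "", rng=random) -> str:
    """Pick one replica location at random and build the blob URL."""
    locations = json.loads(lookup_response)["locations"]
    # Any replica can serve the read, so a random pick spreads the load.
    location = rng.choice(locations)
    return "http://" + location["publicUrl"] + "/" + fid + extension


lookup = (
    '{"volumeId":"3","locations":'
    '[{"publicUrl":"localhost:8080","url":"localhost:8080"}]}'
)
print(read_url(lookup, "3,01637037d6", ".jpg"))
```

Since volumes rarely move, the parsed `locations` list can be cached per volume id and reused across reads.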
If you want a nicer URL, you can use one of these alternative URL formats:
http://localhost:8080/3/01637037d6/my_preferred_name.jpg
http://localhost:8080/3/01637037d6.jpg
http://localhost:8080/3,01637037d6.jpg
http://localhost:8080/3/01637037d6
http://localhost:8080/3,01637037d6
If you want to get a scaled version of an image, you can add some params:
http://localhost:8080/3/01637037d6.jpg?height=200&width=200
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fit
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fill
Rack-Aware and Data Center-Aware Replication
SeaweedFS applies the replication strategy at a volume level. So, when you are getting a blob id, you can specify the replication strategy. For example:
curl http://localhost:9333/dir/assign?replication=001
The replication parameter options are:
000: no replication
001: replicate once on the same rack
010: replicate once on a different rack, but same data center
100: replicate once on a different data center
200: replicate twice on two different data centers
110: replicate once on a different rack, and once on a different data center
More details about replication can be found on the wiki.
You can also set the default replication strategy when starting the master server.
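One way to read the three-digit codes listed above is digit by digit: other data centers, other racks in the same data center, other servers on the same rack. The Python sketch below follows that reading; the function name, the returned field names, and the "total copies = 1 + sum of digits" interpretation are assumptions inferred from the list, not a documented API.

```python
def parse_replication(code: str) -> dict:
    """Decode a replication string such as '110' into copy counts."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("replication must be exactly three digits")
    other_dc, other_rack, same_rack = (int(digit) for digit in code)
    return {
        "other_data_centers": other_dc,
        "other_racks": other_rack,
        "same_rack": same_rack,
        # Assumption: the original copy plus one per requested replica.
        "total_copies": 1 + other_dc + other_rack + same_rack,
    }
```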
Allocate Blob Key on Specific Data Center
Volume servers can be started with a specific data center name:
weed volume -dir=/tmp/1 -port=8080 -dataCenter=dc1
weed volume -dir=/tmp/2 -port=8081 -dataCenter=dc2
When requesting a blob key, an optional "dataCenter" parameter can limit the assigned volume to the specific data center. For example, this specifies that the assigned volume should be limited to 'dc1':
http://localhost:9333/dir/assign?dataCenter=dc1
Other Features
- No Single Point of Failure
- Insert with your own keys
- Chunking large files
- Collection as a Simple Name Space
Blob Store Architecture
Usually distributed file systems split each file into chunks. A central server keeps a mapping of filenames to chunks, and also which chunks each chunk server has.
The main drawback is that the central server can't handle many small files efficiently, and since all read requests need to go through the central master, it might not scale well for many concurrent users.
Instead of managing chunks, SeaweedFS manages data volumes in the master server. Each data volume is 32GB in size, and can hold a lot of blobs. And each storage node can have many data volumes. So the master node only needs to store the metadata about the volumes, which is a fairly small amount of data and is generally stable.
The actual blob metadata, which are the blob volume, offset, and size, is stored in each volume on volume servers. Since each volume server only manages metadata of blobs on its own disk, with only 16 bytes for each blob, all access can read the metadata just from memory and only needs one disk operation to actually read file data.
For comparison, consider that an xfs inode structure in Linux is 536 bytes.
Master Server and Volume Server
The architecture is fairly simple. The actual data is stored in volumes on storage nodes. One volume server can have multiple volumes, and can both support read and write access with basic authentication.
All volumes are managed by a master server. The master server contains the volume id to volume server mapping. This is fairly static information, and can be easily cached.
On each write request, the master server also generates a file key, which is a growing 64-bit unsigned integer. Since write requests are not generally as frequent as read requests, one master server should be able to handle the concurrency well.
Write and Read files
When a client sends a write request, the master server returns (volume id, file key, file cookie, volume node URL) for the blob. The client then contacts the volume node and POSTs the blob content.
When a client needs to read a blob based on (volume id, file key, file cookie), it asks the master server by the volume id for the (volume node URL, volume node public URL), or retrieves this from a cache. Then the client can GET the content, or just render the URL on web pages and let browsers fetch the content.
Saving memory
All blob metadata stored on a volume server is readable from memory without disk access. Each file takes just a 16-byte map entry of <64bit key, 32bit offset, 32bit size>. Of course, each map entry has its own space cost for the map. But usually the disk space runs out before the memory does.
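To make the 16-byte figure concrete, here is a small Python sketch of one map entry and the resulting raw memory footprint. The big-endian layout and the helper names are assumptions for illustration; the per-entry overhead of the map itself is not counted.

```python
import struct

# <64-bit key, 32-bit offset, 32-bit size>; byte order is an assumption.
ENTRY = struct.Struct(">QII")


def pack_entry(key: int, offset: int, size: int) -> bytes:
    """Serialize one blob's metadata into a fixed 16-byte record."""
    return ENTRY.pack(key, offset, size)


def raw_map_bytes(blob_count: int) -> int:
    """Raw entry bytes for blob_count blobs, excluding map overhead."""
    return blob_count * ENTRY.size
```

By this estimate, one million blobs need about 16 MB of raw entries.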
Tiered Storage to the cloud
The local volume servers are much faster, while cloud storages have elastic capacity and are actually more cost-efficient if not accessed often (usually free to upload, but relatively costly to access). With the append-only structure and O(1) access time, SeaweedFS can take advantage of both local and cloud storage by offloading the warm data to the cloud.
Usually hot data are fresh and warm data are old. SeaweedFS puts the newly created volumes on local servers, and optionally uploads the older volumes to the cloud. If the older data are accessed less often, this effectively gives you unlimited capacity with limited local servers, while staying fast for new data.
With the O(1) access time, the network latency cost is kept at minimum.
If the hot/warm data is split as 20/80, with 20 servers, you can achieve storage capacity of 100 servers. That's a cost saving of 80%! Or you can repurpose the 80 servers to store new data also, and get 5X storage throughput.
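The arithmetic behind these figures can be checked with a short Python sketch. `tiered_capacity` is a hypothetical helper, and it assumes every server holds the same amount of data and that all warm data is offloaded.

```python
def tiered_capacity(local_servers: int, hot_percent: int):
    """Return (equivalent servers, % servers saved, throughput multiple)."""
    # Local servers hold only the hot share, so the total data set is
    # local_servers / (hot_percent / 100) server-equivalents.
    equivalent_servers = local_servers * 100 / hot_percent
    savings_percent = 100 - hot_percent
    # If the saved servers are kept for new data instead, the cluster is
    # this many times larger than the hot tier alone.
    throughput_multiple = 100 / hot_percent
    return equivalent_servers, savings_percent, throughput_multiple


print(tiered_capacity(20, 20))
```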
SeaweedFS Filer
Built on top of the blob store, SeaweedFS Filer adds directory structure to create a file system. The directory structure is an interface that is implemented in many key-value stores or databases.
The content of a file is mapped to one or many blobs, distributed to multiple volumes on multiple volume servers.
Compared to Other File Systems
Most other distributed file systems seem more complicated than necessary.
SeaweedFS is meant to be fast and simple, in both setup and operation. If you do not understand how it works when you reach here, we've failed! Please raise an issue with any questions or update this file with clarifications.
SeaweedFS is constantly moving forward. Same with other systems. These comparisons can be outdated quickly. Please help to keep them updated.
Compared to HDFS
HDFS uses the chunk approach for each file, and is ideal for storing large files.
SeaweedFS is ideal for serving relatively smaller files quickly and concurrently.
SeaweedFS can also store extra large files by splitting them into manageable data chunks and storing the file ids of the data chunks in a meta chunk. This is managed by the "weed upload/download" tool, and the weed master and volume servers are agnostic about it.
Compared to GlusterFS, Ceph
The architectures are mostly the same. SeaweedFS aims to store and read files fast, with a simple and flat architecture. The main differences are:
- SeaweedFS optimizes for small files, ensuring O(1) disk seek operation, and can also handle large files.
- SeaweedFS statically assigns a volume id for a file. Locating file content becomes just a lookup of the volume id, which can be easily cached.
- SeaweedFS Filer metadata store can be any well-known and proven data store, e.g., Redis, Cassandra, HBase, Mongodb, Elastic Search, MySql, Postgres, Sqlite, MemSql, TiDB, CockroachDB, Etcd, YDB etc, and is easy to customize.
- SeaweedFS Volume server also communicates directly with clients via HTTP, supporting range queries, direct uploads, etc.
| System | File Metadata | File Content Read | POSIX | REST API | Optimized for large number of small files |
|---|---|---|---|---|---|
| SeaweedFS | lookup volume id, cacheable | O(1) disk seek | | Yes | Yes |
| SeaweedFS Filer | Linearly Scalable, Customizable | O(1) disk seek | FUSE | Yes | Yes |
| GlusterFS | hashing | | FUSE, NFS | | |
| Ceph | hashing + rules | | FUSE | Yes | |
| MooseFS | in memory | | FUSE | | No |
| MinIO | separate meta file for each file | | | Yes | No |
Compared to GlusterFS
GlusterFS stores files, both directories and content, in configurable volumes called "bricks".
GlusterFS hashes the path and filename into ids, which are assigned to virtual volumes and then mapped to "bricks".
Compared to MooseFS
MooseFS chooses to neglect the small file issue. From the moosefs 3.0 manual, "even a small file will occupy 64KiB plus additionally 4KiB of checksums and 1KiB for the header", because it "was initially designed for keeping large amounts (like several thousands) of very big files".
MooseFS Master Server keeps all meta data in memory. Same issue as HDFS namenode.
Compared to Ceph
Ceph can be set up similarly to SeaweedFS as a key->blob store. It is much more complicated, with the need to support layers on top of it. Here is a more detailed comparison.
SeaweedFS has a centralized master group to look up free volumes, while Ceph uses hashing and metadata servers to locate its objects. Having a centralized master makes it easy to code and manage.
Like SeaweedFS, Ceph is built on an object store, in its case RADOS. Ceph is rather complicated with mixed reviews.
Ceph uses CRUSH hashing to automatically manage data placement, which is efficient for locating the data. But the data has to be placed according to the CRUSH algorithm. Any wrong configuration would cause data loss. Topology changes, such as adding new servers to increase capacity, will cause data migration with high IO cost to fit the CRUSH algorithm. SeaweedFS places data by assigning it to any writable volume. If writes to one volume fail, just pick another volume to write. Adding more volumes is also as simple as it can be.
SeaweedFS is optimized for small files. Small files are stored as one continuous block of content, with at most 8 unused bytes between files. Small file access is O(1) disk read.
SeaweedFS Filer uses off-the-shelf stores, such as MySql, Postgres, Sqlite, Mongodb, Redis, Elastic Search, Cassandra, HBase, MemSql, TiDB, CockroachDB, Etcd, YDB, to manage file directories. These stores are proven, scalable, and easier to manage.
| SeaweedFS | comparable to Ceph | advantage |
|---|---|---|
| Master | MDS | simpler |
| Volume | OSD | optimized for small files |
| Filer | Ceph FS | linearly scalable, Customizable, O(1) or O(logN) |
Compared to MinIO
MinIO follows AWS S3 closely and is ideal for testing the S3 API. It has a good UI, policies, versioning, etc. SeaweedFS is trying to catch up here. It is also possible to put MinIO as a gateway in front of SeaweedFS later.
MinIO metadata is kept in simple files. Each file write incurs extra writes to the corresponding meta file.
MinIO is not optimized for lots of small files. The files are simply stored as is on local disks. The extra meta file and the shards for erasure coding only amplify the LOSF (lots of small files) problem.
MinIO has multiple disk IO to read one file. SeaweedFS has O(1) disk reads, even for erasure coded files.
MinIO has full-time erasure coding. SeaweedFS uses replication on hot data for faster speed and optionally applies erasure coding on warm data.
MinIO does not have POSIX-like API support.
MinIO has specific requirements on storage layout. It is not flexible to adjust capacity. In SeaweedFS, just start one volume server pointing to the master. That's all.
Dev Plan
- More tools and documentation, on how to manage and scale the system.
- Read and write stream data.
- Support structured data.
This is a super exciting project! And we need helpers and support!
Installation Guide
Installation guide for users who are not familiar with golang
Step 1: install Go on your machine and set up the environment by following the instructions at:
https://golang.org/doc/install
Make sure to define your $GOPATH.
Step 2: checkout this repo:
git clone https://github.com/seaweedfs/seaweedfs.git
Step 3: download, compile, and install the project by executing the following command:
cd seaweedfs/weed && make install
Once this is done, you will find the executable "weed" in your $GOPATH/bin directory.
For more installation options, including how to run with Docker, see the Getting Started guide.
Disk Related Topics
Hard Drive Performance
Testing read performance on SeaweedFS is essentially a test of your hard drive's random read speed. Hard drives usually deliver 100 MB/s to 200 MB/s.
Solid State Disk
To modify or delete small files, an SSD must erase a whole block at a time and move the content of existing blocks to a new block. An SSD is fast when brand new, but it gets fragmented over time and has to garbage collect by compacting blocks. SeaweedFS is friendly to SSDs since it is append-only. Deletion and compaction are done at the volume level in the background, without slowing reads or causing fragmentation.
Benchmark
My own unscientific single-machine results on a MacBook with a solid state disk, CPU: 1 Intel Core i7 2.6GHz.
Write 1 million 1KB files:
Concurrency Level: 16
Time taken for tests: 66.753 seconds
Completed requests: 1048576
Failed requests: 0
Total transferred: 1106789009 bytes
Requests per second: 15708.23 [#/sec]
Transfer rate: 16191.69 [Kbytes/sec]
Connection Times (ms)
min avg max std
Total: 0.3 1.0 84.3 0.9
Percentage of the requests served within a certain time (ms)
50% 0.8 ms
66% 1.0 ms
75% 1.1 ms
80% 1.2 ms
90% 1.4 ms
95% 1.7 ms
98% 2.1 ms
99% 2.6 ms
100% 84.3 ms
Randomly read 1 million files:
Concurrency Level: 16
Time taken for tests: 22.301 seconds
Completed requests: 1048576
Failed requests: 0
Total transferred: 1106812873 bytes
Requests per second: 47019.38 [#/sec]
Transfer rate: 48467.57 [Kbytes/sec]
Connection Times (ms)
min avg max std
Total: 0.0 0.3 54.1 0.2
Percentage of the requests served within a certain time (ms)
50% 0.3 ms
90% 0.4 ms
98% 0.6 ms
99% 0.7 ms
100% 54.1 ms
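The derived figures in the two reports above can be cross-checked from the raw counts. The sketch below assumes that requests per second is completed requests divided by elapsed time, and that the transfer rate uses 1 Kbyte = 1024 bytes; both are assumptions about how the benchmark tool computes its summary. Small differences from the reported values are expected because the elapsed time is printed rounded to milliseconds.

```go
package main

import "fmt"

// run holds the raw numbers copied from one benchmark report above.
type run struct {
	name     string
	seconds  float64
	requests float64
	bytes    float64
}

// requestsPerSecond assumes throughput = completed requests / elapsed time.
func requestsPerSecond(r run) float64 {
	return r.requests / r.seconds
}

// kbytesPerSecond assumes 1 Kbyte = 1024 bytes.
func kbytesPerSecond(r run) float64 {
	return r.bytes / 1024 / r.seconds
}

func main() {
	runs := []run{
		{name: "write", seconds: 66.753, requests: 1048576, bytes: 1106789009},
		{name: "read", seconds: 22.301, requests: 1048576, bytes: 1106812873},
	}
	for _, r := range runs {
		fmt.Printf("%s: %.2f requests/sec, %.2f Kbytes/sec\n",
			r.name, requestsPerSecond(r), kbytesPerSecond(r))
	}
}
```

The same arithmetic shows that random reads ran roughly three times faster than writes in this test.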
Run WARP and launch a mixed benchmark.
make benchmark
warp: Benchmark data written to "warp-mixed-2025-12-05[194844]-kBpU.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 42s.
* Throughput: 55.13 obj/s
Operation: GET, 45%, Concurrency: 20, Ran 42s.
* Throughput: 2477.45 MiB/s, 247.75 obj/s
Operation: PUT, 15%, Concurrency: 20, Ran 42s.
* Throughput: 825.85 MiB/s, 82.59 obj/s
Operation: STAT, 30%, Concurrency: 20, Ran 42s.
* Throughput: 165.27 obj/s
Cluster Total: 3302.88 MiB/s, 550.51 obj/s over 43s.
Enterprise
For enterprise users, please visit seaweedfs.com for the SeaweedFS Enterprise Edition, which has a self-healing storage format with better data protection.
License
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The text of this page is available for modification and reuse under the terms of the Creative Commons Attribution-Sharealike 3.0 Unported License and the GNU Free Documentation License (unversioned, with no invariant sections, front-cover texts, or back-cover texts).



