Chris Lu 72c2c7ef8b Add iceberg_maintenance plugin worker handler (Phase 1) (#8501)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MaxSnapshotsToKeep

Use int64(wouldKeep) instead of int(config.MaxSnapshotsToKeep) to
avoid potential truncation on 32-bit platforms (CodeQL high severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MinInputFiles

Use int64(len(manifests)) instead of int(config.MinInputFiles) to
avoid potential truncation on 32-bit platforms (CodeQL high severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix unsafe int64→int narrowing for MaxCommitRetries

Clamp MaxCommitRetries to [1,20] range and keep as int64 throughout
the retry loop to avoid truncation on 32-bit platforms (CodeQL high
severity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
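The clamp described above can be sketched as follows (illustrative names, not the actual SeaweedFS code); the point is that the value stays `int64` throughout rather than being narrowed to `int`:

```go
package main

import "fmt"

// clampRetries keeps the configured retry count as int64 and restricts it
// to the [1, 20] range, so no 32-bit narrowing ever happens.
func clampRetries(v int64) int64 {
	if v < 1 {
		return 1
	}
	if v > 20 {
		return 20
	}
	return v
}

func main() {
	fmt.Println(clampRetries(0), clampRetries(7), clampRetries(100)) // prints: 1 7 20
}
```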

* Sort snapshots explicitly by timestamp in expireSnapshots

The previous logic relied on implicit ordering of the snapshot list.
Now explicitly sorts snapshots by timestamp descending (most recent
first) and uses a simpler keep-count loop: keep the first
MaxSnapshotsToKeep newest snapshots plus the current snapshot
unconditionally, then expire the rest that exceed the retention window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
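The keep-count logic described above can be sketched like this (a simplified stand-in for the real handler, with hypothetical type and function names):

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot is a hypothetical stand-in for Iceberg snapshot metadata.
type snapshot struct {
	ID          int64
	TimestampMs int64
}

// planExpiration sorts snapshots newest first, keeps the first maxKeep plus
// the current snapshot unconditionally, and expires the rest that fall
// outside the retention window (older than cutoffMs).
func planExpiration(snaps []snapshot, currentID, maxKeep, cutoffMs int64) (expired []int64) {
	sort.Slice(snaps, func(i, j int) bool {
		return snaps[i].TimestampMs > snaps[j].TimestampMs // most recent first
	})
	for i, s := range snaps {
		if int64(i) < maxKeep || s.ID == currentID {
			continue // always keep the newest maxKeep and the current snapshot
		}
		if s.TimestampMs < cutoffMs {
			expired = append(expired, s.ID)
		}
	}
	return expired
}

func main() {
	snaps := []snapshot{{1, 100}, {2, 200}, {3, 300}, {4, 400}}
	fmt.Println(planExpiration(snaps, 4, 1, 250)) // prints: [2 1]
}
```

Note the `int64(i) < maxKeep` comparison widens the loop index instead of narrowing the config value, matching the narrowing fixes above.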

* Handle errors properly in listFilerEntries

Previously all errors from ListEntries and Recv were silently swallowed.
Now: treat "not found" errors as empty directory, propagate other
ListEntries errors, and check for io.EOF explicitly on Recv instead of
breaking on any error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix overly broad HasSuffix check in orphan detection

The bare strings.HasSuffix(ref, entry.Name) could match files with
similar suffixes (e.g. "123.avro" matching "snap-123.avro"). Replaced
with exact relPath match and a "/"-prefixed suffix check to avoid
false positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
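The corrected matching rule can be sketched as (function name is illustrative): a candidate is referenced only when a reference equals the table-relative path exactly or ends with `"/"` followed by it, which a bare `strings.HasSuffix` does not guarantee:

```go
package main

import (
	"fmt"
	"strings"
)

// isReferenced returns true only for an exact relative-path match or a
// "/"-prefixed suffix match, so "123.avro" no longer matches "snap-123.avro".
func isReferenced(ref, relPath string) bool {
	return ref == relPath || strings.HasSuffix(ref, "/"+relPath)
}

func main() {
	fmt.Println(isReferenced("s3://bucket/tbl/metadata/snap-123.avro", "123.avro")) // false
	fmt.Println(isReferenced("s3://bucket/tbl/metadata/123.avro", "123.avro"))      // true
}
```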

* Replace fmt.Sscanf with strconv.Atoi in extractMetadataVersion

strconv.Atoi is more explicit and less fragile than fmt.Sscanf for
parsing a simple integer from a trimmed string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
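A minimal sketch of the parser, assuming the standard Iceberg naming pattern `v<N>.metadata.json` and returning 0 for anything unparseable (a later commit in this log adds a timestamp nonce to filenames, which this sketch does not handle):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// extractMetadataVersion parses N from "vN.metadata.json", returning 0 on
// any malformed input instead of relying on fmt.Sscanf's looser behavior.
func extractMetadataVersion(name string) int {
	s := strings.TrimSuffix(name, ".metadata.json")
	if s == name || !strings.HasPrefix(s, "v") {
		return 0 // missing suffix or missing "v" prefix
	}
	n, err := strconv.Atoi(strings.TrimPrefix(s, "v"))
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	fmt.Println(extractMetadataVersion("v3.metadata.json"))      // 3
	fmt.Println(extractMetadataVersion("invalid.metadata.json")) // 0
}
```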

* Recursively traverse directories for orphan file detection

The orphan cleanup only listed a single directory level under data/
and metadata/, skipping IsDirectory entries. Partitioned Iceberg
tables store data files in nested partition directories (e.g.
data/region=us-east/file.parquet) which were never evaluated.

Add walkFilerEntries helper that recursively descends into
subdirectories, and use it in removeOrphans so all nested files
are considered for orphan checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix manifest path drift from double time.Now() calls

rewriteManifests called time.Now().UnixMilli() twice: once for the
path embedded in WriteManifest and once for the filename passed to
saveFilerFile. These timestamps would differ, causing the manifest's
internal path reference to not match the actual saved filename.

Compute the filename once and reuse it for both WriteManifest and
saveFilerFile so they always reference the same path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add TestManifestRewritePathConsistency test

Verifies that WriteManifest returns a ManifestFile whose FilePath()
matches the path passed in, and that path.Base() of that path matches
the filename used for saveFilerFile. This validates the single-
timestamp pattern used in rewriteManifests produces consistent paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make parseOperations return error on unknown operations

Previously parseOperations silently dropped unknown operation names
and could return an empty list. Now validates inputs against the
canonical set and returns a clear error if any unknown operation is
specified. Updated Execute to surface the error instead of proceeding
with an empty operation list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use gRPC status codes instead of string matching in listFilerEntries

Replace brittle strings.Contains(err.Error(), "not found") check with
status.Code(err) == codes.NotFound for proper gRPC error handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add stale-plan guard in commit closures for expireSnapshots and rewriteManifests

Both operations plan outside the commit mutation using a snapshot ID
captured from the initial metadata read. If the table head advances
concurrently, the mutation would create a snapshot parented to the
wrong head or remove snapshots based on a stale view.

Add a guard inside each mutation closure that verifies
currentMeta.CurrentSnapshot().SnapshotID still matches the planned
snapshot ID. If it differs, return errStalePlan which propagates
immediately (not retried, since the plan itself is invalid).

Also fix rewriteManifests to derive SequenceNumber from the fresh
metadata (cs.SequenceNumber) instead of the captured currentSnap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add compare-and-swap to updateTableMetadataXattr

updateTableMetadataXattr previously re-read the entry but did not
verify the metadataVersion matched what commitWithRetry had loaded.
A concurrent update could be silently clobbered.

Now accepts expectedVersion parameter and compares it against the
stored metadataVersion before writing. Returns errMetadataVersionConflict
on mismatch, which commitWithRetry treats as retryable (deletes the
staged metadata file and retries with fresh state).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Export shared plugin worker helpers for use by sub-packages

Export ShouldSkipDetectionByInterval, BuildExecutorActivity, and
BuildDetectorActivity so the iceberg sub-package can reuse them
without duplicating logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Refactor iceberg maintenance handler into weed/plugin/worker/iceberg package

Split the 1432-line iceberg_maintenance_handler.go into focused files
in a new iceberg sub-package: handler.go, config.go, detection.go,
operations.go, filer_io.go, and compact.go (Phase 2 data compaction).

Key changes:
- Rename types to drop stutter (IcebergMaintenanceHandler → Handler, etc.)
- Fix loadFileByIcebergPath to preserve nested directory paths via
  normalizeIcebergPath instead of path.Base which dropped subdirectories
- Check SendProgress errors instead of discarding them
- Add stale-plan guard to compactDataFiles commitWithRetry closure
- Add "compact" operation to parseOperations canonical order
- Duplicate readStringConfig/readInt64Config helpers (~20 lines)
- Update worker_runtime.go to import new iceberg sub-package

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove iceberg_maintenance from default plugin worker job types

Iceberg maintenance is not yet ready to be enabled by default.
Workers can still opt in by explicitly listing iceberg_maintenance
in their job types configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Clamp config values to safe minimums in ParseConfig

Prevents misconfiguration by enforcing minimum values using the
default constants for all config fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Harden filer I/O: path helpers, strict CAS guard, path traversal prevention

- Use path.Dir/path.Base instead of strings.SplitN in loadCurrentMetadata
- Make CAS guard error on missing or unparseable metadataVersion
- Add path.Clean and traversal validation in loadFileByIcebergPath

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix compact: single snapshot ID, oversized bin splitting, ensureFilerDir

- Use single newSnapID for all manifest entries in a compaction run
- Add splitOversizedBin to break bins exceeding targetSize
- Make ensureFilerDir only create on NotFound, propagate other errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add wildcard filters, scan limit, and context cancellation to table scanning

- Use wildcard matchers (*, ?) for bucket/namespace/table filters
- Add limit parameter to scanTablesForMaintenance for early termination
- Add ctx.Done() checks in bucket and namespace scan loops
- Update filter UI descriptions and placeholders for wildcard support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove dead detection interval check and validate namespace parameter

- Remove ineffective ShouldSkipDetectionByInterval call with hardcoded 0
- Add namespace to required parameter validation in Execute

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Improve operations: exponential backoff, orphan matching, full file cleanup

- Use exponential backoff (50ms, 100ms, 200ms, ...) in commitWithRetry
- Use normalizeIcebergPath for orphan matching instead of fragile suffix check
- Add collectSnapshotFiles to traverse manifest lists → manifests → data files
- Delete all unreferenced files after expiring snapshots, not just manifest lists
- Refactor removeOrphans to reuse collectSnapshotFiles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* iceberg: fix ensureFilerDir to handle filer_pb.ErrNotFound sentinel

filer_pb.LookupEntry converts gRPC NotFound errors to filer_pb.ErrNotFound
(a plain sentinel), so status.Code() never returns codes.NotFound for that
error. This caused ensureFilerDir to return an error instead of creating
the directory when it didn't exist.

* iceberg: clean up orphaned artifacts when compaction commit fails

Track all files written during compaction (merged data files, manifest,
manifest list) and delete them if the commit or any subsequent write step
fails, preventing orphaned files from accumulating in the filer.

* iceberg: derive tablePath from namespace/tableName when empty

An empty table_path parameter would be passed to maintenance operations
unchecked. Default it to path.Join(namespace, tableName) when not provided.

* iceberg: make collectSnapshotFiles return error on read/parse failure

Previously, errors reading manifests were logged and skipped, returning a
partial reference set. This could cause incorrect delete decisions during
snapshot expiration or orphan cleanup. Now the function returns an error
and all callers abort when reference data is incomplete.

* iceberg: include active metadata file in removeOrphans referenced set

The metadataFileName returned by loadCurrentMetadata was discarded, so
the active metadata file could be incorrectly treated as an orphan and
deleted. Capture it and add it to the referencedFiles map.

* iceberg: only retry commitWithRetry on metadata version conflicts

Previously all errors from updateTableMetadataXattr triggered retries.
Now only errMetadataVersionConflict causes retry; other errors (permissions,
transport, malformed xattr) fail immediately.

* iceberg: respect req.Limit in fakeFilerServer.ListEntries mock

The mock ListEntries ignored the Limit field, so tests couldn't exercise
pagination. Now it stops streaming once Limit entries have been sent.

* iceberg: validate parquet schema compatibility before merging files

mergeParquetFiles now compares each source file's schema against the
first file's schema and aborts with a clear error if they differ, instead
of blindly writing rows that could panic or produce corrupt output.

* iceberg: normalize empty JobType to canonical jobType in Execute events

When request.Job.JobType is empty, status events and completion messages
were emitted with a blank job type. Derive a canonical value early and
use it consistently in all outbound events.

* iceberg: log warning on unexpected config value types in read helpers

readStringConfig and readInt64Config now log a V(1) warning when they
encounter an unhandled ConfigValue kind, aiding debugging of unexpected
config types that silently fall back to defaults.

* worker: add iceberg_maintenance to default plugin worker job types

Workers using the default job types list didn't advertise the
iceberg_maintenance handler despite the handler and canonical name
being registered. Add it so workers pick up the handler by default.

* iceberg: use defer and detached context for compaction artifact cleanup

The cleanup closure used the job context which could already be canceled,
and was not called on ctx.Done() early exits. Switch to a deferred
cleanup with a detached context (30s timeout) so artifact deletion
completes on all exit paths including context cancellation.

* iceberg: use proportional jitter in commitWithRetry backoff

A fixed 25ms max jitter becomes insignificant at higher retry attempts.
Use 0-20% of the current backoff value instead so jitter scales with
the exponential delay.

* iceberg: add malformed filename cases to extractMetadataVersion test

Cover edge cases like "invalid.metadata.json", "metadata.json", "",
and "v.metadata.json" to ensure the function returns 0 for unparseable
inputs.

* iceberg: fail compaction on manifest read errors and skip delete manifests

Previously, unreadable manifests were silently skipped during compaction,
which could drop live files from the entry set. Now manifest read/parse
errors are returned as fatal errors.

Also abort compaction when delete manifests exist since the compactor
does not apply deletes — carrying them through unchanged could produce
incorrect results.

* iceberg: use table-relative path for active metadata file in orphan scan

metadataFileName was stored as a basename (e.g. "v1.metadata.json") but
the orphan scanner matches against table-relative paths like
"metadata/v1.metadata.json". Prefix with "metadata/" so the active
metadata file is correctly recognized as referenced.

* iceberg: fix MetadataBuilderFromBase location to use metadata file path

The second argument to MetadataBuilderFromBase records the previous
metadata file in the metadata log. Using meta.Location() (the table
root) was incorrect — it must be the actual metadata file path so
old metadata files can be tracked and eventually cleaned up.

* iceberg: update metadataLocation and versionToken in xattr on commit

updateTableMetadataXattr was only updating metadataVersion,
modifiedAt, and fullMetadata but not metadataLocation or
versionToken. This left catalog state inconsistent after
maintenance commits — the metadataLocation still pointed to the
old metadata file and the versionToken was stale.

Add a newMetadataLocation parameter and regenerate the
versionToken on every commit, matching the S3 Tables handler
behavior.

* iceberg: group manifest entries by partition spec in rewriteManifests

rewriteManifests was writing all entries into a single manifest
using the table's current partition spec. For spec-evolved tables
where manifests reference different partition specs, this produces
an invalid manifest.

Group entries by the source manifest's PartitionSpecID and write
one merged manifest per spec, looking up each spec from the
table's PartitionSpecs list.

* iceberg: remove dead code loop for non-data manifests in compaction

The early abort guard at the top of compactDataFiles already ensures
no delete manifests are present. The loop that copied non-data
manifests into allManifests was unreachable dead code.

* iceberg: use JSON encoding in partitionKey for unambiguous grouping

partitionKey used fmt.Sprintf("%d=%v") joined by commas, which
produces ambiguous keys when partition values contain commas or '='.
Use json.Marshal for values and NUL byte as separator to eliminate
collisions.
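The encoding change can be sketched like this (illustrative signature, not the actual partitionKey): JSON-encoding each value and separating fields with NUL bytes makes keys collision-free even when values contain ',' or '=':

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// partitionKey builds an unambiguous grouping key: field id, NUL,
// JSON-encoded value, NUL, repeated for each partition field.
func partitionKey(fieldIDs []int, values []any) (string, error) {
	var b strings.Builder
	for i, id := range fieldIDs {
		v, err := json.Marshal(values[i])
		if err != nil {
			return "", err
		}
		fmt.Fprintf(&b, "%d\x00%s\x00", id, v)
	}
	return b.String(), nil
}

func main() {
	// Under fmt.Sprintf("%d=%v") joined by commas, both of these would
	// flatten to the same string "1=a,2=b".
	k1, _ := partitionKey([]int{1}, []any{"a,2=b"})
	k2, _ := partitionKey([]int{1, 2}, []any{"a", "b"})
	fmt.Println(k1 == k2) // prints: false
}
```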

* iceberg: precompute normalized reference set in removeOrphans

The orphan check was O(files × refs) because it normalized each
reference path inside the per-file loop. Precompute the normalized
set once for O(1) lookups per candidate file.

* iceberg: add artifact cleanup to rewriteManifests on commit failure

rewriteManifests writes merged manifests and a manifest list to
the filer before committing but did not clean them up on failure.
Add the same deferred cleanup pattern used by compactDataFiles:
track written artifacts and delete them if the commit does not
succeed.

* iceberg: pass isDeleteData=true in deleteFilerFile

deleteFilerFile called DoRemove with isDeleteData=false, which only
removed filer metadata and left chunk data behind on volume servers.
All other data-file deletion callers in the codebase pass true.

* iceberg: clean up test: remove unused snapID, simplify TestDetectWithFakeFiler

Remove unused snapID variable and eliminate the unnecessary second
fake filer + entry copy in TestDetectWithFakeFiler by capturing
the client from the first startFakeFiler call.

* fix: update TestWorkerDefaultJobTypes to expect 5 job types

The test expected 4 default job types but iceberg_maintenance was
added as a 5th default in a previous commit.

* iceberg: document client-side CAS TOCTOU limitation in updateTableMetadataXattr

Add a note explaining the race window where two workers can both
pass the version check and race at UpdateEntry. The proper fix
requires server-side precondition support on UpdateEntryRequest.

* iceberg: remove unused sender variable in TestFullExecuteFlow

* iceberg: abort compaction when multiple partition specs are present

The compactor writes all entries into a single manifest using the
current partition spec, which is invalid for spec-evolved tables.
Detect multiple PartitionSpecIDs and skip compaction until
per-spec compaction is implemented.

* iceberg: validate tablePath to prevent directory traversal

Sanitize the table_path parameter with path.Clean and verify it
matches the expected namespace/tableName prefix to prevent path
traversal attacks via crafted job parameters.

* iceberg: cap retry backoff at 5s and make it context-aware

The exponential backoff could grow unbounded and blocked on
time.Sleep ignoring context cancellation. Cap at 5s and use
a timer with select on ctx.Done so retries respect cancellation.

* iceberg: write manifest list with new snapshot identity in rewriteManifests

The manifest list was written with the old snapshot's ID and sequence
number, but the new snapshot created afterwards used a different
identity. Compute newSnapshotID and newSeqNum before writing
manifests and the manifest list so all artifacts are consistent.

* ec: also remove .vif file in removeEcVolumeFiles

removeEcVolumeFiles cleaned up .ecx, .ecj, and shard files but
not the .vif volume info file, leaving it orphaned. The .vif file
lives in the data directory alongside shard files.

The directory handling for index vs data files was already correct:
.ecx/.ecj are removed from IdxDirectory and shard files from
Directory, matching how NewEcVolume loads them.

* Revert "ec: also remove .vif file in removeEcVolumeFiles"

This reverts commit acc82449e12a00115268a5652aef0d6c46d9f2dd.

* iceberg: skip orphan entries with nil Attributes instead of defaulting to epoch

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
making unknown-age entries appear ancient and eligible for deletion.
Skip these entries instead to avoid deleting files whose age cannot
be determined.

* iceberg: use unique metadata filenames to prevent concurrent write clobbering

Add timestamp nonce to metadata filenames (e.g. v3-1709766000.metadata.json)
so concurrent writers stage to distinct files. Update extractMetadataVersion
to strip the nonce suffix, and loadCurrentMetadata to read the actual filename
from the metadataLocation xattr field.

* iceberg: defer artifact tracking until data file builder succeeds

Move the writtenArtifacts append to after NewDataFileBuilder succeeds,
so a failed builder doesn't leave a stale entry for an already-deleted
file in the cleanup list.

* iceberg: use detached context for metadata file cleanup

Use context.WithTimeout(context.Background(), 10s) when deleting staged
metadata files after CAS failure, so cleanup runs even if the original
request context is canceled.

* test: update default job types count to include iceberg_maintenance

* iceberg: use parquet.EqualNodes for structural schema comparison

Replace String()-based schema comparison with parquet.EqualNodes which
correctly compares types, repetition levels, and logical types.

* iceberg: add nonce-suffixed filename cases to TestExtractMetadataVersion

* test: assert iceberg_maintenance is present in default job types

* iceberg: validate operations config early in Detect

Call parseOperations in Detect so typos in the operations config fail
fast before emitting proposals, matching the validation already done
in Execute.

* iceberg: detect chunked files in loadFileByIcebergPath

Return an explicit error when a file has chunks but no inline content,
rather than silently returning empty data. Data files uploaded via S3
are stored as chunks, so compaction would otherwise produce corrupt
merged files.

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 12:45:33 -08:00

SeaweedFS




SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome backers. If you'd like to grow SeaweedFS even stronger, please consider joining our sponsors on Patreon.

Your support will be really appreciated by me and other supporters!

Gold Sponsors

nodion piknik keepsec


Table of Contents

Quick Start

Quick Start with weed mini

The easiest way to get started with SeaweedFS for development and testing:

Example:

# remove quarantine on macOS
# xattr -d com.apple.quarantine  ./weed

./weed mini -dir=/data

This single command starts a complete SeaweedFS setup with a master, a volume server, a filer, and an S3 gateway.

Perfect for development, testing, learning SeaweedFS, and single node deployments!

Quick Start for S3 API on Docker

docker run -p 8333:8333 chrislusf/seaweedfs server -s3

Quick Start with Single Binary

  • Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip a single binary file weed or weed.exe. Or run go install github.com/seaweedfs/seaweedfs/weed@latest.
  • export AWS_ACCESS_KEY_ID=admin ; export AWS_SECRET_ACCESS_KEY=key as the admin credentials to access the object store.
  • Run weed server -dir=/some/data/dir -s3 to start one master, one volume server, one filer, and one S3 gateway. The difference is that weed mini auto-configures itself for a single-host environment, while weed server requires manual configuration and is designed for production use.

Also, to increase capacity, just add more volume servers by running weed volume -dir="/some/data/dir2" -master="<master_host>:9333" -port=8081 locally, or on a different machine, or on thousands of machines. That is it!

Introduction

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:

  1. to store billions of files!
  2. to serve the files fast!

SeaweedFS started as a blob store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages volumes on volume servers, and these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (O(1), usually just one disk read operation).

There are only 40 bytes of disk storage overhead for each file's metadata. It is so simple with O(1) disk reads that you are welcome to challenge the performance with your actual use cases.

SeaweedFS started by implementing Facebook's Haystack design paper. Also, SeaweedFS implements erasure coding with ideas from f4: Facebook's Warm BLOB Storage System, and has a lot of similarities with Facebook's Tectonic Filesystem and Google's Colossus File System.

On top of the blob store, the optional Filer can support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySQL, Postgres, Redis, Cassandra, HBase, MongoDB, Elasticsearch, LevelDB, RocksDB, SQLite, MemSQL, TiDB, etcd, CockroachDB, YDB, etc.

SeaweedFS can transparently integrate with the cloud. With hot data on local cluster, and warm data on the cloud with O(1) access time, SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. What's more, the cloud storage access API cost is minimized. Faster and cheaper than direct cloud storage!

Back to TOC

Features

Additional Blob Store Features

  • Support different replication levels, with rack and data center awareness.
  • Automatic master servers failover - no single point of failure (SPOF).
  • Automatic compression depending on file MIME type.
  • Automatic compaction to reclaim disk space after deletion or update.
  • Automatic entry TTL expiration.
  • Flexible Capacity Expansion: Any server with some disk space can be added to the total storage space.
  • Adding/Removing servers does not cause any data re-balancing unless triggered by admin commands.
  • Optional picture resizing.
  • Support ETag, Accept-Range, Last-Modified, etc.
  • Support in-memory/leveldb/readonly mode tuning for memory/performance balance.
  • Support rebalancing the writable and readonly volumes.
  • Customizable Multiple Storage Tiers: Customizable storage disk types to balance performance and cost.
  • Transparent cloud integration: unlimited capacity via tiered cloud storage for warm data.
  • Erasure Coding for warm storage: Rack-Aware 10.4 erasure coding reduces storage cost and increases availability. The Enterprise version can customize the EC ratio.

Back to TOC

Filer Features

Kubernetes

Back to TOC

Example: Using Seaweed Blob Store

By default, the master node runs on port 9333, and the volume nodes run on port 8080. Let's start one master node, and two volume nodes on ports 8080 and 8081. Ideally, they should be started from different machines. We'll use localhost as an example.

SeaweedFS uses HTTP REST operations to read, write, and delete. The responses are in JSON or JSONP format.

Start Master Server

> ./weed master

Start Volume Servers

> weed volume -dir="/tmp/data1" -max=5  -master="localhost:9333" -port=8080 &
> weed volume -dir="/tmp/data2" -max=10 -master="localhost:9333" -port=8081 &

Write A Blob

A blob, also referred to as a needle, a chunk, or mistakenly as a file, is just a byte array. It can have attributes, such as name, mime type, create or update time, etc. But basically it is just a byte array of a relatively small size, such as 2 MB ~ 64 MB. The size is not fixed.

To upload a blob: first, send an HTTP POST, PUT, or GET request to /dir/assign to get an fid and a volume server URL:

> curl http://localhost:9333/dir/assign
{"count":1,"fid":"3,01637037d6","url":"127.0.0.1:8080","publicUrl":"localhost:8080"}

Second, to store the blob content, send an HTTP multipart POST request to url + '/' + fid from the response:

> curl -F file=@/home/chris/myphoto.jpg http://127.0.0.1:8080/3,01637037d6
{"name":"myphoto.jpg","size":43234,"eTag":"1cc0118e"}

To update, send another POST request with updated blob content.

For deletion, send an HTTP DELETE request to the same url + '/' + fid:

> curl -X DELETE http://127.0.0.1:8080/3,01637037d6

Save Blob Id

Now, you can save the fid, 3,01637037d6 in this case, to a database field.

The number 3 at the start represents a volume id. After the comma come a file key, 01, and a file cookie, 637037d6.

The volume id is an unsigned 32-bit integer. The file key is an unsigned 64-bit integer. The file cookie is an unsigned 32-bit integer, used to prevent URL guessing.

The file key and file cookie are both coded in hex. You can store the <volume id, file key, file cookie> tuple in your own format, or simply store the fid as a string.

If stored as a string, in theory, you would need 8+1+16+8=33 bytes. A char(33) would be enough, if not more than enough, since most uses will not need 2^32 volumes.

If space is really a concern, you can store the file id in binary format. You would need a 4-byte integer for the volume id, an 8-byte integer for the file key, and a 4-byte integer for the file cookie. So 16 bytes are more than enough.
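As an illustration of the layout above (a sketch, not SeaweedFS's actual parser), a file id like 3,01637037d6 splits into a decimal volume id before the comma, then hex digits whose last 8 characters are the 32-bit cookie and whose remainder is the key:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFid decodes "volumeId,keyHex+cookieHex" where the cookie is always
// the final 8 hex digits (32 bits) and the key is everything before them.
func parseFid(fid string) (volumeID uint32, key uint64, cookie uint32, err error) {
	comma := strings.IndexByte(fid, ',')
	if comma < 0 || len(fid)-comma-1 <= 8 {
		return 0, 0, 0, fmt.Errorf("malformed fid %q", fid)
	}
	v, err := strconv.ParseUint(fid[:comma], 10, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	rest := fid[comma+1:]
	k, err := strconv.ParseUint(rest[:len(rest)-8], 16, 64)
	if err != nil {
		return 0, 0, 0, err
	}
	c, err := strconv.ParseUint(rest[len(rest)-8:], 16, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	return uint32(v), k, uint32(c), nil
}

func main() {
	v, k, c, _ := parseFid("3,01637037d6")
	fmt.Printf("volume=%d key=%x cookie=%x\n", v, k, c) // volume=3 key=1 cookie=637037d6
}
```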

Read a Blob

Here is an example of how to render the URL.

First look up the volume server's URLs by the file's volumeId:

> curl http://localhost:9333/dir/lookup?volumeId=3
{"volumeId":"3","locations":[{"publicUrl":"localhost:8080","url":"localhost:8080"}]}

Since (usually) there are not too many volume servers, and volumes don't move often, you can cache the results most of the time. Depending on the replication type, one volume can have multiple replica locations. Just randomly pick one location to read.

Now you can take the public URL, render the URL or directly read from the volume server via URL:

 http://localhost:8080/3,01637037d6.jpg

Notice we add a file extension ".jpg" here. It's optional and just one way for the client to specify the file content type.

If you want a nicer URL, you can use one of these alternative URL formats:

 http://localhost:8080/3/01637037d6/my_preferred_name.jpg
 http://localhost:8080/3/01637037d6.jpg
 http://localhost:8080/3,01637037d6.jpg
 http://localhost:8080/3/01637037d6
 http://localhost:8080/3,01637037d6

If you want to get a scaled version of an image, you can add some params:

http://localhost:8080/3/01637037d6.jpg?height=200&width=200
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fit
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fill

Rack-Aware and Data Center-Aware Replication

SeaweedFS applies the replication strategy at a volume level. So, when you are getting a blob id, you can specify the replication strategy. For example:

curl http://localhost:9333/dir/assign?replication=001

The replication parameter options are:

000: no replication
001: replicate once on the same rack
010: replicate once on a different rack, but same data center
100: replicate once on a different data center
200: replicate twice on two different data centers
110: replicate once on a different rack, and once on a different data center

More details about replication can be found on the wiki.

You can also set the default replication strategy when starting the master server.

Allocate Blob Key on Specific Data Center

Volume servers can be started with a specific data center name:

 weed volume -dir=/tmp/1 -port=8080 -dataCenter=dc1
 weed volume -dir=/tmp/2 -port=8081 -dataCenter=dc2

When requesting a blob key, an optional "dataCenter" parameter can limit the assigned volume to the specific data center. For example, this specifies that the assigned volume should be limited to 'dc1':

 http://localhost:9333/dir/assign?dataCenter=dc1

Other Features

Back to TOC

Blob Store Architecture

Usually distributed file systems split each file into chunks. A central server keeps a mapping of filenames to chunks, and also which chunks each chunk server has.

The main drawback is that the central server can't handle many small files efficiently, and since all read requests need to go through the central master, it might not scale well for many concurrent users.

Instead of managing chunks, SeaweedFS manages data volumes in the master server. Each data volume is 32GB in size, and can hold a lot of blobs. And each storage node can have many data volumes. So the master node only needs to store the metadata about the volumes, which is a fairly small amount of data and is generally stable.

The actual blob metadata (the volume id, offset, and size of each blob) is stored in each volume on its volume server. Since each volume server only manages the metadata for blobs on its own disk, at just 16 bytes per blob, all metadata access can be served from memory and only one disk operation is needed to read the actual blob data.

For comparison, consider that an xfs inode structure in Linux is 536 bytes.

Master Server and Volume Server

The architecture is fairly simple. The actual data is stored in volumes on storage nodes. One volume server can host multiple volumes, and supports both read and write access with basic authentication.

All volumes are managed by a master server. The master server contains the volume id to volume server mapping. This is fairly static information, and can be easily cached.

On each write request, the master server also generates a file key, a monotonically increasing 64-bit unsigned integer. Since write requests are not generally as frequent as read requests, one master server should be able to handle the concurrency well.

Write and Read files

When a client sends a write request, the master server returns (volume id, file key, file cookie, volume node URL) for the blob. The client then contacts the volume node and POSTs the blob content.

When a client needs to read a blob based on (volume id, file key, file cookie), it asks the master server by the volume id for the (volume node URL, volume node public URL), or retrieves this from a cache. Then the client can GET the content, or just render the URL on web pages and let browsers fetch the content.

Saving memory

All blob metadata stored on a volume server is readable from memory without disk access. Each file takes just a 16-byte map entry of <64bit key, 32bit offset, 32bit size>. Of course, each map entry carries the map's own overhead on top of that. But usually the disk space runs out before the memory does.

Tiered Storage to the cloud

The local volume servers are much faster, while cloud storages have elastic capacity and are actually more cost-efficient if not accessed often (usually free to upload, but relatively costly to access). With the append-only structure and O(1) access time, SeaweedFS can take advantage of both local and cloud storage by offloading the warm data to the cloud.

Usually hot data are fresh and warm data are old. SeaweedFS puts newly created volumes on local servers, and optionally uploads the older volumes to the cloud. If the older data are accessed less often, this effectively gives you unlimited capacity with a limited number of local servers, while staying fast for new data.

With the O(1) access time, the network latency cost is kept at minimum.

If the hot/warm data is split as 20/80, with 20 servers, you can achieve storage capacity of 100 servers. That's a cost saving of 80%! Or you can repurpose the 80 servers to store new data also, and get 5X storage throughput.

Back to TOC

SeaweedFS Filer

Built on top of the blob store, SeaweedFS Filer adds directory structure to create a file system. The directory structure is an interface implemented on top of many key-value stores or databases.

The content of a file is mapped to one or many blobs, distributed to multiple volumes on multiple volume servers.

Compared to Other File Systems

Most other distributed file systems seem more complicated than necessary.

SeaweedFS is meant to be fast and simple, in both setup and operation. If you do not understand how it works when you reach here, we've failed! Please raise an issue with any questions or update this file with clarifications.

SeaweedFS is constantly moving forward. Same with other systems. These comparisons can be outdated quickly. Please help to keep them updated.

Back to TOC

Compared to HDFS

HDFS uses the chunk approach for each file, and is ideal for storing large files.

SeaweedFS is ideal for serving relatively smaller files quickly and concurrently.

SeaweedFS can also store extra large files by splitting them into manageable data chunks, storing the file ids of the data chunks in a meta chunk. This is managed by the "weed upload/download" tool, and the weed master and volume servers are agnostic about it.

Back to TOC

Compared to GlusterFS, Ceph

The architectures are mostly the same. SeaweedFS aims to store and read files fast, with a simple and flat architecture. The main differences are:

  • SeaweedFS optimizes for small files, ensuring O(1) disk seek operation, and can also handle large files.
  • SeaweedFS statically assigns a volume id for a file. Locating file content becomes just a lookup of the volume id, which can be easily cached.
  • SeaweedFS Filer metadata store can be any well-known and proven data store, e.g., Redis, Cassandra, HBase, Mongodb, Elastic Search, MySql, Postgres, Sqlite, MemSql, TiDB, CockroachDB, Etcd, YDB etc, and is easy to customize.
  • SeaweedFS Volume server also communicates directly with clients via HTTP, supporting range queries, direct uploads, etc.
System           | File Metadata                    | File Content Read | POSIX     | REST API | Optimized for large number of small files
SeaweedFS        | lookup volume id, cacheable      | O(1) disk seek    |           | Yes      | Yes
SeaweedFS Filer  | Linearly Scalable, Customizable  | O(1) disk seek    | FUSE      | Yes      | Yes
GlusterFS        | hashing                          |                   | FUSE, NFS |          |
Ceph             | hashing + rules                  |                   | FUSE      | Yes      |
MooseFS          | in memory                        |                   | FUSE      |          | No
MinIO            | separate meta file for each file |                   |           | Yes      | No

Back to TOC

Compared to GlusterFS

GlusterFS stores files, both directories and content, in configurable volumes called "bricks".

GlusterFS hashes the path and filename into ids, assigns them to virtual volumes, and then maps those to "bricks".

Back to TOC

Compared to MooseFS

MooseFS chooses to neglect the small-file issue. From the moosefs 3.0 manual, "even a small file will occupy 64KiB plus additionally 4KiB of checksums and 1KiB for the header", because it "was initially designed for keeping large amounts (like several thousands) of very big files".

The MooseFS Master Server keeps all metadata in memory. This is the same issue as the HDFS namenode.

Back to TOC

Compared to Ceph

Ceph can be set up similarly to SeaweedFS as a key->blob store. It is much more complicated, needing to support layers on top of it. Here is a more detailed comparison:

SeaweedFS has a centralized master group to look up free volumes, while Ceph uses hashing and metadata servers to locate its objects. Having a centralized master makes it easy to code and manage.

Like SeaweedFS, Ceph is built around an object store, RADOS. Ceph is rather complicated, with mixed reviews.

Ceph uses CRUSH hashing to automatically manage data placement, which is efficient for locating the data. But the data has to be placed according to the CRUSH algorithm. Any wrong configuration can cause data loss. Topology changes, such as adding new servers to increase capacity, will cause data migration with high IO cost to fit the CRUSH algorithm. SeaweedFS places data by assigning it to any writable volume. If writes to one volume fail, just pick another volume to write to. Adding more volumes is also as simple as it can be.

SeaweedFS is optimized for small files. Small files are stored as one continuous block of content, with at most 8 unused bytes between files. Small file access is O(1) disk read.

SeaweedFS Filer uses off-the-shelf stores, such as MySql, Postgres, Sqlite, Mongodb, Redis, Elastic Search, Cassandra, HBase, MemSql, TiDB, CockroachDB, Etcd, YDB, to manage file directories. These stores are proven, scalable, and easier to manage.

SeaweedFS | comparable to Ceph | advantage
Master    | MDS                | simpler
Volume    | OSD                | optimized for small files
Filer     | Ceph FS            | linearly scalable, customizable, O(1) or O(logN)

Back to TOC

Compared to MinIO

MinIO follows AWS S3 closely and is ideal for testing the S3 API. It has a good UI, policies, versioning, etc. SeaweedFS is trying to catch up here. It is also possible to put MinIO as a gateway in front of SeaweedFS later.

MinIO stores metadata in plain files: each file write incurs extra writes to the corresponding meta file.

MinIO has no optimization for lots of small files: the files are simply stored as-is on local disks. Add the extra meta file and the shards for erasure coding, and the LOSF problem is only amplified.

MinIO needs multiple disk IOs to read one file. SeaweedFS has O(1) disk reads, even for erasure-coded files.

MinIO has full-time erasure coding. SeaweedFS uses replication on hot data for faster speed and optionally applies erasure coding on warm data.

MinIO does not have POSIX-like API support.

MinIO has specific requirements on storage layout, so adjusting capacity is not flexible. In SeaweedFS, just start one volume server pointing to the master. That's all.

Dev Plan

  • More tools and documentation, on how to manage and scale the system.
  • Read and write stream data.
  • Support structured data.

This is a super exciting project! And we need helpers and support!

Back to TOC

Installation Guide

Installation guide for users who are not familiar with golang

Step 1: install Go on your machine and set up the environment by following the instructions at:

https://golang.org/doc/install

make sure to define your $GOPATH

Step 2: checkout this repo:

git clone https://github.com/seaweedfs/seaweedfs.git

Step 3: download, compile, and install the project by executing the following command

cd seaweedfs/weed && make install

Once this is done, you will find the executable "weed" in your $GOPATH/bin directory

For more installation options, including how to run with Docker, see the Getting Started guide.

Back to TOC

Hard Drive Performance

When testing read performance on SeaweedFS, it basically becomes a performance test of your hard drive's random read speed. Hard drives usually get 100MB/s~200MB/s.

Solid State Disk

To modify or delete small files, an SSD must erase a whole block at a time and move the content of existing blocks to a new block. An SSD is fast when brand new, but gets fragmented over time, forcing garbage collection and block compaction. SeaweedFS is SSD-friendly since it is append-only. Deletion and compaction are done at the volume level in the background, neither slowing reads nor causing fragmentation.

Back to TOC

Benchmark

My own unscientific single-machine results on a MacBook with a solid-state disk, CPU: 1 Intel Core i7 2.6GHz.

Write 1 million 1KB file:

Concurrency Level:      16
Time taken for tests:   66.753 seconds
Completed requests:      1048576
Failed requests:        0
Total transferred:      1106789009 bytes
Requests per second:    15708.23 [#/sec]
Transfer rate:          16191.69 [Kbytes/sec]

Connection Times (ms)
              min      avg        max      std
Total:        0.3      1.0       84.3      0.9

Percentage of the requests served within a certain time (ms)
   50%      0.8 ms
   66%      1.0 ms
   75%      1.1 ms
   80%      1.2 ms
   90%      1.4 ms
   95%      1.7 ms
   98%      2.1 ms
   99%      2.6 ms
  100%     84.3 ms

Randomly read 1 million files:

Concurrency Level:      16
Time taken for tests:   22.301 seconds
Completed requests:      1048576
Failed requests:        0
Total transferred:      1106812873 bytes
Requests per second:    47019.38 [#/sec]
Transfer rate:          48467.57 [Kbytes/sec]

Connection Times (ms)
              min      avg        max      std
Total:        0.0      0.3       54.1      0.2

Percentage of the requests served within a certain time (ms)
   50%      0.3 ms
   90%      0.4 ms
   98%      0.6 ms
   99%      0.7 ms
  100%     54.1 ms

Run WARP and launch a mixed benchmark.

make benchmark
warp: Benchmark data written to "warp-mixed-2025-12-05[194844]-kBpU.csv.zst"

Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 42s.
 * Throughput: 55.13 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 42s.
 * Throughput: 2477.45 MiB/s, 247.75 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 42s.
 * Throughput: 825.85 MiB/s, 82.59 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 42s.
 * Throughput: 165.27 obj/s

Cluster Total: 3302.88 MiB/s, 550.51 obj/s over 43s.

Back to TOC

Enterprise

For enterprise users, please visit seaweedfs.com for the SeaweedFS Enterprise Edition, which has a self-healing storage format with better data protection.

Back to TOC

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

The text of this page is available for modification and reuse under the terms of the Creative Commons Attribution-Sharealike 3.0 Unported License and the GNU Free Documentation License (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

Back to TOC

Stargazers over time


SeaweedFS is a distributed storage system for object storage (S3), file systems, and Iceberg tables, designed to handle billions of files with O(1) disk access and effortless horizontal scaling.