Go to file

Chris Lu 2ec0a67ee3 master: return 503/Unavailable during topology warmup after leader change (#8529 )

* master: return 503/Unavailable during topology warmup after leader change

After a master restart or leader change, the topology is empty until
volume servers reconnect and send heartbeats. During this warmup window
(3 heartbeat intervals = 15 seconds), volume lookups that fail now
return 503 Service Unavailable (HTTP) or gRPC Unavailable instead of
404 Not Found, signaling clients to retry with other masters.

* master: skip warmup 503 on fresh start and single-master setups

- Check MaxVolumeId > 0 to distinguish restart from fresh start
  (MaxVolumeId is Raft-persisted, so 0 means no prior data)
- Check peer count > 1 so single-master deployments aren't affected
  (no point suggesting "retry with other masters" if there are none)

* master: address review feedback and block assigns during warmup

- Protect LastLeaderChangeTime with dedicated mutex (fix data race)
- Extract warmup multiplier as WarmupPulseMultiplier constant
- Derive Retry-After header from pulse config instead of hardcoding
- Only trigger warmup 503 for "not found" errors, not parse errors
- Return nil response (not partial) on gRPC Unavailable
- Add doc comments to IsWarmingUp, getter/setter, WarmupDuration
- Block volume assign requests (HTTP and gRPC) during warmup,
  since the topology is incomplete and assignments would be unreliable
- Skip warmup behavior for single-master setups (no peers to retry)

* master: apply warmup to all setups, skip only on fresh start

Single-master restarts still have an empty topology until heartbeats
arrive, so warmup protection should apply there too. The only case
to skip is a fresh cluster start (MaxVolumeId == 0), which already
has no volumes to look up.

- Remove GetMasterCount() > 1 guard from all warmup checks
- Remove now-unused GetMasterCount helper
- Update error messages to "topology is still loading" (not
  "retry with other masters" which doesn't apply to single-master)

* master: add client-side retry on Unavailable for lookup and assign

The server-side 503/Unavailable during warmup needs client cooperation.
Previously, LookupVolumeIds and Assign would immediately propagate the
error without retry.

Now both paths retry with exponential backoff (1s -> 1.5s -> ... up to
6s) when receiving Unavailable, respecting context cancellation. This
covers the warmup window where the master's topology is still loading
after a restart or leader change.

* master: seed warmup timestamp in legacy raft path at setup

The legacy raft path only set lastLeaderChangeTime inside the event
listener callback, which could fire after IsLeader() was already
observed as true in SetRaftServer. Seed the timestamp at setup time
(matching the hashicorp path) so IsWarmingUp() is active immediately.

* master: fix assign retry loop to cover full warmup window

The retry loop used waitTime <= maxWaitTime as a stop condition,
causing it to give up after ~13s while warmup lasts 15s. Now cap
each individual sleep at maxWaitTime but keep retrying until the
context is cancelled.

* master: preserve gRPC status in lookup retry and fix retry window

Return the raw gRPC error instead of wrapping with fmt.Errorf so
status.FromError() can extract the status code. Use proper gRPC
status check (codes.Unavailable) instead of string matching. Also
cap individual sleep at maxWaitTime while retrying until ctx is done.

* master: use gRPC status code instead of string matching in assign retry

Use status.FromError/codes.Unavailable instead of brittle
strings.Contains for detecting retriable gRPC errors in the
assign retry loop.

* master: use remaining warmup duration for Retry-After header

Set Retry-After to the remaining warmup time instead of the full
warmup duration, so clients don't wait longer than necessary.

* master: reset ret.Replicas before populating from assign response

Clear Replicas slice before appending to prevent duplicate entries
when the assign response is retried or when alternative requests
are attempted.

* master: add unit tests for warmup retry behavior

Test that Assign() and LookupVolumeIds() retry on codes.Unavailable
and stop promptly when the context is cancelled.

* master: record leader change time before initialization work

Move SetLastLeaderChangeTime() to fire immediately when the leader
change event is received, before DoBarrier(), EnsureTopologyId(),
and updatePeers(), so the warmup clock starts at the true moment
of leadership transition.

* master: use topology warmup duration in volume growth wait loop

Replace hardcoded constants.VolumePulsePeriod * 2 with
topo.IsWarmingUp() and topo.WarmupDuration() so the growth wait
stays in sync with the configured warmup window. Remove unused
constants import.

* master: resolve master before creating RPC timeout context

Move GetMaster() call before context.WithTimeout() so master
resolution blocking doesn't consume the gRPC call timeout.

* master: use NotFound flag instead of string matching for volume lookup

Add a NotFound field to LookupResult and set it in findVolumeLocation
when a volume is genuinely missing. Update HTTP and gRPC warmup
checks to use this flag instead of strings.Contains on the error
message.

* master: bound assign retry loop to 30s for deadline-free contexts

Without a context deadline, the Unavailable retry loop could spin
forever. Add a maxRetryDuration of 30s so the loop gives up even
when no context deadline is set.

* master: strengthen assign retry cancellation test

Verify the retry loop actually retried (callCount > 1) and that
the returned error is context.DeadlineExceeded, not just any error.

* master: extract shared retry-with-backoff utility

Add util.RetryWithBackoff for context-aware, bounded retry with
exponential backoff. Refactor both Assign() and LookupVolumeIds()
to use it instead of duplicating the retry/sleep/backoff logic.

* master: cap waitTime in RetryWithBackoff to prevent unbounded growth

Cap the backoff waitTime at maxWaitTime so it doesn't grow
indefinitely in long-running retry scenarios.

* master: only return Unavailable during warmup when all lookups failed

For batched LookupVolume requests, return partial results when some
volumes are found. Only return codes.Unavailable when no volumes
were successfully resolved, so clients benefit from partial results
instead of retrying unnecessarily.

* master: set retriable error message in 503 response body

When returning 503 during warmup, replace the "not found" error
in the JSON body with "service warming up, please retry" so
clients don't treat it as a permanent error.

* master: guard empty master address in LookupVolumeIds

If GetMaster() returns empty (no master found or ctx cancelled),
return an appropriate error instead of dialing an empty address.
Returns ctx.Err() if context is done, otherwise codes.Unavailable
to trigger retry.

* master: add comprehensive tests for RetryWithBackoff

Test success after retries, non-retryable error handling, context
cancellation, and maxDuration cap with context.Background().

* master: enforce hard maxDuration bound in RetryWithBackoff

Use a deadline instead of elapsed-time check so the last sleep is
capped to remaining time. This prevents the total retry duration
from overshooting maxDuration by up to one full backoff interval.

* master: respect fresh-start bypass in RemainingWarmupDuration

Check IsWarmingUp() first (which returns false when MaxVolumeId==0)
so RemainingWarmupDuration returns 0 on fresh clusters.

* master: round up Retry-After seconds to avoid underestimating

Use math.Ceil so fractional remaining seconds (e.g. 1.9s) round
up to the next integer (2) instead of flooring down (1).

* master: tighten batch lookup warmup to all-NotFound only

Only return codes.Unavailable when every requested volume ID was
a transient not-found. Mixed cases with non-NotFound errors now
return the response with per-volume error details preserved.

* master: reduce retry log noise and fix timer leak

Lower per-attempt retry log from V(0) to V(1) to reduce noise
during warmup. Replace time.After with time.NewTimer to avoid
lingering timers when context is cancelled.

* master: add per-attempt timeout for assign RPC

Use a 10s per-attempt timeout so a single slow RPC can't consume
the entire 30s retry budget when ctx has no deadline.

* master: share single 30s retry deadline across assign request entries

The Assign() function iterates over primary and fallback requests,
previously giving each its own 30s RetryWithBackoff budget. With a
primary + fallback, the total could reach 60s. Compute one deadline
up front and pass the remaining budget to each RetryWithBackoff call
so the entire Assign() call stays within a single 30s cap.

* master: strengthen context-cancel test with DeadlineExceeded and retry assertions

Assert errors.Is(err, context.DeadlineExceeded) to verify the error
is specifically from the context deadline, and check callCount > 1
to prove retries actually occurred before cancellation. Mirrors the
pattern used in TestAssignStopsOnContextCancel.

* master: bound GetMaster with per-attempt timeout in LookupVolumeIds

GetMaster() calls WaitUntilConnected() which can block indefinitely
if no master is available. Previously it used the outer ctx, so a
slow master resolution could consume the entire RetryWithBackoff
budget in a single attempt. Move the per-attempt timeoutCtx creation
before the GetMaster call so both master resolution and the gRPC
LookupVolume RPC share one grpcTimeout-bounded attempt.

* master: use deadline-aware context for assign retry budget

The shared 30s deadline only limited RetryWithBackoff's internal
wall-clock tracking, but per-attempt contexts were still derived
from the original ctx and could run for up to 10s even when the
budget was nearly exhausted. Create a deadlineCtx from the computed
deadline and derive both RetryWithBackoff and per-attempt timeouts
from it so all operations honor the shared 30s cap.

* master: skip warmup gate for empty lookup requests

When VolumeOrFileIds is empty, notFoundCount == len(req.VolumeOrFileIds)
is 0 == 0 which is true, causing empty lookup batches during warmup to
return codes.Unavailable and be retried endlessly. Add a
len(req.VolumeOrFileIds) > 0 guard so empty requests pass through.

* master: validate request fields before warmup gate in Assign

Move Replication and Ttl parsing before the IsWarmingUp() check so
invalid inputs get a proper validation error instead of being masked
by codes.Unavailable during warmup. Pure syntactic validation does
not depend on topology state and should run first.

* master: check deadline and context before starting retry attempt

RetryWithBackoff only checked the deadline and context after an
attempt completed or during the sleep select. If the deadline
expired or context was canceled during sleep, the next iteration
would still call operation() before detecting it. Add pre-operation
checks so no new attempt starts after the budget is exhausted.

* master: always return ctx.Err() on context cancellation in RetryWithBackoff

When ctx.Err() is non-nil, the pre-operation check was returning
lastErr instead of ctx.Err(). This broke callers checking
errors.Is(err, context.DeadlineExceeded) and contradicted the
documented contract. Always return ctx.Err() so the cancellation
reason is properly surfaced.

* master: handle warmup errors in StreamAssign without killing the stream

StreamAssign was returning codes.Unavailable errors from Assign
directly, which terminates the gRPC stream and breaks pooled
connections. Instead, return transient errors as in-band error
responses so the stream survives warmup periods.

Also reset assignClient in doAssign on Send/Recv failures so a
broken stream doesn't leave the proxy permanently dead.

* master: wait for warmup before slot search in findAndGrow

findEmptySlotsForOneVolume was called before the warmup wait loop,
selecting slots from an incomplete topology. Move the warmup wait
before slot search so volume placement uses the fully warmed-up
topology with all servers registered.

* master: add Retry-After header to /dir/assign warmup response

The /dir/lookup handler already sets Retry-After during warmup but
/dir/assign did not, leaving HTTP clients without guidance on when
to retry. Add the same header using RemainingWarmupDuration().

* master: only seed warmup timestamp on leader at startup

SetLastLeaderChangeTime was called unconditionally for both leader
and follower nodes. Followers don't need warmup state, and the
leader change event listener handles real elections. Move the seed
into the IsLeader() block so only the startup leader gets warmup
initialized.

* master: preserve codes.Unavailable for StreamAssign warmup errors in doAssign

StreamAssign returns transient warmup errors as in-band
AssignResponse.Error messages. doAssign was converting these to plain
fmt.Errorf, losing the codes.Unavailable classification needed for
the caller's retry logic. Detect warmup error messages and wrap them
as status.Error(codes.Unavailable) so RetryWithBackoff can retry.

2026-03-08 16:05:45 -07:00

.github

ci: build _full and _large_disk_full images for arm64 (#8548 )

2026-03-07 13:55:37 -08:00

cmd

Move SQL engine and PostgreSQL server to their own binaries (#8417 )

2026-02-23 16:27:08 -08:00

docker

Disable master maintenance scripts when admin server runs (#8499 )

2026-03-04 00:40:40 -08:00

k8s/charts

Update helm for support on OpenShift to have data replication and replicas for master,filer and volume (#8543 )

2026-03-07 05:30:23 -08:00

note

Update Wiki images (#8069 )

2026-01-20 14:12:14 -08:00

other

mount: make metadata cache rebuilds snapshot-consistent (#8531 )

2026-03-07 09:19:40 -08:00

postgres-examples

Message Queue: Add sql querying (#7185 )

2025-09-09 01:01:03 -07:00

seaweedfs-rdma-sidecar

build(deps): bump bytes from 1.10.1 to 1.11.1 in /seaweedfs-rdma-sidecar/rdma-engine (#8195 )

2026-02-03 13:40:42 -08:00

snap

move to https://github.com/seaweedfs/seaweedfs

2022-07-29 00:17:28 -07:00

telemetry

Make EC detection context aware (#8449 )

2026-02-25 18:02:35 -08:00

test

s3/iam: reuse one request id per request (#8538 )

2026-03-06 15:22:39 -08:00

unmaintained

go fix

2026-02-20 18:42:00 -08:00

util

util: added gostd script

2019-04-30 03:23:20 +00:00

weed

master: return 503/Unavailable during topology warmup after leader change (#8529 )

2026-03-08 16:05:45 -07:00

.gitignore

Fix jwt error in admin UI (#8140 )

2026-01-27 17:27:02 -08:00

backers.md

chore: add nimbus web services to backers.md (#4769 )

2023-08-20 15:31:23 -07:00

CODE_OF_CONDUCT.md

add code of conduct (#4109 )

2023-01-05 11:01:22 -08:00

go.mod

build(deps): bump gocloud.dev/pubsub/natspubsub from 0.44.0 to 0.45.0 (#8487 )

2026-03-02 10:28:52 -08:00

go.sum

fix: port in SNI address when using domainName instead of IP for master (#8500 )

2026-03-04 07:05:45 -08:00

LICENSE

Update LICENSE, fix copyright license year (#6405 )

2025-01-01 01:55:42 -08:00

Makefile

Move SQL engine and PostgreSQL server to their own binaries (#8417 )

2026-02-23 16:27:08 -08:00

README.md

Remove AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY exports

2026-01-14 13:09:11 -08:00

SeaweedFS

SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome backers. If you'd like to grow SeaweedFS even stronger, please consider joining our sponsors on Patreon.

Your support will be really appreciated by me and other supporters!

Quick Start

Quick Start with weed mini

The easiest way to get started with SeaweedFS for development and testing:

Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip a single binary file weed or weed.exe.

Example:

# remove quarantine on macOS
# xattr -d com.apple.quarantine  ./weed

./weed mini -dir=/data

This single command starts a complete SeaweedFS setup with:

Master UI: http://localhost:9333
Volume Server: http://localhost:9340
Filer UI: http://localhost:8888
S3 Endpoint: http://localhost:8333
WebDAV: http://localhost:7333
Admin UI: http://localhost:23646

Perfect for development, testing, learning SeaweedFS, and single node deployments!

Quick Start for S3 API on Docker

docker run -p 8333:8333 chrislusf/seaweedfs server -s3

Quick Start with Single Binary

Download the latest binary from https://github.com/seaweedfs/seaweedfs/releases and unzip a single binary file weed or weed.exe. Or run go install github.com/seaweedfs/seaweedfs/weed@latest.
export AWS_ACCESS_KEY_ID=admin ; export AWS_SECRET_ACCESS_KEY=key as the admin credentials to access the object store.
Run weed server -dir=/some/data/dir -s3 to start one master, one volume server, one filer, and one S3 gateway. The difference with weed mini is that weed mini can auto configure based on the single host environment, while weed server requires manual configuration and are designed for production use.

Also, to increase capacity, just add more volume servers by running weed volume -dir="/some/data/dir2" -master="<master_host>:9333" -port=8081 locally, or on a different machine, or on thousands of machines. That is it!

Introduction

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:

to store billions of files!
to serve the files fast!

SeaweedFS started as a blob store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages volumes on volume servers, and these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (O(1), usually just one disk read operation).

There is only 40 bytes of disk storage overhead for each file's metadata. It is so simple with O(1) disk reads that you are welcome to challenge the performance with your actual use cases.

SeaweedFS started by implementing Facebook's Haystack design paper. Also, SeaweedFS implements erasure coding with ideas from f4: Facebook’s Warm BLOB Storage System, and has a lot of similarities with Facebook’s Tectonic Filesystem and Google's Colossus File System

On top of the blob store, optional Filer can support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, Sqlite, MemSql, TiDB, Etcd, CockroachDB, YDB, etc.

SeaweedFS can transparently integrate with the cloud. With hot data on local cluster, and warm data on the cloud with O(1) access time, SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. What's more, the cloud storage access API cost is minimized. Faster and cheaper than direct cloud storage!

System	File Metadata	File Content Read	POSIX	REST API	Optimized for large number of small files
SeaweedFS	lookup volume id, cacheable	O(1) disk seek		Yes	Yes
SeaweedFS Filer	Linearly Scalable, Customizable	O(1) disk seek	FUSE	Yes	Yes
GlusterFS	hashing		FUSE, NFS
Ceph	hashing + rules		FUSE	Yes
MooseFS	in memory		FUSE		No
MinIO	separate meta file for each file			Yes	No

SeaweedFS	comparable to Ceph	advantage
Master	MDS	simpler
Volume	OSD	optimized for small files
Filer	Ceph FS	linearly scalable, Customizable, O(1) or O(logN)

README.md Unescape Escape

SeaweedFS

Sponsor SeaweedFS via Patreon

Gold Sponsors

Table of Contents

Quick Start

Quick Start with weed mini

Quick Start for S3 API on Docker

Quick Start with Single Binary

Introduction

Features

Additional Blob Store Features

Filer Features

Kubernetes

Example: Using Seaweed Blob Store

Start Master Server

Start Volume Servers

Write A Blob

Save Blob Id

Read a Blob

Rack-Aware and Data Center-Aware Replication

Allocate Blob Key on Specific Data Center

Other Features

Blob Store Architecture

Master Server and Volume Server

Write and Read files

Saving memory

Tiered Storage to the cloud

SeaweedFS Filer

Compared to Other File Systems

Compared to HDFS

Compared to GlusterFS, Ceph

Compared to GlusterFS

Compared to MooseFS

Compared to Ceph

Compared to MinIO

Dev Plan

Installation Guide

Disk Related Topics

Hard Drive Performance

Solid State Disk

Benchmark

Run WARP and launch a mixed benchmark.

Enterprise

License

Stargazers over time

README.md