Files
seaweedFS/weed/mq
Chris Lu 7f3f61ea28 fix: resolve Kafka gateway response deadlocks causing Sarama client hangs (#8762)
* fix: resolve Kafka gateway response deadlocks causing Sarama client hangs

Fix three bugs in the Kafka protocol handler that caused sequential
clients (notably Sarama) to hang during E2E tests:

1. Race condition in correlation queue ordering: the correlation ID was
   added to the response ordering queue AFTER sending the request to
   the processing channel. A fast processor (e.g. ApiVersions) could
   finish and send its response before the ID was in the queue, causing
   the response writer to miss it — permanently deadlocking the
   connection. Now the ID is added BEFORE the channel send, with error
   response injection on send failure.

2. Silent error response drops: when processRequestSync returned an
   error, the response writer logged it but never sent anything back to
   the client. The client would block forever waiting for bytes that
   never arrived. Now sends a Kafka UNKNOWN_SERVER_ERROR response.

3. Produce V0/V1 missing timeout_ms parsing: the handler skipped the
   4-byte timeout field, reading it as topicsCount instead. This caused
   incorrect parsing of the entire produce request for V0/V1 clients.

* fix: API-versioned error responses, unsupported-version queue fix, V0V1 header alignment

1. errors.go — BuildAPIErrorResponse: emits a minimal-but-valid error
   body whose layout matches the schema the client expects for each API
   key and version (throttle_time position, array fields, etc.).  The
   old 2-byte generic body corrupted the protocol stream for APIs whose
   response begins with throttle_time_ms or an array.

2. handler.go — unsupported-version path: the correlationID was never
   added to correlationQueue before sending to responseChan, so the
   response writer could never match it and the client hung.  Now
   appends the ID under correlationQueueMu before the send.

3. produce.go — handleProduceV0V1: requestBody is already post-header
   (HandleConn strips client_id).  The handler was erroneously parsing
   acks bytes as a client_id length, misaligning all subsequent field
   reads.  Removed the client_id parsing; offset now starts at 0 with
   acks(2) + timeout_ms(4) + topicsCount(4), matching handleProduceV2Plus.

* fix: free pooled message buffer per-iteration instead of deferring

The read loop allocated messageBuf via mem.Allocate and deferred
mem.Free.  Since the defer only runs when HandleConn returns, pool
buffers accumulated for the entire connection lifetime — one per
request.  Worse, the deferred frees ran in LIFO order before
wg.Wait(), so processing goroutines could read from already-freed
pool buffers.

Now: read into a pooled buffer, immediately copy to Go-managed
memory, and return the pool buffer.  messageBuf is a regular slice
safe for async goroutine access with no defer accumulation.

* fix: cancel context before wg.Wait and on worker response-send timeout

Two related issues:

1. Cleanup defer ordering deadlock: defers run LIFO — the cleanup defer
   (close channels, wg.Wait) ran before the cancel() defer.  The
   response writer is in the WaitGroup and exits only on ctx.Done() or
   responseChan close, but both signals came after wg.Wait().  Deadlock
   on every normal connection close (EOF, read error, queue-full).
   Fix: call cancel() at the start of the cleanup defer, before
   wg.Wait().

2. Worker 5s response-send timeout: when the timeout fired, the
   response was silently dropped but the correlationID remained in the
   ordered queue.  The response writer could never advance past it,
   stalling all subsequent responses permanently.
   Fix: call cancel() to tear down the connection — if we cannot
   deliver a response in 5s the connection is irrecoverable.

* chore: remove empty no-op ListOffsets conditional

The `if apiKey == 2 {}` block had no body — leftover debug code.
ListOffsets routing is handled by isDataPlaneAPI (returns false,
sending it to the control channel). No behavior change.
2026-03-24 13:17:25 -07:00
..
2025-10-27 23:04:55 -07:00
2022-07-31 13:23:44 -07:00

SeaweedMQ Message Queue on SeaweedFS (WIP, not ready)

What are the use cases it is designed for?

Message queues are like water pipes. Messages flow in the pipes to their destinations.

However, what if a flood comes? Of course, you can increase the number of partitions, add more brokers, restart, and watch the traffic level closely.

Sometimes the flood is expected. For example, backfill some old data in batch, and switch to online messages. You may want to ensure enough brokers to handle the data and reduce them later to cut cost.

SeaweedMQ is designed for use cases that need to:

  • Receive and save large number of messages.
  • Handle spike traffic automatically.

What is special about SeaweedMQ?

  • Separate computation and storage nodes to scale independently.
    • Unlimited storage space by adding volume servers.
    • Unlimited message brokers to handle incoming messages.
    • Offline messages can be operated as normal files.
  • Scale up and down with auto split and merge message topics.
    • Topics can automatically split into segments when traffic increases, and vice verse.
  • Pass messages by reference instead of copying.
    • Clients can optionally upload the messages first and just submit the references.
    • Drastically reduce the broker load.
  • Stateless brokers
    • All brokers are equal. One broker is dynamically picked as the leader.
    • Add brokers at any time.
    • Allow rolling restart brokers or remove brokers at a pace.

Design

How it works?

Brokers are just computation nodes without storage. When a broker starts, it reports itself to masters. Among all the brokers, one of them will be selected as the leader by the masters.

A topic needs to define its partition key on its messages.

Messages for a topic are divided into segments. One segment can cover a range of partitions. A segment can be split into 2 segments, or 2 neighboring segments can be merged back to one segment.

During write time, the client will ask the broker leader for a few brokers to process the segment.

The broker leader will check whether the segment already has assigned the brokers. If not, select a few brokers based on their loads, save the selection into filer, and tell the client.

The client will write the messages for this segment to the selected brokers.

Failover

The broker leader does not contain any state. If it fails, the masters will select a different broker.

For a segment, if any one of the selected brokers is down, the remaining brokers should try to write received messages to the filer, and close the segment to the clients.

Then the clients should start a new segment. The masters should assign other healthy brokers to handle the new segment.

So any brokers can go down without losing data.

Auto Split or Merge

(The idea is learned from Pravega.)

The brokers should report its traffic load to the broker leader periodically.

If any segment has too much load, the broker leader will ask the brokers to tell the client to close current one and create two new segments.

If 2 neighboring segments have the combined load below average load per segment, the broker leader will ask the brokers to tell the client to close this 2 segments and create a new segment.