filer.sync: add exponential backoff on unexpected EOF during replication (#8557)

* filer.sync: add exponential backoff on unexpected EOF during replication

When the source volume server drops connections under high traffic,
filer.sync retries aggressively (every 1-6s), hammering the already
overloaded source. This adds a longer exponential backoff (10s to 2min)
specifically for "unexpected EOF" errors, reducing pressure on the
source while still retrying indefinitely until success.

Also adds more logging throughout the replication path:
- Log source URL and error at V(0) when ReadPart or io.ReadAll fails
- Log content-length and byte counts at V(4) on success
- Log backoff duration in retry messages

Fixes #8542

* filer.sync: extract backoff helper and fix 2-minute cap

- Extract nextEofBackoff() and isEofError() helpers to deduplicate
  the backoff logic between fetchAndWrite and uploadManifestChunk
- Fix the cap: previously an 80s backoff would pass the < 2min check
  and double to 160s, exceeding the cap. Now doubles first, then
  clamps to 2min.

* filer.sync: log source URL instead of empty upload URL on read errors

UploadUrl is not populated until after the reader is consumed, so the
V(0) and V(4) logs were printing an empty string. Add SourceUrl field
to UploadOption and populate it from the HTTP response in fetchAndWrite.

* filer.sync: guard isEofError against nil error

* filer.sync: use errors.Is for EOF detection, fix log wording

- Replace broad substring matching ("read input", "unexpected EOF")
  with errors.Is(err, io.ErrUnexpectedEOF) and errors.Is(err, io.EOF)
  so only actual EOF errors trigger the longer backoff
- Fix awkward log phrasing: "interrupted replicate" → "interrupted
  while replicating"

* filer.sync: remove EOF backoff from uploadManifestChunk

uploadManifestChunk reads from an in-memory bytes.Reader, so any EOF
errors there are from the destination side, not a broken source stream.
The long source-oriented backoff is inappropriate; let RetryUntil
handle destination retries at its normal cadence.

---------

Co-authored-by: Copilot <copilot@github.com>
Chris Lu
2026-03-08 14:33:37 -07:00
committed by GitHub
parent ba66411337
commit 0647f66bb5
3 changed files with 53 additions and 4 deletions


@@ -38,6 +38,7 @@ type UploadOption struct {
	RetryForever bool
	Md5          string
	BytesBuffer  *bytes.Buffer
	SourceUrl    string // optional: for logging when reading from a remote source
}
type UploadResult struct {
@@ -135,9 +136,11 @@ func (uploader *Uploader) UploadWithRetry(filerClient filer_pb.FilerClient, assi
	} else {
		data, err = io.ReadAll(reader)
		if err != nil {
			glog.V(0).Infof("upload read input %s: %v", uploadOption.SourceUrl, err)
			err = fmt.Errorf("read input: %w", err)
			return
		}
		glog.V(4).Infof("upload read %d bytes from %s", len(data), uploadOption.SourceUrl)
	}
	doUploadFunc := func() error {