Commit Graph

56 Commits

Author SHA1 Message Date
Chris Lu
597d383ca4 filer.sync: fix data races in ChunkTransferStatus
Add sync.RWMutex to ChunkTransferStatus and lock around all field
mutations in fetchAndWrite. ActiveTransfers now returns value copies
under RLock so callers get immutable snapshots.
2026-04-02 13:04:21 -07:00
Chris Lu
b5cdd71600 filer.sync: include last error in stall diagnostics 2026-04-02 12:18:56 -07:00
Chris Lu
2d4ea8c665 filer.sync: show active chunk transfers when sync progress stalls
When the sync watermark is not advancing, print each in-progress chunk
transfer with its file path, bytes received so far, and current status
(downloading, uploading, or waiting with backoff duration). This helps
diagnose which files are blocking progress during replication.

Closes #8542
2026-04-02 12:14:25 -07:00
Chris Lu
b665c329bc fix(replication): resume partial chunk reads on EOF instead of re-downloading (#8607)
* fix(replication): resume partial chunk reads on EOF instead of re-downloading

When replicating chunks and the source connection drops mid-transfer,
accumulate the bytes already received and retry with a Range header
to fetch only the remaining bytes. This avoids re-downloading
potentially large chunks from scratch on each retry, reducing load
on busy source servers and speeding up recovery.

* test(replication): add tests for downloadWithRange including gzip partial reads

Tests cover:
- No offset (no Range header sent)
- With offset (Range header verified)
- Content-Disposition filename extraction
- Partial read + resume: server drops connection mid-transfer, client
  resumes with Range from the offset of received bytes
- Gzip partial read + resume: first response is gzip-encoded (Go auto-
  decompresses), connection drops, resume request gets decompressed data
  (Go doesn't add Accept-Encoding when Range is set, so the server
  decompresses), combined bytes match original

* fix(replication): address PR review comments

- Consolidate downloadWithRange into DownloadFile with optional offset
  parameter (variadic), eliminating code duplication (DRY)
- Validate HTTP response status: require 206 + correct Content-Range
  when offset > 0, reject when server ignores Range header
- Use if/else for fullData assignment for clarity
- Add test for rejected Range (server returns 200 instead of 206)

* refactor(replication): remove unused ReplicationSource interface

The interface was never referenced and its signature didn't match
the actual FilerSource.ReadPart method.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-11 22:38:22 -07:00
Chris Lu
0647f66bb5 filer.sync: add exponential backoff on unexpected EOF during replication (#8557)
* filer.sync: add exponential backoff on unexpected EOF during replication

When the source volume server drops connections under high traffic,
filer.sync retries aggressively (every 1-6s), hammering the already
overloaded source. This adds a longer exponential backoff (10s to 2min)
specifically for "unexpected EOF" errors, reducing pressure on the
source while still retrying indefinitely until success.

Also adds more logging throughout the replication path:
- Log source URL and error at V(0) when ReadPart or io.ReadAll fails
- Log content-length and byte counts at V(4) on success
- Log backoff duration in retry messages

Fixes #8542

* filer.sync: extract backoff helper and fix 2-minute cap

- Extract nextEofBackoff() and isEofError() helpers to deduplicate
  the backoff logic between fetchAndWrite and uploadManifestChunk
- Fix the cap: previously 80s would double to 160s and pass the
  < 2min check uncapped. Now doubles first, then clamps to 2min.

* filer.sync: log source URL instead of empty upload URL on read errors

UploadUrl is not populated until after the reader is consumed, so the
V(0) and V(4) logs were printing an empty string. Add SourceUrl field
to UploadOption and populate it from the HTTP response in fetchAndWrite.

* filer.sync: guard isEofError against nil error

* filer.sync: use errors.Is for EOF detection, fix log wording

- Replace broad substring matching ("read input", "unexpected EOF")
  with errors.Is(err, io.ErrUnexpectedEOF) and errors.Is(err, io.EOF)
  so only actual EOF errors trigger the longer backoff
- Fix awkward log phrasing: "interrupted replicate" → "interrupted
  while replicating"

* filer.sync: remove EOF backoff from uploadManifestChunk

uploadManifestChunk reads from an in-memory bytes.Reader, so any EOF
errors there are from the destination side, not a broken source stream.
The long source-oriented backoff is inappropriate; let RetryUntil
handle destination retries at its normal cadence.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-08 14:33:37 -07:00
Chris Lu
7fcbffed7f filer.sync: support manifest chunks (#8299)
* filer.sync support manifest chunks

* filersink: address manifest sync review feedback
2026-02-10 20:18:35 -08:00
Chris Lu
be0379f6fd Fix filer.sync retry on stale chunk (#8298)
* Fix filer.sync stale chunk uploads

* Tweak filersink stale logging
2026-02-10 19:06:35 -08:00
Chris Lu
cc2edfaf68 fix: enable RetryForever for active-active cluster sync to prevent out-of-sync (#7840)
Fixes #7230

When a cluster goes down during file replication, the chunk upload process
would fail after a limited number of retries. Once the remote cluster came
back online, those failed uploads were never retried, leaving the clusters
out-of-sync.

This change enables the RetryForever flag in the UploadOption when
replicating chunks between filers. This ensures that upload operations
will keep retrying indefinitely, and once the remote cluster comes back
online, the pending uploads will automatically succeed.

Users no longer need to manually run fs.meta.save and fs.meta.load as
a workaround for out-of-sync clusters.
2025-12-22 00:58:23 -08:00
Chris Lu
69553e5ba6 convert error fromating to %w everywhere (#6995) 2025-07-16 23:39:27 -07:00
vadimartynov
86d92a42b4 Added tls for http clients (#5766)
* Added global http client

* Added Do func for global http client

* Changed the code to use the global http client

* Fix http client in volume uploader

* Fixed pkg name

* Fixed http util funcs

* Fixed http client for bench_filer_upload

* Fixed http client for stress_filer_upload

* Fixed http client for filer_server_handlers_proxy

* Fixed http client for command_fs_merge_volumes

* Fixed http client for command_fs_merge_volumes and command_volume_fsck

* Fixed http client for s3api_server

* Added init global client for main funcs

* Rename global_client to client

* Changed:
- fixed NewHttpClient;
- added CheckIsHttpsClientEnabled func
- updated security.toml in scaffold

* Reduce the visibility of some functions in the util/http/client pkg

* Added the loadSecurityConfig function

* Use util.LoadSecurityConfiguration() in NewHttpClient func
2024-07-16 23:14:09 -07:00
chrislu
81fdf3651b grpc connection to filer add sw-client-id header 2023-01-20 01:48:12 -08:00
chrislu
6ede19e825 add a simple file replication progress bar 2022-12-20 19:47:21 -08:00
chrislu
6c7fe40305 filer sink retries reading file chunks, skipping missing chunks
if the file chunk is not available during replication time, the file is skipped
2022-12-19 11:31:58 -08:00
chrislu
ea2637734a refactor filer proto chunk variable from mtime to modified_ts_ns 2022-10-28 12:53:19 -07:00
chrislu
ea271600ec fix parameters 2022-10-04 12:36:05 -07:00
chrislu
0452ae6a6c filer.sync: limit concurrency when fetching file chunks
fix https://github.com/seaweedfs/seaweedfs/issues/3787
2022-10-04 11:35:07 -07:00
askeipx
2e78a522ab remove old raft servers if they don't answer to pings for too long (#3398)
* remove old raft servers if they don't answer to pings for too long

add ping durations as options

rename ping fields

fix some todos

get masters through masterclient

raft remove server from leader

use raft servers to ping them

CheckMastersAlive for hashicorp raft only

* prepare blocking ping

* pass waitForReady as param

* pass waitForReady through all functions

* waitForReady works

* refactor

* remove unneeded params

* rollback unneeded changes

* fix
2022-08-23 23:18:21 -07:00
chrislu
4081d50607 filer sink: retryable data chunk uploading 2022-08-20 19:09:15 -07:00
Konstantin Lebedev
4d08393b7c filer prefer volume server in same data center (#3405)
* initial prefer same data center
https://github.com/seaweedfs/seaweedfs/issues/3404

* GetDataCenter

* prefer same data center for ReplicationSource

* GetDataCenterId

* remove glog
2022-08-04 17:35:00 -07:00
chrislu
26dbc6c905 move to https://github.com/seaweedfs/seaweedfs 2022-07-29 00:17:28 -07:00
chrislu
9f9ef1340c use streaming mode for long poll grpc calls
streaming mode would create separate grpc connections for each call.
this is to ensure the long poll connections are properly closed.
2021-12-26 00:15:03 -08:00
Chris Lu
e5fc35ed0c change server address from string to a type 2021-09-12 22:47:52 -07:00
Chris Lu
6923af7280 refactoring 2021-09-06 16:20:49 -07:00
Chris Lu
8f8738867f add retry to assign volume
fix https://github.com/chrislusf/seaweedfs/issues/2056
2021-05-07 07:29:26 -07:00
Chris Lu
821c46edf1 Merge branch 'master' into support_ssd_volume 2021-02-09 11:37:07 -08:00
Chris Lu
990fa69bfe add back AdjustedUrl() related code 2021-01-28 14:36:29 -08:00
Chris Lu
00707ec00f mount: outsideContainerClusterMode proxy through filer
Running mount outside of the cluster would not need to expose all the volume servers to outside of the cluster. The chunk read and write will go through the filer.
2021-01-24 19:01:58 -08:00
Chris Lu
6ca10725b8 Revert "mount: when outside cluster network, use filer as proxy to access volume servers"
This reverts commit 096e088d7b.
2021-01-24 03:15:19 -08:00
Chris Lu
096e088d7b mount: when outside cluster network, use filer as proxy to access volume servers 2021-01-24 01:41:38 -08:00
Chris Lu
80b8692688 filer.sync: replicate outside of either cluster, only need to see filers 2021-01-24 00:01:44 -08:00
Chris Lu
1bf22c0b5b go fmt 2020-12-16 09:14:05 -08:00
Chris Lu
0d2ec832e2 rename from volumeType to diskType 2020-12-13 11:59:32 -08:00
Chris Lu
e9cd798bd3 adding volume type 2020-12-13 00:58:58 -08:00
Chris Lu
e219c57849 passing full path when assign volume locations 2020-10-25 15:46:29 -07:00
Chris Lu
f375b93aef renaming 2020-10-25 15:32:43 -07:00
Chris Lu
723ae11db4 refactoring in order to adjust volume server url later 2020-10-11 20:15:10 -07:00
Chris Lu
4fc0bd1a81 return http response directly 2020-09-09 03:53:09 -07:00
Chris Lu
97239ce6f1 rename filechunk is_gzipped to is_compressed 2020-06-20 08:15:49 -07:00
Chris Lu
ed3cf811f5 refactoring 2020-04-29 13:26:02 -07:00
Chris Lu
eedd33dda3 refactoring 2020-03-28 13:41:58 -07:00
Chris Lu
b97768c51c refactoring 2020-03-23 01:30:22 -07:00
Chris Lu
bec6ec7db6 go fmt 2020-03-17 10:01:55 -07:00
HongyanShen
81610ed006 fix: #1226 2020-03-11 14:37:14 +08:00
Chris Lu
2e3f6ad3a9 filer: remember content is gzipped or not 2020-03-08 21:39:33 -07:00
Chris Lu
13e215ee5c filer: option to encrypt data on volume server 2020-03-06 00:49:47 -08:00
Chris Lu
f90c43635d refactoring 2020-03-04 00:39:47 -08:00
Chris Lu
892e726eb9 avoid reusing context object
fix https://github.com/chrislusf/seaweedfs/issues/1182
2020-02-25 21:50:12 -08:00
Chris Lu
0841bedb15 move filer assign volume grpc errror to response 2020-02-25 17:15:09 -08:00
Chris Lu
6ab7368ef2 filer: dynamically create bucket under /buckets folder 2020-02-24 22:28:45 -08:00
Chris Lu
72a64a5cf8 use the same context object in order to retry 2020-01-26 14:42:11 -08:00