71 Commits

Author SHA1 Message Date
Chris Lu
5fa5507234 Add Prometheus metric to count upload errors (#8788)
Add Prometheus metric to count upload errors (#8775)

Add SeaweedFS_upload_error_total counter labeled by HTTP status code,
so operators can alert on write/replication failures. Code "0" indicates
a transport error (no HTTP response received).

Also add an "Upload Errors" panel to the Grafana dashboard.
2026-03-26 16:58:05 -07:00
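
Note on 5fa5507234: the labeling rule described in this commit (one counter per HTTP status code, with "0" reserved for transport errors) might look roughly like the sketch below. It uses only the standard library as a stand-in for a Prometheus counter vector. The names `uploadErrorCounter`, `codeLabel`, and `errNoResponse` are hypothetical, and the rule that any status of 400 or above counts as an upload error is an assumption, not something the commit message states.

```go
package main

import (
	"errors"
	"strconv"
	"sync"
)

// errNoResponse is a hypothetical sentinel for a transport failure,
// i.e. the request never produced an HTTP response.
var errNoResponse = errors.New("no HTTP response received")

// uploadErrorCounter is a stdlib stand-in for a counter vector
// with a single "code" label.
type uploadErrorCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newUploadErrorCounter() *uploadErrorCounter {
	return &uploadErrorCounter{counts: make(map[string]uint64)}
}

// codeLabel returns the label value for a failed upload, or "" when the
// upload succeeded. Assumption: any status >= 400 is an upload error.
func codeLabel(statusCode int, err error) string {
	if err != nil {
		return "0" // transport error: no status code is available
	}
	if statusCode >= 400 {
		return strconv.Itoa(statusCode)
	}
	return ""
}

// observe increments the counter for the outcome of one upload attempt.
func (c *uploadErrorCounter) observe(statusCode int, err error) {
	label := codeLabel(statusCode, err)
	if label == "" {
		return
	}
	c.mu.Lock()
	c.counts[label]++
	c.mu.Unlock()
}

// value reads the current count for one label value.
func (c *uploadErrorCounter) value(label string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[label]
}
```

Keeping transport failures under their own label value lets an alert distinguish "server rejected the write" from "server was unreachable" without a second metric.
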
Chris Lu
0a5c5ed4ce Persist S3 bucket counter metrics across idle periods (#8595)
* Stop deleting counter metrics during bucket TTL cleanup

Counter metrics (traffic bytes, request counts, object counts) are
monotonically increasing by design. Deleting them after 10 minutes of
bucket inactivity causes them to vanish from /metrics output and reset
to zero when traffic resumes, breaking Prometheus rate()/increase()
queries and making historical traffic reporting impossible.

Only delete gauges and histograms in the TTL cleanup loop, as these
represent current state and are safely re-populated on next activity.

Fixes https://github.com/seaweedfs/seaweedfs/issues/8521

* Clean up all bucket metrics on bucket deletion

Add DeleteBucketMetrics() to delete all metrics (including counters)
for a bucket when it is explicitly deleted. This prevents unbounded
label cardinality from accumulating for buckets that no longer exist.

Called from DeleteBucketHandler after successful bucket deletion.

* Reduce mutex scope in bucket metrics TTL sweep

Collect expired bucket names under the lock, then release before
calling DeletePartialMatch on Prometheus metrics. This prevents
RecordBucketActiveTime from blocking during the expensive cleanup.
2026-03-10 19:00:40 -07:00
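
Note on 0a5c5ed4ce: the TTL sweep described above (collect expired bucket names under the lock, release it, then run the expensive deletion) can be sketched as follows. This is a minimal illustration, not the project's code: `bucketActivity`, `recordActive`, `sweep`, and the `deleteGauges` callback are hypothetical names, and the callback stands in for the per-bucket deletion of gauge and histogram series. Counter series are simply never touched by the sweep.

```go
package main

import "sync"

// bucketActivity tracks the last activity time of each bucket.
type bucketActivity struct {
	mu           sync.Mutex
	lastActiveNs map[string]int64
	// deleteGauges is a hypothetical hook that removes the gauge and
	// histogram series of one bucket. Counters are left in place.
	deleteGauges func(bucket string)
}

func newBucketActivity(deleteGauges func(bucket string)) *bucketActivity {
	return &bucketActivity{
		lastActiveNs: make(map[string]int64),
		deleteGauges: deleteGauges,
	}
}

// recordActive is called on the request path, so it should never wait
// behind a slow cleanup.
func (a *bucketActivity) recordActive(bucket string, nowNs int64) {
	a.mu.Lock()
	a.lastActiveNs[bucket] = nowNs
	a.mu.Unlock()
}

// sweep removes buckets idle for longer than ttlNs and returns their names.
func (a *bucketActivity) sweep(nowNs, ttlNs int64) []string {
	// Phase 1: only cheap map work happens while the lock is held.
	a.mu.Lock()
	var expired []string
	for bucket, ts := range a.lastActiveNs {
		if nowNs-ts > ttlNs {
			expired = append(expired, bucket)
			delete(a.lastActiveNs, bucket)
		}
	}
	a.mu.Unlock()

	// Phase 2: the expensive deletion runs without the lock.
	for _, bucket := range expired {
		a.deleteGauges(bucket)
	}
	return expired
}
```

Because the lock is released before phase 2, a `recordActive` call that arrives during cleanup proceeds immediately instead of waiting for every series to be deleted.
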
Chris Lu
0a2dac1e56 Reduce mutex scope in bucket metrics TTL sweep
Collect expired bucket names under the lock, then release before
calling DeletePartialMatch on Prometheus metrics. This prevents
RecordBucketActiveTime from blocking during the expensive cleanup.
2026-03-10 18:43:35 -07:00
Chris Lu
f5c666052e feat: add S3 bucket size and object count metrics (#7776)
* feat: add S3 bucket size and object count metrics

Adds periodic collection of bucket size metrics:
- SeaweedFS_s3_bucket_size_bytes: logical size (deduplicated across replicas)
- SeaweedFS_s3_bucket_physical_size_bytes: physical size (including replicas)
- SeaweedFS_s3_bucket_object_count: object count (deduplicated)

Collection runs every 1 minute via background goroutine that queries
filer Statistics RPC for each bucket's collection.

Also adds Grafana dashboard panels for:
- S3 Bucket Size (logical vs physical)
- S3 Bucket Object Count

* address PR comments: fix bucket size metrics collection

1. Fix collectCollectionInfoFromMaster to use master VolumeList API
   - Now properly queries master for topology info
   - Uses WithMasterClient to get volume list from master
   - Correctly calculates logical vs physical size based on replication

2. Return error when filerClient is nil to trigger fallback
   - Changed from 'return nil, nil' to 'return nil, error'
   - Ensures fallback to filer stats is properly triggered

3. Implement pagination in listBucketNames
   - Added listBucketPageSize constant (1000)
   - Uses StartFromFileName for pagination
   - Continues fetching until fewer entries than limit returned

4. Handle NewReplicaPlacementFromByte error and prevent division by zero
   - Check error return from NewReplicaPlacementFromByte
   - Default to 1 copy if error occurs
   - Add explicit check for copyCount == 0

* simplify bucket size metrics: remove filer fallback, align with quota enforcement

- Remove fallback to filer Statistics RPC
- Use only master topology for collection info (same as s3.bucket.quota.enforce)
- Updated comments to clarify this runs the same collection logic as quota enforcement
- Simplified code by removing collectBucketSizeFromFilerStats

* use s3a.option.Masters directly instead of querying filer

* address PR comments: fix dashboard overlaps and improve metrics collection

Grafana dashboard fixes:
- Fix overlapping panels 55 and 59 in grafana_seaweedfs.json (moved 59 to y=30)
- Fix grid collision in k8s dashboard (moved panel 72 to y=48)
- Aggregate bucket metrics with max() by (bucket) for multi-instance S3 gateways

Go code improvements:
- Add graceful shutdown support via context cancellation
- Use ticker instead of time.Sleep for better shutdown responsiveness
- Distinguish EOF from actual errors in stream handling

* improve bucket size metrics: multi-master failover and proper error handling

- Initial delay now respects context cancellation using select with time.After
- Use WithOneOfGrpcMasterClients for multi-master failover instead of hardcoding Masters[0]
- Properly propagate stream errors instead of just logging them (EOF vs real errors)
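
A collection loop with these properties (an initial delay that honors cancellation, then a ticker-driven loop that exits on shutdown) might look roughly like this. `runPeriodically` and `collectUntil` are hypothetical names; the sketch assumes `interval` is positive, since `time.NewTicker` panics otherwise.

```go
package main

import (
	"context"
	"time"
)

// runPeriodically waits initialDelay, then calls collect once per interval
// until ctx is cancelled. interval must be > 0.
func runPeriodically(ctx context.Context, initialDelay, interval time.Duration, collect func(context.Context)) {
	// The initial delay is interruptible: shutdown does not have to wait it out.
	select {
	case <-ctx.Done():
		return
	case <-time.After(initialDelay):
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Re-check before each run so a tick that races with
		// cancellation cannot trigger one extra collection.
		if ctx.Err() != nil {
			return
		}
		collect(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// collectUntil is a small usage example: it runs the loop with a short
// interval, cancels after n collections, and returns how many ran.
func collectUntil(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	runs := 0
	runPeriodically(ctx, 0, time.Millisecond, func(context.Context) {
		runs++
		if runs >= n {
			cancel()
		}
	})
	return runs
}
```
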

* improve bucket size metrics: distributed lock and volume ID deduplication

- Add distributed lock (LiveLock) so only one S3 instance collects metrics at a time
- Add IsLocked() method to LiveLock for checking lock status
- Fix deduplication: use volume ID tracking instead of dividing by copyCount
  - Previous approach gave wrong results if replicas were missing
  - Now tracks seen volume IDs and counts each volume only once
- Physical size still includes all replicas for accurate disk usage reporting
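
The deduplication described above can be sketched as follows: physical size sums every replica, while logical size and object count include each volume ID only once. The `replica` and `bucketUsage` types and the `summarize` function are hypothetical, and the sketch assumes all replicas of a volume report the same size and file count.

```go
package main

// replica is one copy of a volume as reported by a volume server.
type replica struct {
	VolumeID  uint32
	Size      uint64
	FileCount uint64
}

// bucketUsage holds the three per-bucket values exposed as metrics.
type bucketUsage struct {
	LogicalSize  uint64 // each volume counted once
	PhysicalSize uint64 // every replica counted
	ObjectCount  uint64 // each volume counted once
}

// summarize aggregates replicas into logical and physical totals.
func summarize(replicas []replica) bucketUsage {
	seen := make(map[uint32]struct{})
	var u bucketUsage
	for _, r := range replicas {
		u.PhysicalSize += r.Size
		if _, dup := seen[r.VolumeID]; dup {
			continue // already counted toward the logical totals
		}
		seen[r.VolumeID] = struct{}{}
		u.LogicalSize += r.Size
		u.ObjectCount += r.FileCount
	}
	return u
}
```

With volume 1 present twice (100 bytes each) and volume 2 present once (50 bytes, its second replica missing), this yields a logical size of 150 and a physical size of 250. Dividing the physical total by a copy count of 2 would have reported 125.
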

* rename lock to s3.leader

* simplify: remove StartBucketSizeMetricsCollection wrapper function

* fix data race: use atomic operations for LiveLock.isLocked field

- Change isLocked from bool to int32
- Use atomic.LoadInt32/StoreInt32 for all reads/writes
- Sync shared isLocked field in StartLongLivedLock goroutine
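
The bool-to-int32 change can be illustrated with a minimal sketch. The type `liveLock` and the `setLocked` helper here are hypothetical stand-ins, not the project's actual `LiveLock` definition; only `atomic.LoadInt32` and `atomic.StoreInt32` are taken from the commit message.

```go
package main

import "sync/atomic"

// liveLock holds the lock state as an int32 so that concurrent
// readers and writers can use atomic operations (1 = locked, 0 = not).
type liveLock struct {
	isLocked int32
}

// setLocked is written by the goroutine that renews the lock.
func (l *liveLock) setLocked(locked bool) {
	var v int32
	if locked {
		v = 1
	}
	atomic.StoreInt32(&l.isLocked, v)
}

// IsLocked may be called from any goroutine without a data race.
func (l *liveLock) IsLocked() bool {
	return atomic.LoadInt32(&l.isLocked) == 1
}
```
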

* add nil check for topology info to prevent panic

* fix bucket metrics: use Ticker for consistent intervals, fix pagination logic

- Use time.Ticker instead of time.After for consistent interval execution
- Fix pagination: count all entries (not just directories) for proper termination
- Update lastFileName for all entries to prevent pagination issues
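
The pagination fix can be sketched as below. The cursor advances on every entry and the termination check counts every entry, while only directories are collected as buckets. The `entry` type and the `listPage` callback are hypothetical stand-ins for the filer listing call, and the fallback page size of 1000 mirrors the `listBucketPageSize` value mentioned earlier in this commit.

```go
package main

// entry is a simplified directory entry.
type entry struct {
	Name        string
	IsDirectory bool
}

// listPage is a hypothetical stand-in for the listing RPC: it returns up
// to limit entries whose names sort strictly after startFrom.
type listPage func(startFrom string, limit int) ([]entry, error)

// listBucketNames pages through all entries and returns the directories.
func listBucketNames(list listPage, pageSize int) ([]string, error) {
	if pageSize <= 0 {
		pageSize = 1000 // assumed default page size
	}
	var buckets []string
	lastFileName := ""
	for {
		entries, err := list(lastFileName, pageSize)
		if err != nil {
			return nil, err
		}
		for _, e := range entries {
			// Advance the cursor on every entry, not only directories,
			// so the next page never restarts at a skipped file.
			lastFileName = e.Name
			if e.IsDirectory {
				buckets = append(buckets, e.Name)
			}
		}
		// Compare against all entries returned: a full page of mixed
		// files and directories must not end the loop early.
		if len(entries) < pageSize {
			return buckets, nil
		}
	}
}
```
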

* address PR comments: remove redundant atomic store, propagate context

- Remove redundant atomic.StoreInt32 in StartLongLivedLock (AttemptToLock already sets it)
- Propagate context through metrics collection for proper cancellation on shutdown
  - collectAndUpdateBucketSizeMetrics now accepts ctx
  - collectCollectionInfoFromMaster uses ctx for VolumeList RPC
  - listBucketNames uses ctx for ListEntries RPC
2025-12-15 19:23:25 -08:00
Chris Lu
848bec6d24 Metrics: Add Prometheus metrics for concurrent upload tracking (#7555)
* metrics: add Prometheus metrics for concurrent upload tracking

Add Prometheus metrics to monitor concurrent upload activity for both
filer and S3 servers. This provides visibility into the upload limiting
feature added in the previous PR.

New Metrics:
- SeaweedFS_filer_in_flight_upload_bytes: Current bytes being uploaded to filer
- SeaweedFS_filer_in_flight_upload_count: Current number of uploads to filer
- SeaweedFS_s3_in_flight_upload_bytes: Current bytes being uploaded to S3
- SeaweedFS_s3_in_flight_upload_count: Current number of uploads to S3

The metrics are updated atomically whenever uploads start or complete,
providing real-time visibility into upload concurrency levels.

This helps operators:
- Monitor upload concurrency in real-time
- Set appropriate limits based on actual usage patterns
- Detect potential bottlenecks or capacity issues
- Track the effectiveness of upload limiting configuration

* grafana: add dashboard panels for concurrent upload metrics

Add 4 new panels to the Grafana dashboard to visualize the concurrent
upload metrics added in this PR:

Filer Section:
- Filer Concurrent Uploads: Shows current number of concurrent uploads
- Filer Concurrent Upload Bytes: Shows current bytes being uploaded

S3 Gateway Section:
- S3 Concurrent Uploads: Shows current number of concurrent uploads
- S3 Concurrent Upload Bytes: Shows current bytes being uploaded

These panels help operators monitor upload concurrency in real-time and
tune the upload limiting configuration based on actual usage patterns.

* more efficient
2025-11-26 15:51:38 -08:00
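
Note on 848bec6d24: the in-flight gauges described in this commit rise when an upload starts and fall when it completes. A minimal stdlib sketch of that bookkeeping follows; `inFlight`, `start`, and `snapshot` are hypothetical names, and exporting the two values as gauges is left out.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// inFlight tracks the uploads currently in progress.
type inFlight struct {
	bytes int64
	count int64
}

// start registers an upload of n bytes and returns a function that
// must be called when the upload completes, whether it succeeded or not.
func (f *inFlight) start(n int64) (done func()) {
	atomic.AddInt64(&f.bytes, n)
	atomic.AddInt64(&f.count, 1)
	var once sync.Once
	return func() {
		// once guards against double-decrement if done is called twice.
		once.Do(func() {
			atomic.AddInt64(&f.bytes, -n)
			atomic.AddInt64(&f.count, -1)
		})
	}
}

// snapshot returns the values a gauge would report right now.
func (f *inFlight) snapshot() (bytes, count int64) {
	return atomic.LoadInt64(&f.bytes), atomic.LoadInt64(&f.count)
}
```

A handler would typically call `done := f.start(size)` and then `defer done()`, so the gauges fall back even when the upload returns early with an error.
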
Chris Lu
5f7a292334 add build info metrics (#7525)
* add build info metrics

* unused

* metrics on build

* size limit

* once
2025-11-21 16:55:28 -08:00
Konstantin Lebedev
93007c1842 [volume] refactor and add metrics for flight upload and download data limit condition (#6920)
* refactor concurrentDownloadLimit

* fix loop

* fix cmdServer

* fix: resolve conversation pr 6920

* Changes logging function (#6919)

* updated logging methods for stores

* updated logging methods for stores

* updated logging methods for filer

* updated logging methods for uploader and http_util

* updated logging methods for weed server

---------

Co-authored-by: akosov <a.kosov@kryptonite.ru>

* Improve lock ring (#6921)

* fix flaky lock ring test

* add more tests

* fix: build

* fix: rm import util/version

* fix: serverOptions

* refactoring

---------

Co-authored-by: Aleksey Kosov <rusyak777@list.ru>
Co-authored-by: akosov <a.kosov@kryptonite.ru>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: chrislu <chris.lu@gmail.com>
2025-07-02 18:03:49 -07:00
Aleksey Kosov
ef4eda0761 added re-generating and writing the Volume UUID if it is empty (#6568) 2025-02-24 07:58:43 -08:00
zouyixiong
8eab76c5db fix: record and delete bucket metrics after inactive (#6523)
Co-authored-by: XYZ <XYZ>
2025-02-07 10:26:39 -08:00
Hadi Zamani
a2330f624b Add metrics for uploaded and deleted s3 objects (#6475) 2025-01-25 21:55:06 -08:00
Hadi Zamani
c7ae969c06 Add bucket's traffic metrics (#6444)
* Add bucket's traffic metrics

* Add bucket traffic to dashboards

* Fix bucket metrics help messages

* Fix variable names
2025-01-16 08:23:35 -08:00
zouyixiong
d6f3e1970d fix: filer may crash by bucketLastActiveTsNs concurrency access. (#6350) 2024-12-13 05:30:21 -08:00
zouyixiong
9987a65e8a fix: record and delete bucket metrics after inactive (#6349) 2024-12-12 20:34:02 -08:00
Konstantin Lebedev
167b50be88 fix missing register master metric MasterPickForWriteErrorCounter (#6277) 2024-11-25 08:59:11 -08:00
wyang
a7973ed7d1 fix deadlock hang when broadcast to clients (#6184)
fix deadlock when broadcast to clients

When the master transfers leadership, the old master disconnects from all
filers and volume servers. If the cluster is large, the number of broadcast
messages may exceed the channel's maximum length of 100; if KeepConnect
is not listening on the channel during the disconnect, the broadcast
deadlocks and the whole cluster stops serving.
2024-11-03 23:20:48 -08:00
steve.wei
cfbe45c765 feat: add in-flight metric for s3/file/volume-server (#6120) 2024-10-14 12:10:05 -07:00
Konstantin Lebedev
67a252ee8a [master] refactor func ShouldGrowVolumes (#5884) 2024-09-04 08:16:44 -07:00
Konstantin Lebedev
b2ffcdaab2 [master] do sync grow request only if absolutely necessary (#5821)
* do sync grow request only if absolutely necessary
https://github.com/seaweedfs/seaweedfs/pull/5819

* remove check VolumeGrowStrategy Threshold on PickForWrite

* fix fmt.Errorf
2024-07-30 13:21:35 -07:00
Konstantin Lebedev
33964fa292 metrics stats of volume layout depends on the data center (#5775)
stats volume layout depends on the data center
2024-07-12 12:32:25 -07:00
steve.wei
0bdf121e51 rename VolumeServerVolumeGauge (#5504) 2024-04-17 04:49:50 -07:00
Konstantin Lebedev
d42a04cceb [s3] fix s3 test_multipart_resend_first_finishes_last (#5471)
* try fix s3 test
https://github.com/seaweedfs/seaweedfs/pull/5466

* add error handler metrics

* refactor

* refactor multipartExt

* delete bad entry parts
2024-04-06 10:56:39 -07:00
Konstantin Lebedev
dc9568fc0d [master] add test for PickForWrite add metrics for volume layout (#5413) 2024-03-22 07:39:11 -07:00
Konstantin Lebedev
a7fc723ae0 chore: add status code for request_total metrics (#5188) 2024-01-10 10:05:27 -08:00
chrislu
81f11883e3 go fmt 2023-11-26 11:47:20 -08:00
chrislu
6ebe26a765 Revert "Revert "Revert "Add disk type to prometheus metrics" (#4777)""
This reverts commit 567d788928.
2023-10-03 08:28:52 -07:00
chrislu
567d788928 Revert "Revert "Add disk type to prometheus metrics" (#4777)"
This reverts commit 9215ba24be.
2023-10-02 11:49:54 -07:00
Nico D'Cotta
796b7508f3 Implement SRV lookups for filer (#4767) 2023-08-24 07:08:56 -07:00
SmsS4
f61490966f Add time to first byte metric for s3 (#4768)
* Add time to first byte metric for s3

* Change second to millisecond
2023-08-21 00:42:39 -07:00
Kevin Liu
244385bf0d Fix binding metrics to ipv6 (#4286)
* Fix binding metrics to ipv6

* Update weed/stats/metrics.go

---------

Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
2023-03-08 01:04:02 -08:00
Jiffs Maverick
4b0430e71d [metrics] Add the ability to control bind ip (#4012) 2022-11-24 10:22:59 -08:00
Konstantin Lebedev
1f7e52c63e vacuum metrics and force sync dst files (#3832) 2022-10-13 00:51:20 -07:00
Konstantin Lebedev
5db25a8f2a metric shows who is currently blocking the cluster or not (#3799)
* master_admin_lock Shows whether cluster is locked now or not
https://github.com/seaweedfs/seaweedfs/issues/3452

* fix metric MasterAdminLock
2022-10-07 13:26:29 -07:00
Konstantin Lebedev
a522507f95 configure raft metrics (#3798) 2022-10-07 04:20:34 -07:00
chrislu
26dbc6c905 move to https://github.com/seaweedfs/seaweedfs 2022-07-29 00:17:28 -07:00
Evgeny Kuzhelev
ac5ce31278 leave a note to refactor after prometheus upgrade 2022-07-27 16:48:56 +05:00
Evgeny Kuzhelev
47c72e6f35 remove all (currently existing) collection volume metrics 2022-07-27 16:31:49 +05:00
zzq09494
62f74f5e3c add bucket label to s3 prometheus metrics 2022-06-16 13:21:25 +08:00
zhihao.qu
42d04c581b feat(filer.sync): add metricsServer in filer.sync.
Metrics include:
(1) the offset of the filer.sync
(2) the last send timestamp of the filer subscription
2022-06-15 11:33:18 +08:00
Konstantin Lebedev
fb57aa431c stats master_replica_placement_mismatch 2022-06-10 15:30:40 +05:00
Konstantin Lebedev
5c9259fa3c fix metrics master name 2022-01-25 14:42:47 +05:00
Konstantin Lebedev
c9952759c4 metrics master is leader 2022-01-24 20:13:07 +05:00
Konstantin Lebedev
28efe31524 new master metrics 2022-01-24 19:09:43 +05:00
Chris Lu
0128239c0f handle ipv6 addresses 2021-09-07 16:43:54 -07:00
Jonas Falck
829b195084 Add process metrics of weed itself 2021-06-22 13:09:42 +02:00
Chris Lu
575d7952a1 add available resource stats
fix https://github.com/chrislusf/seaweedfs/issues/1555
2020-10-22 09:13:47 -07:00
Konstantin Lebedev
dc2e13092d add number of read only volumes metric 2020-10-15 15:48:40 +05:00
Chris Lu
9ab98fa912 s3 metrics adjust the label 2020-10-01 07:17:47 -07:00
Konstantin Lebedev
86329bbf2b label name is statusCode 2020-10-01 01:22:38 +05:00
Konstantin Lebedev
68463e92c1 add status code in S3RequestCounter 2020-10-01 00:59:39 +05:00
Chris Lu
4856bce0ee adjust for metrics port 2020-09-24 10:21:23 -07:00