Chris Lu 5b86d33c3c Fix worker reconnection race condition causing context canceled errors (#7825)
* Fix worker reconnection race condition causing context canceled errors

Fixes #7824

This commit fixes critical connection stability issues between the admin server
and workers that manifested as rapid reconnection cycles with 'context canceled'
errors, particularly after 24+ hours of operation in containerized environments.

Root Cause:
-----------
Race condition where TWO goroutines were calling stream.Recv() on the same
gRPC bidirectional stream concurrently:

1. sendRegistrationSync() started a goroutine that calls stream.Recv()
2. handleIncoming() also calls stream.Recv() in a loop

Per gRPC-Go's stream contract, only ONE goroutine may call Recv() on a stream
at a time. Concurrent Recv() calls cause undefined behavior, manifesting as
'context canceled' errors and stream corruption.
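
For illustration, a minimal Go sketch of the racy shape, with the stream
reduced to a bare interface (the real type in weed/worker/client.go is the
generated gRPC stream client; all names here are simplified):

package sketch

import "log"

// workerStream stands in for the worker's bidirectional gRPC stream; only
// the racy shape matters here, not the actual SeaweedFS message types.
type workerStream interface {
	Recv() (interface{}, error)
}

// BROKEN: two goroutines call Recv() on the same stream concurrently.
// grpc-go permits at most one in-flight Recv() per stream, so whichever
// goroutine wins the race steals the message meant for the other.
func registerRacy(s workerStream) {
	go func() { // sendRegistrationSync's goroutine: waits for the registration response
		if _, err := s.Recv(); err != nil {
			log.Println("registration:", err)
		}
	}()
	go func() { // handleIncoming's loop: general-purpose receiver
		for {
			if _, err := s.Recv(); err != nil {
				return
			}
		}
	}()
}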

The race occurred during worker reconnection:
- Sometimes sendRegistrationSync goroutine read the registration response first (success)
- Sometimes handleIncoming read it first, causing sendRegistrationSync to time out
- This left the stream in an inconsistent state, triggering 'context canceled' error
- The error triggered rapid reconnection attempts, creating a reconnection storm

Why it happened after 24 hours:
Container orchestration systems (Docker Swarm/Kubernetes) periodically restart
containers. Over time, workers reconnect multiple times. Each reconnection had a chance
of hitting the race condition. Eventually the race manifested and caused the
connection storm.

Changes:
--------

weed/worker/client.go:
- Start handleIncoming and handleOutgoing goroutines BEFORE sending registration
- Use sendRegistration() instead of sendRegistrationSync()
- Ensures only ONE goroutine (handleIncoming) calls stream.Recv()
- Eliminates the race condition entirely (sketched below)
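
Under the same simplified interface as the racy sketch above, the corrected
ordering looks roughly like this (a sketch, not the literal SeaweedFS code):

package sketch

import "log"

// workerStream is the same simplified stand-in, extended with Send for the
// outgoing side.
type workerStream interface {
	Send(msg interface{}) error
	Recv() (interface{}, error)
}

// FIXED: the single receive loop (handleIncoming's role) starts before
// registration is sent, and registration travels through the ordinary
// outgoing path instead of blocking on its own Recv(). Exactly one goroutine
// ever calls Recv(); the registration response arrives through that loop.
func registerFixed(s workerStream, outgoing <-chan interface{}) {
	go func() { // handleIncoming: the one and only Recv() caller
		for {
			msg, err := s.Recv()
			if err != nil {
				return
			}
			_ = msg // dispatch registration responses, task assignments, etc.
		}
	}()
	go func() { // handleOutgoing: drains the send queue, including the registration message
		for msg := range outgoing {
			if err := s.Send(msg); err != nil {
				log.Println("send:", err)
				return
			}
		}
	}()
}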

weed/admin/dash/worker_grpc_server.go:
- Clean up old connection when worker reconnects with same ID
- Cancel old connection context to stop its goroutines
- Prevents resource leaks and stale connection accumulation (see the sketch below)
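
A sketch of that cleanup, with illustrative type and field names rather than
the actual ones in worker_grpc_server.go:

package sketch

import (
	"context"
	"sync"
)

type workerConn struct {
	cancel context.CancelFunc
}

type connRegistry struct {
	mu    sync.Mutex
	conns map[string]*workerConn
}

// register replaces any existing connection for the same worker ID, first
// canceling the old connection's context so its goroutines stop.
func (r *connRegistry) register(id string, c *workerConn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if old, ok := r.conns[id]; ok {
		old.cancel() // stop the stale connection's goroutines before replacing it
	}
	if r.conns == nil {
		r.conns = make(map[string]*workerConn)
	}
	r.conns[id] = c
}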

Impact:
-------
Before: Random 'context canceled' errors during reconnection, rapid reconnection
        cycles, resource leaks, and manual restarts required to recover
After:  Reliable reconnection, single Recv() goroutine, proper cleanup,
        stable operation over 24+ hours

Testing:
--------
Verified that the build succeeds with no compilation errors.

How to reproduce the bug:
1. Start admin server and worker
2. Restart admin server (simulates container recreation)
3. Worker reconnects
4. Race condition may manifest, causing 'context canceled' error
5. Observe rapid reconnection cycles in logs

The fix is backward compatible and requires no configuration changes.

* Add MaxConnectionAge to gRPC server for Docker Swarm DNS handling

- Configure MaxConnectionAge and MaxConnectionAgeGrace for the gRPC server (sketched below)
- Expand error detection in shouldInvalidateConnection for better cache invalidation
- Add connection lifecycle logging for debugging
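
The server-side knobs come from grpc-go's keepalive package; the durations
below are illustrative placeholders, not the values chosen in this commit:

package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server that recycles connections periodically, so
// clients re-resolve DNS instead of pinning a dead Docker Swarm VIP.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      10 * time.Minute, // force a graceful reconnect after this age
		MaxConnectionAgeGrace: 30 * time.Second, // let in-flight RPCs finish first
	}))
}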

* Add topology validation and nil-safety checks

- Add validation guards in UpdateTopology to prevent invalid updates
- Add nil-safety checks in rebuildIndexes
- Add GetDiskCount method for diagnostic purposes

* Fix worker registration race condition

- Reorder goroutine startup in WorkerStream to prevent race conditions
- Add defensive cleanup in unregisterWorker with panic-safe channel closing

* Add comprehensive topology update logging

- Enhance UpdateTopologyInfo with detailed logging of datacenter/node/disk counts
- Add metrics logging for topology changes

* Add periodic diagnostic status logging

- Implement topologyStatusLoop running every 5 minutes (sketched below)
- Add logTopologyStatus function reporting system metrics
- Run as background goroutine in maintenance manager
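
A minimal sketch of such a loop; logTopologyStatus is named in this commit,
but the stop-channel wiring and signature here are assumptions:

package sketch

import "time"

// topologyStatusLoop periodically emits diagnostic status until stop is
// closed. logTopologyStatus is passed in only to keep the sketch
// self-contained; in the commit it is a method alongside this loop.
func topologyStatusLoop(stop <-chan struct{}, logTopologyStatus func()) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			logTopologyStatus() // reports datacenter/node/disk counts and other metrics
		case <-stop:
			return
		}
	}
}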

* Enhance master client connection logging

- Add connection timing logs in tryConnectToMaster
- Add reconnection attempt counting in KeepConnectedToMaster
- Improve diagnostic visibility for connection issues

* Remove unused sendRegistrationSync function

- Function is no longer called after switching to asynchronous sendRegistration
- Contains the problematic concurrent stream.Recv() pattern that caused race conditions
- Cleanup as suggested in PR review

* Clarify comment for channel closing during disconnection

- Improve comment to explain why channels are closed and their effect
- Make the code more self-documenting as suggested in PR review

* Address code review feedback: refactor and improvements

- Extract topology counting logic to shared helper function
  CountTopologyResources() to eliminate duplication between
  topology_management.go and maintenance_integration.go

- Use gRPC status codes for more robust error detection in
  shouldInvalidateConnection(), falling back to string matching
  for transport-level errors (sketched after this list)

- Add recover wrapper for channel close consistency in
  cleanupStaleConnections() to match unregisterWorker() pattern
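
A sketch of the status-code-first classification with a case-insensitive
string fallback (the concrete codes and substrings checked by
shouldInvalidateConnection may differ):

package sketch

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldInvalidate prefers gRPC status codes and falls back to substring
// checks for transport-level errors that carry no status.
func shouldInvalidate(err error) bool {
	if err == nil {
		return false
	}
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unavailable, codes.Canceled, codes.DeadlineExceeded:
			return true
		}
	}
	// Lowercase once so variants like "DNS resolution failed" and
	// "dns: no such host" are both caught.
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "connection refused") || strings.Contains(msg, "dns")
}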

* Update grpc_client_server.go

* Fix data race on lastSeen field access

- Add mutex protection around conn.lastSeen = time.Now() in WorkerStream method
- Ensures thread-safe access consistent with cleanupStaleConnections (sketched below)
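
The pattern, with illustrative names (the real code guards conn.lastSeen
inside WorkerStream and cleanupStaleConnections):

package sketch

import (
	"sync"
	"time"
)

// trackedConn shows the locking discipline: every read and write of lastSeen
// goes through the same mutex.
type trackedConn struct {
	mu       sync.Mutex
	lastSeen time.Time
}

func (c *trackedConn) touch() { // on each stream message, as in WorkerStream
	c.mu.Lock()
	c.lastSeen = time.Now()
	c.mu.Unlock()
}

func (c *trackedConn) stale(maxAge time.Duration) bool { // as in cleanupStaleConnections
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Since(c.lastSeen) > maxAge
}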

* Fix goroutine leaks in worker reconnection logic

- Close streamExit in reconnect() before creating new connection
- Close streamExit in attemptConnection() when sendRegistration fails
- Prevents orphaned handleOutgoing/handleIncoming goroutines from previous connections
- Ensures proper cleanup of goroutines competing for shared outgoing channel

* Minor cleanup improvements for consistency and clarity

- Remove redundant string checks in shouldInvalidateConnection that overlap with gRPC status codes
- Add recover block to Stop() method for consistency with other channel close operations
- Maintains valuable DNS and transport-specific error detection while eliminating redundancy

* Improve topology update error handling

- Return descriptive errors instead of silently preserving topology for invalid updates
- Change nil topologyInfo case to return 'rejected invalid topology update: nil topologyInfo'
- Change empty DataCenterInfos case to return 'rejected invalid topology update: empty DataCenterInfos (had X nodes, Y disks)'
- Keep existing glog.Warningf calls but append error details to logs before returning errors
- Allows callers to distinguish rejected updates and handle them appropriately

* Refactor safe channel closing into helper method

- Add safeCloseOutgoingChannel helper method to eliminate code duplication
- Replace repeated recover blocks in Stop, unregisterWorker, and cleanupStaleConnections
- Improves maintainability and ensures consistent error handling across all channel close operations
- Maintains the same panic recovery behavior with contextual source identification (sketched below)
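
A standalone sketch of such a helper; in the real code it is a method on the
admin server's connection manager, so the receiver and logging details here
are assumptions:

package sketch

import "log"

// safeClose closes ch, recovering if another cleanup path already closed it.
// Closing a closed channel panics, so the recover keeps Stop(),
// unregisterWorker(), and the stale-connection sweep mutually idempotent.
func safeClose(ch chan struct{}, source string) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("%s: outgoing channel already closed: %v", source, r)
		}
	}()
	close(ch)
}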

* Make connection invalidation string matching case-insensitive

- Convert error string to lowercase once for all string.Contains checks
- Improves robustness by catching error message variations from different sources
- Eliminates need for separate 'DNS resolution' and 'dns' checks
- Maintains same error detection coverage with better reliability

* Clean up warning logs in UpdateTopology to avoid duplicating error text

- Remove duplicated error phrases from glog.Warningf messages
- Keep concise contextual warnings that don't repeat the fmt.Errorf content
- Maintain same error returns for backward compatibility

* Add robust validation to prevent topology wipeout during master restart

- Reject topology updates with 0 nodes when the current topology has nodes (sketched below)
- Prevents transient empty topology from overwriting valid state
- Improves resilience during master restart scenarios
- Maintains backward compatibility for legitimate empty topology updates
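
Reduced to its essentials, the guard looks roughly like this (types and
counts are illustrative; the real check inspects the incoming topology
protobuf):

package sketch

import "fmt"

// validateTopologyUpdate rejects a transiently empty topology so that a
// restarting master cannot wipe out known-good state.
func validateTopologyUpdate(currentNodes, incomingNodes int) error {
	if currentNodes > 0 && incomingNodes == 0 {
		return fmt.Errorf("rejected invalid topology update: 0 nodes while current topology has %d", currentNodes)
	}
	return nil
}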

SeaweedFS Admin Component

A modern web-based administration interface for SeaweedFS clusters built with Go, Gin, Templ, and Bootstrap.

Features

  • Dashboard: Real-time cluster status and metrics
  • Master Management: Monitor master nodes and leadership status
  • Volume Server Management: View volume servers, capacity, and health
  • Object Store Bucket Management: Create, delete, and manage Object Store buckets through the web interface
  • System Health: Overall cluster health monitoring
  • Responsive Design: Bootstrap-based UI that works on all devices
  • Authentication: Optional user authentication with sessions
  • TLS Support: HTTPS support for production deployments

Building

Using the Admin Makefile

The admin component has its own Makefile for development and building:

# Navigate to admin directory
cd weed/admin

# View all available targets
make help

# Generate templates and build
make build

# Development mode with template watching
make dev

# Run the admin server
make run

# Clean build artifacts
make clean

Using the Root Makefile

The root SeaweedFS Makefile automatically integrates the admin component:

# From the root directory
make install          # Builds weed with admin component
make full_install     # Full build with all tags
make test            # Runs tests including admin component

# Admin-specific targets from root
make admin-generate  # Generate admin templates
make admin-build     # Build admin component
make admin-run       # Run admin server
make admin-dev       # Development mode
make admin-clean     # Clean admin artifacts

Manual Building

If you prefer to build manually:

# Install templ compiler
go install github.com/a-h/templ/cmd/templ@latest

# Generate templates
templ generate

# Build the main weed binary (from the repository root)
cd ../..
go build -o weed ./weed

Development

Template Development

The admin interface uses Templ for type-safe HTML templates:

# Watch for template changes and auto-regenerate
make watch

# Or manually generate templates
make generate

# Format templates
make fmt

File Structure

weed/admin/
├── Makefile              # Admin-specific build tasks
├── README.md             # This file
├── S3_BUCKETS.md         # Object Store bucket management documentation
├── admin.go              # Main application entry point
├── dash/                 # Server and handler logic
│   ├── admin_server.go   # HTTP server setup
│   ├── handler_admin.go  # Admin dashboard handlers
│   ├── handler_auth.go   # Authentication handlers
│   └── middleware.go     # HTTP middleware
├── static/               # Static assets
│   ├── css/admin.css     # Admin-specific styles
│   └── js/admin.js       # Admin-specific JavaScript
└── view/                 # Templates
    ├── app/              # Application templates
    │   ├── admin.templ   # Main dashboard template
    │   ├── s3_buckets.templ # Object Store bucket management template
    │   └── *_templ.go    # Generated Go code
    └── layout/           # Layout templates
        ├── layout.templ  # Base layout template
        └── layout_templ.go # Generated Go code

S3 Bucket Management

The admin interface includes comprehensive Object Store bucket management capabilities. See S3_BUCKETS.md for detailed documentation on:

  • Creating and deleting Object Store buckets
  • Viewing bucket contents and metadata
  • Managing bucket permissions and settings
  • API endpoints for programmatic access

Usage

Basic Usage

# Start admin interface on default port (23646)
weed admin

# Start with custom configuration
weed admin -port=8080 -masters="master1:9333,master2:9333"

# Start with authentication
weed admin -adminUser=admin -adminPassword=secret123

# Start with HTTPS
weed admin -port=443 -tlsCert=/path/to/cert.pem -tlsKey=/path/to/key.pem

Configuration Options

Option          Default         Description
-port           23646           Admin server port
-masters        localhost:9333  Comma-separated master servers
-adminUser      admin           Admin username (if auth enabled)
-adminPassword  ""              Admin password (empty = no auth)
-tlsCert        ""              Path to TLS certificate
-tlsKey         ""              Path to TLS private key

Docker Usage

# Build Docker image with admin component
make docker-build

# Run with Docker
docker run -p 23646:23646 seaweedfs/seaweedfs:latest admin -masters=host.docker.internal:9333

Development Workflow

Quick Start

# Clone and setup
git clone <seaweedfs-repo>
cd seaweedfs/weed/admin

# Install dependencies and build
make install-deps
make build

# Start development server
make dev

Making Changes

  1. Template Changes: Edit .templ files in view/

    • Templates auto-regenerate in development mode
    • Use make generate to manually regenerate
  2. Go Code Changes: Edit .go files

    • Restart the server to see changes
    • Use make build to rebuild
  3. Static Assets: Edit files in static/

    • Changes are served immediately

Testing

# Run admin component tests
make test

# Run from root directory
make admin-test

# Lint code
make lint

# Format code
make fmt

Production Deployment

Security Considerations

  1. Authentication: Always set adminPassword in production
  2. HTTPS: Use TLS certificates for encrypted connections
  3. Firewall: Restrict admin interface access to authorized networks

Example Production Setup

# Production deployment with security
weed admin \
  -port=443 \
  -masters="master1:9333,master2:9333,master3:9333" \
  -adminUser=admin \
  -adminPassword=your-secure-password \
  -tlsCert=/etc/ssl/certs/admin.crt \
  -tlsKey=/etc/ssl/private/admin.key

Monitoring

The admin interface provides endpoints for monitoring; a small probe example follows the list:

  • GET /health - Health check endpoint
  • GET /metrics - Prometheus metrics (if enabled)
  • GET /api/status - JSON status information
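
For scripted checks, a minimal Go probe against the health endpoint might
look like this (default port and plain HTTP assumed):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Probe the admin server's health endpoint (default port 23646).
	resp, err := http.Get("http://localhost:23646/health")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("admin health:", resp.Status)
}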

Troubleshooting

Common Issues

  1. Templates not found: Run make generate to create template files
  2. Build errors: Ensure templ is installed with make install-templ
  3. Static files not loading: Check that static/ directory exists and has proper files
  4. Connection errors: Verify master and filer addresses are correct

Debug Mode

# Enable debug logging
weed -v=2 admin

# Check generated templates
ls -la view/app/*_templ.go view/layout/*_templ.go

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: make test
  5. Format code: make fmt
  6. Submit a pull request

Architecture

The admin component follows a clean architecture:

  • Presentation Layer: Templ templates + Bootstrap CSS
  • HTTP Layer: Gin router with middleware
  • Business Logic: Handler functions in dash/ package
  • Data Layer: Communicates with SeaweedFS masters and filers

This separation makes the code maintainable and testable.