cleanup
This commit is contained in:
@@ -1,242 +0,0 @@
|
|||||||
# Bucket Policy Engine Integration - Complete
|
|
||||||
|
|
||||||
## Summary
|
|
||||||
|
|
||||||
Successfully integrated the `policy_engine` package to evaluate bucket policies for **all requests** (both anonymous and authenticated). This provides comprehensive AWS S3-compatible bucket policy support.
|
|
||||||
|
|
||||||
## What Changed
|
|
||||||
|
|
||||||
### 1. **New File: `s3api_bucket_policy_engine.go`**
|
|
||||||
Created a wrapper around `policy_engine.PolicyEngine` to:
|
|
||||||
- Load bucket policies from filer entries
|
|
||||||
- Sync policies from the bucket config cache
|
|
||||||
- Evaluate policies for any request (bucket, object, action, principal)
|
|
||||||
- Return structured results (allowed, evaluated, error)
|
|
||||||
|
|
||||||
### 2. **Modified: `s3api_server.go`**
|
|
||||||
- Added `policyEngine *BucketPolicyEngine` field to `S3ApiServer` struct
|
|
||||||
- Initialized the policy engine in `NewS3ApiServerWithStore()`
|
|
||||||
- Linked `IdentityAccessManagement` back to `S3ApiServer` for policy evaluation
|
|
||||||
|
|
||||||
### 3. **Modified: `auth_credentials.go`**
|
|
||||||
- Added `s3ApiServer *S3ApiServer` field to `IdentityAccessManagement` struct
|
|
||||||
- Added `buildPrincipalARN()` helper to convert identities to AWS ARN format
|
|
||||||
- **Integrated bucket policy evaluation into the authentication flow:**
|
|
||||||
- Policies are now checked **before** IAM/identity-based permissions
|
|
||||||
- Explicit `Deny` in bucket policy blocks access immediately
|
|
||||||
- Explicit `Allow` in bucket policy grants access and **bypasses IAM checks** (enables cross-account access)
|
|
||||||
- If no policy exists, falls through to normal IAM checks
|
|
||||||
- Policy evaluation errors result in access denial (fail-close security)
|
|
||||||
|
|
||||||
### 4. **Modified: `s3api_bucket_config.go`**
|
|
||||||
- Added policy engine sync when bucket configs are loaded
|
|
||||||
- Ensures policies are loaded into the engine for evaluation
|
|
||||||
|
|
||||||
### 5. **Modified: `auth_credentials_subscribe.go`**
|
|
||||||
- Added policy engine sync when bucket metadata changes
|
|
||||||
- Keeps the policy engine up-to-date via event-driven updates
|
|
||||||
|
|
||||||
## How It Works
|
|
||||||
|
|
||||||
### Anonymous Requests
|
|
||||||
```
|
|
||||||
1. Request comes in (no credentials)
|
|
||||||
2. Check ACL-based public access → if public, allow
|
|
||||||
3. Check bucket policy for anonymous ("*") access → if allowed, allow
|
|
||||||
4. Otherwise, deny
|
|
||||||
```
|
|
||||||
|
|
||||||
### Authenticated Requests (NEW!)
|
|
||||||
```
|
|
||||||
1. Request comes in (with credentials)
|
|
||||||
2. Authenticate user → get Identity
|
|
||||||
3. Build principal ARN (e.g., "arn:aws:iam::123456:user/bob")
|
|
||||||
4. Check bucket policy:
|
|
||||||
- If DENY → reject immediately
|
|
||||||
- If ALLOW → grant access immediately (bypasses IAM checks)
|
|
||||||
- If no policy or no matching statements → continue to step 5
|
|
||||||
5. Check IAM/identity-based permissions (only if not already allowed by bucket policy)
|
|
||||||
6. Allow or deny based on identity permissions
|
|
||||||
```
|
|
||||||
|
|
||||||
## Policy Evaluation Flow
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────────────────────────────────────────────┐
|
|
||||||
│ Request (GET /bucket/file) │
|
|
||||||
└───────────────────────────┬─────────────────────────────┘
|
|
||||||
│
|
|
||||||
┌───────────▼──────────┐
|
|
||||||
│ Authenticate User │
|
|
||||||
│ (or Anonymous) │
|
|
||||||
└───────────┬──────────┘
|
|
||||||
│
|
|
||||||
┌───────────▼──────────────────────────────┐
|
|
||||||
│ Build Principal ARN │
|
|
||||||
│ - Anonymous: "*" │
|
|
||||||
│ - User: "arn:aws:iam::123456:user/bob" │
|
|
||||||
└───────────┬──────────────────────────────┘
|
|
||||||
│
|
|
||||||
┌───────────▼──────────────────────────────┐
|
|
||||||
│ Evaluate Bucket Policy (PolicyEngine) │
|
|
||||||
│ - Action: "s3:GetObject" │
|
|
||||||
│ - Resource: "arn:aws:s3:::bucket/file" │
|
|
||||||
│ - Principal: (from above) │
|
|
||||||
└───────────┬──────────────────────────────┘
|
|
||||||
│
|
|
||||||
┌─────────────┼─────────────┐
|
|
||||||
│ │ │
|
|
||||||
DENY │ ALLOW │ NO POLICY
|
|
||||||
│ │ │
|
|
||||||
▼ ▼ ▼
|
|
||||||
Reject Request Grant Access Continue
|
|
||||||
│
|
|
||||||
┌───────────────────┘
|
|
||||||
│
|
|
||||||
┌────────────▼─────────────┐
|
|
||||||
│ IAM/Identity Check │
|
|
||||||
│ (identity.canDo) │
|
|
||||||
└────────────┬─────────────┘
|
|
||||||
│
|
|
||||||
┌─────────┴─────────┐
|
|
||||||
│ │
|
|
||||||
ALLOW │ DENY │
|
|
||||||
▼ ▼
|
|
||||||
Grant Access Reject Request
|
|
||||||
```
|
|
||||||
|
|
||||||
## Example Policies That Now Work
|
|
||||||
|
|
||||||
### 1. **Public Read Access** (Anonymous)
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"Version": "2012-10-17",
|
|
||||||
"Statement": [{
|
|
||||||
"Effect": "Allow",
|
|
||||||
"Principal": "*",
|
|
||||||
"Action": "s3:GetObject",
|
|
||||||
"Resource": "arn:aws:s3:::mybucket/*"
|
|
||||||
}]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
- Anonymous users can read all objects
|
|
||||||
- Authenticated users are also evaluated against this policy. If they don't match an explicit `Allow` for this action, they will fall back to their own IAM permissions
|
|
||||||
|
|
||||||
### 2. **Grant Access to Specific User** (Authenticated)
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"Version": "2012-10-17",
|
|
||||||
"Statement": [{
|
|
||||||
"Effect": "Allow",
|
|
||||||
"Principal": {"AWS": "arn:aws:iam::123456789012:user/bob"},
|
|
||||||
"Action": ["s3:GetObject", "s3:PutObject"],
|
|
||||||
"Resource": "arn:aws:s3:::mybucket/shared/*"
|
|
||||||
}]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
- User "bob" can read/write objects in `/shared/` prefix
|
|
||||||
- Other users cannot (unless granted by their IAM policies)
|
|
||||||
|
|
||||||
### 3. **Deny Access to Specific Path** (Both)
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"Version": "2012-10-17",
|
|
||||||
"Statement": [{
|
|
||||||
"Effect": "Deny",
|
|
||||||
"Principal": "*",
|
|
||||||
"Action": "s3:*",
|
|
||||||
"Resource": "arn:aws:s3:::mybucket/confidential/*"
|
|
||||||
}]
|
|
||||||
}
|
|
||||||
```
|
|
||||||
- **No one** can access `/confidential/` objects
|
|
||||||
- Denies override all other allows (AWS policy evaluation rules)
|
|
||||||
|
|
||||||
## Performance Characteristics
|
|
||||||
|
|
||||||
### Policy Loading
|
|
||||||
- **Cold start**: Policy loaded from filer → parsed → compiled → cached
|
|
||||||
- **Warm path**: Policy retrieved from `BucketConfigCache` (already parsed)
|
|
||||||
- **Updates**: Event-driven sync via metadata subscription (real-time)
|
|
||||||
|
|
||||||
### Policy Evaluation
|
|
||||||
- **Compiled policies**: Pre-compiled regex patterns and matchers
|
|
||||||
- **Pattern cache**: Regex patterns cached with LRU eviction (max 1000)
|
|
||||||
- **Fast path**: Common patterns (`*`, exact matches) optimized
|
|
||||||
- **Case sensitivity**: Actions case-insensitive, resources case-sensitive (AWS-compatible)
|
|
||||||
|
|
||||||
### Overhead
|
|
||||||
- **Anonymous requests**: Minimal (policy already checked, now using compiled engine)
|
|
||||||
- **Authenticated requests**: ~1-2ms added for policy evaluation (compiled patterns)
|
|
||||||
- **No policy**: Near-zero overhead (quick indeterminate check)
|
|
||||||
|
|
||||||
## Testing
|
|
||||||
|
|
||||||
All tests pass:
|
|
||||||
```bash
|
|
||||||
✅ TestBucketPolicyValidationBasics
|
|
||||||
✅ TestPrincipalMatchesAnonymous
|
|
||||||
✅ TestActionToS3Action
|
|
||||||
✅ TestResourceMatching
|
|
||||||
✅ TestMatchesPatternRegexEscaping (security tests)
|
|
||||||
✅ TestActionMatchingCaseInsensitive
|
|
||||||
✅ TestResourceMatchingCaseSensitive
|
|
||||||
✅ All policy_engine package tests (30+ tests)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Security Improvements
|
|
||||||
|
|
||||||
1. **Regex Metacharacter Escaping**: Patterns like `*.json` properly match only files ending in `.json` (not `filexjson`)
|
|
||||||
2. **Case-Insensitive Actions**: S3 actions matched case-insensitively per AWS spec
|
|
||||||
3. **Case-Sensitive Resources**: Resource paths matched case-sensitively for security
|
|
||||||
4. **Pattern Cache Size Limit**: Prevents DoS attacks via unbounded cache growth
|
|
||||||
5. **Principal Validation**: Supports `[]string` for manually constructed policies
|
|
||||||
|
|
||||||
## AWS Compatibility
|
|
||||||
|
|
||||||
The implementation follows AWS S3 bucket policy evaluation rules:
|
|
||||||
1. **Explicit Deny** always wins (checked first)
|
|
||||||
2. **Explicit Allow** grants access (checked second)
|
|
||||||
3. **Default Deny** if no matching statements (implicit)
|
|
||||||
4. Bucket policies work alongside IAM policies (both are evaluated)
|
|
||||||
|
|
||||||
## Files Changed
|
|
||||||
|
|
||||||
```
|
|
||||||
Modified:
|
|
||||||
weed/s3api/auth_credentials.go (+47 lines)
|
|
||||||
weed/s3api/auth_credentials_subscribe.go (+8 lines)
|
|
||||||
weed/s3api/s3api_bucket_config.go (+8 lines)
|
|
||||||
weed/s3api/s3api_server.go (+5 lines)
|
|
||||||
|
|
||||||
New:
|
|
||||||
weed/s3api/s3api_bucket_policy_engine.go (115 lines)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Migration Notes
|
|
||||||
|
|
||||||
- **Backward Compatible**: Existing setups without bucket policies work unchanged
|
|
||||||
- **No Breaking Changes**: All existing ACL and IAM-based authorization still works
|
|
||||||
- **Additive Feature**: Bucket policies are an additional layer of authorization
|
|
||||||
- **Performance**: Minimal impact on existing workloads
|
|
||||||
|
|
||||||
## Future Enhancements
|
|
||||||
|
|
||||||
Potential improvements (not implemented yet):
|
|
||||||
- [ ] Condition support (IP address, time-based, etc.) - already in policy_engine
|
|
||||||
- [ ] Cross-account policies (different AWS accounts)
|
|
||||||
- [ ] Policy validation API endpoint
|
|
||||||
- [ ] Policy simulation/testing tool
|
|
||||||
- [ ] Metrics for policy evaluations (allow/deny counts)
|
|
||||||
|
|
||||||
## Conclusion
|
|
||||||
|
|
||||||
Bucket policies now work for **all requests** in SeaweedFS S3 API:
|
|
||||||
- ✅ Anonymous requests (public access)
|
|
||||||
- ✅ Authenticated requests (user-specific policies)
|
|
||||||
- ✅ High performance (compiled policies, caching)
|
|
||||||
- ✅ AWS-compatible (follows AWS evaluation rules)
|
|
||||||
- ✅ Secure (proper escaping, case sensitivity)
|
|
||||||
|
|
||||||
The integration is complete, tested, and ready for use!
|
|
||||||
|
|
||||||
413
DESIGN.md
413
DESIGN.md
@@ -1,413 +0,0 @@
|
|||||||
# SeaweedFS Task Distribution System Design
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
This document describes the design of a distributed task management system for SeaweedFS that handles Erasure Coding (EC) and vacuum operations through a scalable admin server and worker process architecture.
|
|
||||||
|
|
||||||
## System Architecture
|
|
||||||
|
|
||||||
### High-Level Components
|
|
||||||
|
|
||||||
```
|
|
||||||
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
|
|
||||||
│ Master │◄──►│ Admin Server │◄──►│ Workers │
|
|
||||||
│ │ │ │ │ │
|
|
||||||
│ - Volume Info │ │ - Task Discovery │ │ - Task Exec │
|
|
||||||
│ - Shard Status │ │ - Task Assign │ │ - Progress │
|
|
||||||
│ - Heartbeats │ │ - Progress Track │ │ - Error Report │
|
|
||||||
└─────────────────┘ └──────────────────┘ └─────────────────┘
|
|
||||||
│ │ │
|
|
||||||
│ │ │
|
|
||||||
▼ ▼ ▼
|
|
||||||
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
|
|
||||||
│ Volume Servers │ │ Volume Monitor │ │ Task Execution │
|
|
||||||
│ │ │ │ │ │
|
|
||||||
│ - Store Volumes │ │ - Health Check │ │ - EC Convert │
|
|
||||||
│ - EC Shards │ │ - Usage Stats │ │ - Vacuum Clean │
|
|
||||||
│ - Report Status │ │ - State Sync │ │ - Status Report │
|
|
||||||
└─────────────────┘ └──────────────────┘ └─────────────────┘
|
|
||||||
```
|
|
||||||
|
|
||||||
## 1. Admin Server Design
|
|
||||||
|
|
||||||
### 1.1 Core Responsibilities
|
|
||||||
|
|
||||||
- **Task Discovery**: Scan volumes to identify EC and vacuum candidates
|
|
||||||
- **Worker Management**: Track available workers and their capabilities
|
|
||||||
- **Task Assignment**: Match tasks to optimal workers
|
|
||||||
- **Progress Tracking**: Monitor in-progress tasks for capacity planning
|
|
||||||
- **State Reconciliation**: Sync with master server for volume state updates
|
|
||||||
|
|
||||||
### 1.2 Task Discovery Engine
|
|
||||||
|
|
||||||
```go
|
|
||||||
type TaskDiscoveryEngine struct {
|
|
||||||
masterClient MasterClient
|
|
||||||
volumeScanner VolumeScanner
|
|
||||||
taskDetectors map[TaskType]TaskDetector
|
|
||||||
scanInterval time.Duration
|
|
||||||
}
|
|
||||||
|
|
||||||
type VolumeCandidate struct {
|
|
||||||
VolumeID uint32
|
|
||||||
Server string
|
|
||||||
Collection string
|
|
||||||
TaskType TaskType
|
|
||||||
Priority TaskPriority
|
|
||||||
Reason string
|
|
||||||
DetectedAt time.Time
|
|
||||||
Parameters map[string]interface{}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**EC Detection Logic**:
|
|
||||||
- Find volumes >= 95% full and idle for > 1 hour
|
|
||||||
- Exclude volumes already in EC format
|
|
||||||
- Exclude volumes with ongoing operations
|
|
||||||
- Prioritize by collection and age
|
|
||||||
|
|
||||||
**Vacuum Detection Logic**:
|
|
||||||
- Find volumes with garbage ratio > 30%
|
|
||||||
- Exclude read-only volumes
|
|
||||||
- Exclude volumes with recent vacuum operations
|
|
||||||
- Prioritize by garbage percentage
|
|
||||||
|
|
||||||
### 1.3 Worker Registry & Management
|
|
||||||
|
|
||||||
```go
|
|
||||||
type WorkerRegistry struct {
|
|
||||||
workers map[string]*Worker
|
|
||||||
capabilities map[TaskType][]*Worker
|
|
||||||
lastHeartbeat map[string]time.Time
|
|
||||||
taskAssignment map[string]*Task
|
|
||||||
mutex sync.RWMutex
|
|
||||||
}
|
|
||||||
|
|
||||||
type Worker struct {
|
|
||||||
ID string
|
|
||||||
Address string
|
|
||||||
Capabilities []TaskType
|
|
||||||
MaxConcurrent int
|
|
||||||
CurrentLoad int
|
|
||||||
Status WorkerStatus
|
|
||||||
LastSeen time.Time
|
|
||||||
Performance WorkerMetrics
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 1.4 Task Assignment Algorithm
|
|
||||||
|
|
||||||
```go
|
|
||||||
type TaskScheduler struct {
|
|
||||||
registry *WorkerRegistry
|
|
||||||
taskQueue *PriorityQueue
|
|
||||||
inProgressTasks map[string]*InProgressTask
|
|
||||||
volumeReservations map[uint32]*VolumeReservation
|
|
||||||
}
|
|
||||||
|
|
||||||
// Worker Selection Criteria:
|
|
||||||
// 1. Has required capability (EC or Vacuum)
|
|
||||||
// 2. Available capacity (CurrentLoad < MaxConcurrent)
|
|
||||||
// 3. Best performance history for task type
|
|
||||||
// 4. Lowest current load
|
|
||||||
// 5. Geographically close to volume server (optional)
|
|
||||||
```
|
|
||||||
|
|
||||||
## 2. Worker Process Design
|
|
||||||
|
|
||||||
### 2.1 Worker Architecture
|
|
||||||
|
|
||||||
```go
|
|
||||||
type MaintenanceWorker struct {
|
|
||||||
id string
|
|
||||||
config *WorkerConfig
|
|
||||||
adminClient AdminClient
|
|
||||||
taskExecutors map[TaskType]TaskExecutor
|
|
||||||
currentTasks map[string]*RunningTask
|
|
||||||
registry *TaskRegistry
|
|
||||||
heartbeatTicker *time.Ticker
|
|
||||||
requestTicker *time.Ticker
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2.2 Task Execution Framework
|
|
||||||
|
|
||||||
```go
|
|
||||||
type TaskExecutor interface {
|
|
||||||
Execute(ctx context.Context, task *Task) error
|
|
||||||
EstimateTime(task *Task) time.Duration
|
|
||||||
ValidateResources(task *Task) error
|
|
||||||
GetProgress() float64
|
|
||||||
Cancel() error
|
|
||||||
}
|
|
||||||
|
|
||||||
type ErasureCodingExecutor struct {
|
|
||||||
volumeClient VolumeServerClient
|
|
||||||
progress float64
|
|
||||||
cancelled bool
|
|
||||||
}
|
|
||||||
|
|
||||||
type VacuumExecutor struct {
|
|
||||||
volumeClient VolumeServerClient
|
|
||||||
progress float64
|
|
||||||
cancelled bool
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2.3 Worker Capabilities & Registration
|
|
||||||
|
|
||||||
```go
|
|
||||||
type WorkerCapabilities struct {
|
|
||||||
SupportedTasks []TaskType
|
|
||||||
MaxConcurrent int
|
|
||||||
ResourceLimits ResourceLimits
|
|
||||||
PreferredServers []string // Affinity for specific volume servers
|
|
||||||
}
|
|
||||||
|
|
||||||
type ResourceLimits struct {
|
|
||||||
MaxMemoryMB int64
|
|
||||||
MaxDiskSpaceMB int64
|
|
||||||
MaxNetworkMbps int64
|
|
||||||
MaxCPUPercent float64
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 3. Task Lifecycle Management
|
|
||||||
|
|
||||||
### 3.1 Task States
|
|
||||||
|
|
||||||
```go
|
|
||||||
type TaskState string
|
|
||||||
|
|
||||||
const (
|
|
||||||
TaskStatePending TaskState = "pending"
|
|
||||||
TaskStateAssigned TaskState = "assigned"
|
|
||||||
TaskStateInProgress TaskState = "in_progress"
|
|
||||||
TaskStateCompleted TaskState = "completed"
|
|
||||||
TaskStateFailed TaskState = "failed"
|
|
||||||
TaskStateCancelled TaskState = "cancelled"
|
|
||||||
TaskStateStuck TaskState = "stuck" // Taking too long
|
|
||||||
TaskStateDuplicate TaskState = "duplicate" // Detected duplicate
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.2 Progress Tracking & Monitoring
|
|
||||||
|
|
||||||
```go
|
|
||||||
type InProgressTask struct {
|
|
||||||
Task *Task
|
|
||||||
WorkerID string
|
|
||||||
StartedAt time.Time
|
|
||||||
LastUpdate time.Time
|
|
||||||
Progress float64
|
|
||||||
EstimatedEnd time.Time
|
|
||||||
VolumeReserved bool // Reserved for capacity planning
|
|
||||||
}
|
|
||||||
|
|
||||||
type TaskMonitor struct {
|
|
||||||
inProgressTasks map[string]*InProgressTask
|
|
||||||
timeoutChecker *time.Ticker
|
|
||||||
stuckDetector *time.Ticker
|
|
||||||
duplicateChecker *time.Ticker
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 4. Volume Capacity Reconciliation
|
|
||||||
|
|
||||||
### 4.1 Volume State Tracking
|
|
||||||
|
|
||||||
```go
|
|
||||||
type VolumeStateManager struct {
|
|
||||||
masterClient MasterClient
|
|
||||||
inProgressTasks map[uint32]*InProgressTask // VolumeID -> Task
|
|
||||||
committedChanges map[uint32]*VolumeChange // Changes not yet in master
|
|
||||||
reconcileInterval time.Duration
|
|
||||||
}
|
|
||||||
|
|
||||||
type VolumeChange struct {
|
|
||||||
VolumeID uint32
|
|
||||||
ChangeType ChangeType // "ec_encoding", "vacuum_completed"
|
|
||||||
OldCapacity int64
|
|
||||||
NewCapacity int64
|
|
||||||
TaskID string
|
|
||||||
CompletedAt time.Time
|
|
||||||
ReportedToMaster bool
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 4.2 Shard Assignment Integration
|
|
||||||
|
|
||||||
When the master needs to assign shards, it must consider:
|
|
||||||
1. **Current volume state** from its own records
|
|
||||||
2. **In-progress capacity changes** from admin server
|
|
||||||
3. **Committed but unreported changes** from admin server
|
|
||||||
|
|
||||||
```go
|
|
||||||
type CapacityOracle struct {
|
|
||||||
adminServer AdminServerClient
|
|
||||||
masterState *MasterVolumeState
|
|
||||||
updateFreq time.Duration
|
|
||||||
}
|
|
||||||
|
|
||||||
func (o *CapacityOracle) GetAdjustedCapacity(volumeID uint32) int64 {
|
|
||||||
baseCapacity := o.masterState.GetCapacity(volumeID)
|
|
||||||
|
|
||||||
// Adjust for in-progress tasks
|
|
||||||
if task := o.adminServer.GetInProgressTask(volumeID); task != nil {
|
|
||||||
switch task.Type {
|
|
||||||
case TaskTypeErasureCoding:
|
|
||||||
// EC reduces effective capacity
|
|
||||||
return baseCapacity / 2 // Simplified
|
|
||||||
case TaskTypeVacuum:
|
|
||||||
// Vacuum may increase available space
|
|
||||||
return baseCapacity + int64(float64(baseCapacity) * 0.3)
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Adjust for completed but unreported changes
|
|
||||||
if change := o.adminServer.GetPendingChange(volumeID); change != nil {
|
|
||||||
return change.NewCapacity
|
|
||||||
}
|
|
||||||
|
|
||||||
return baseCapacity
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 5. Error Handling & Recovery
|
|
||||||
|
|
||||||
### 5.1 Worker Failure Scenarios
|
|
||||||
|
|
||||||
```go
|
|
||||||
type FailureHandler struct {
|
|
||||||
taskRescheduler *TaskRescheduler
|
|
||||||
workerMonitor *WorkerMonitor
|
|
||||||
alertManager *AlertManager
|
|
||||||
}
|
|
||||||
|
|
||||||
// Failure Scenarios:
|
|
||||||
// 1. Worker becomes unresponsive (heartbeat timeout)
|
|
||||||
// 2. Task execution fails (reported by worker)
|
|
||||||
// 3. Task gets stuck (progress timeout)
|
|
||||||
// 4. Duplicate task detection
|
|
||||||
// 5. Resource exhaustion
|
|
||||||
```
|
|
||||||
|
|
||||||
### 5.2 Recovery Strategies
|
|
||||||
|
|
||||||
**Worker Timeout Recovery**:
|
|
||||||
- Mark worker as inactive after 3 missed heartbeats
|
|
||||||
- Reschedule all assigned tasks to other workers
|
|
||||||
- Cleanup any partial state
|
|
||||||
|
|
||||||
**Task Stuck Recovery**:
|
|
||||||
- Detect tasks with no progress for > 2x estimated time
|
|
||||||
- Cancel stuck task and mark volume for cleanup
|
|
||||||
- Reschedule if retry count < max_retries
|
|
||||||
|
|
||||||
**Duplicate Task Prevention**:
|
|
||||||
```go
|
|
||||||
type DuplicateDetector struct {
|
|
||||||
activeFingerprints map[string]bool // VolumeID+TaskType
|
|
||||||
recentCompleted *LRUCache // Recently completed tasks
|
|
||||||
}
|
|
||||||
|
|
||||||
func (d *DuplicateDetector) IsTaskDuplicate(task *Task) bool {
|
|
||||||
fingerprint := fmt.Sprintf("%d-%s", task.VolumeID, task.Type)
|
|
||||||
return d.activeFingerprints[fingerprint] ||
|
|
||||||
d.recentCompleted.Contains(fingerprint)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## 6. Simulation & Testing Framework
|
|
||||||
|
|
||||||
### 6.1 Failure Simulation
|
|
||||||
|
|
||||||
```go
|
|
||||||
type TaskSimulator struct {
|
|
||||||
scenarios map[string]SimulationScenario
|
|
||||||
}
|
|
||||||
|
|
||||||
type SimulationScenario struct {
|
|
||||||
Name string
|
|
||||||
WorkerCount int
|
|
||||||
VolumeCount int
|
|
||||||
FailurePatterns []FailurePattern
|
|
||||||
Duration time.Duration
|
|
||||||
}
|
|
||||||
|
|
||||||
type FailurePattern struct {
|
|
||||||
Type FailureType // "worker_timeout", "task_stuck", "duplicate"
|
|
||||||
Probability float64 // 0.0 to 1.0
|
|
||||||
Timing TimingSpec // When during task execution
|
|
||||||
Duration time.Duration
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 6.2 Test Scenarios
|
|
||||||
|
|
||||||
**Scenario 1: Worker Timeout During EC**
|
|
||||||
- Start EC task on 30GB volume
|
|
||||||
- Kill worker at 50% progress
|
|
||||||
- Verify task reassignment
|
|
||||||
- Verify no duplicate EC operations
|
|
||||||
|
|
||||||
**Scenario 2: Stuck Vacuum Task**
|
|
||||||
- Start vacuum on high-garbage volume
|
|
||||||
- Simulate worker hanging at 75% progress
|
|
||||||
- Verify timeout detection and cleanup
|
|
||||||
- Verify volume state consistency
|
|
||||||
|
|
||||||
**Scenario 3: Duplicate Task Prevention**
|
|
||||||
- Submit same EC task from multiple sources
|
|
||||||
- Verify only one task executes
|
|
||||||
- Verify proper conflict resolution
|
|
||||||
|
|
||||||
**Scenario 4: Master-Admin State Divergence**
|
|
||||||
- Create in-progress EC task
|
|
||||||
- Simulate master restart
|
|
||||||
- Verify state reconciliation
|
|
||||||
- Verify shard assignment accounts for in-progress work
|
|
||||||
|
|
||||||
## 7. Performance & Scalability
|
|
||||||
|
|
||||||
### 7.1 Metrics & Monitoring
|
|
||||||
|
|
||||||
```go
|
|
||||||
type SystemMetrics struct {
|
|
||||||
TasksPerSecond float64
|
|
||||||
WorkerUtilization float64
|
|
||||||
AverageTaskTime time.Duration
|
|
||||||
FailureRate float64
|
|
||||||
QueueDepth int
|
|
||||||
VolumeStatesSync bool
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### 7.2 Scalability Considerations
|
|
||||||
|
|
||||||
- **Horizontal Worker Scaling**: Add workers without admin server changes
|
|
||||||
- **Admin Server HA**: Master-slave admin servers for fault tolerance
|
|
||||||
- **Task Partitioning**: Partition tasks by collection or datacenter
|
|
||||||
- **Batch Operations**: Group similar tasks for efficiency
|
|
||||||
|
|
||||||
## 8. Implementation Plan
|
|
||||||
|
|
||||||
### Phase 1: Core Infrastructure
|
|
||||||
1. Admin server basic framework
|
|
||||||
2. Worker registration and heartbeat
|
|
||||||
3. Simple task assignment
|
|
||||||
4. Basic progress tracking
|
|
||||||
|
|
||||||
### Phase 2: Advanced Features
|
|
||||||
1. Volume state reconciliation
|
|
||||||
2. Sophisticated worker selection
|
|
||||||
3. Failure detection and recovery
|
|
||||||
4. Duplicate prevention
|
|
||||||
|
|
||||||
### Phase 3: Optimization & Monitoring
|
|
||||||
1. Performance metrics
|
|
||||||
2. Load balancing algorithms
|
|
||||||
3. Capacity planning integration
|
|
||||||
4. Comprehensive monitoring
|
|
||||||
|
|
||||||
This design provides a robust, scalable foundation for distributed task management in SeaweedFS while maintaining consistency with the existing architecture patterns.
|
|
||||||
@@ -1,145 +0,0 @@
|
|||||||
# SQL Query Engine Feature, Dev, and Test Plan
|
|
||||||
|
|
||||||
This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics.
|
|
||||||
|
|
||||||
## Feature Plan
|
|
||||||
|
|
||||||
**1. Goal**
|
|
||||||
|
|
||||||
To provide a SQL querying interface for SeaweedFS, enabling analytics on existing MQ topics. This enables:
|
|
||||||
- Basic querying with SELECT, WHERE, aggregations on MQ topics
|
|
||||||
- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE)
|
|
||||||
- In-place analytics on Parquet-stored messages without data movement
|
|
||||||
|
|
||||||
**2. Key Features**
|
|
||||||
|
|
||||||
* **Schema Discovery and Metadata:**
|
|
||||||
* `SHOW DATABASES` - List all MQ namespaces
|
|
||||||
* `SHOW TABLES` - List all topics in a namespace
|
|
||||||
* `DESCRIBE table_name` - Show topic schema details
|
|
||||||
* Automatic schema detection from existing Parquet data
|
|
||||||
* **Basic Query Engine:**
|
|
||||||
* `SELECT` support with `WHERE`, `LIMIT`, `OFFSET`
|
|
||||||
* Aggregation functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`
|
|
||||||
* Temporal queries with timestamp-based filtering
|
|
||||||
* **User Interfaces:**
|
|
||||||
* New CLI command `weed sql` with interactive shell mode
|
|
||||||
* Optional: Web UI for query execution and result visualization
|
|
||||||
* **Output Formats:**
|
|
||||||
* JSON (default), CSV, Parquet for result sets
|
|
||||||
* Streaming results for large queries
|
|
||||||
* Pagination support for result navigation
|
|
||||||
|
|
||||||
## Development Plan
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
**3. Data Source Integration**
|
|
||||||
|
|
||||||
* **MQ Topic Connector (Primary):**
|
|
||||||
* Build on existing `weed/mq/logstore/read_parquet_to_log.go`
|
|
||||||
* Implement efficient Parquet scanning with predicate pushdown
|
|
||||||
* Support schema evolution and backward compatibility
|
|
||||||
* Handle partition-based parallelism for scalable queries
|
|
||||||
* **Schema Registry Integration:**
|
|
||||||
* Extend `weed/mq/schema/schema.go` for SQL metadata operations
|
|
||||||
* Read existing topic schemas for query planning
|
|
||||||
* Handle schema evolution during query execution
|
|
||||||
|
|
||||||
**4. API & CLI Integration**
|
|
||||||
|
|
||||||
* **CLI Command:**
|
|
||||||
* New `weed sql` command with interactive shell mode (similar to `weed shell`)
|
|
||||||
* Support for script execution and result formatting
|
|
||||||
* Connection management for remote SeaweedFS clusters
|
|
||||||
* **gRPC API:**
|
|
||||||
* Add SQL service to existing MQ broker gRPC interface
|
|
||||||
* Enable efficient query execution with streaming results
|
|
||||||
|
|
||||||
## Example Usage Scenarios
|
|
||||||
|
|
||||||
**Scenario 1: Schema Discovery and Metadata**
|
|
||||||
```sql
|
|
||||||
-- List all namespaces (databases)
|
|
||||||
SHOW DATABASES;
|
|
||||||
|
|
||||||
-- List topics in a namespace
|
|
||||||
USE my_namespace;
|
|
||||||
SHOW TABLES;
|
|
||||||
|
|
||||||
-- View topic structure and discovered schema
|
|
||||||
DESCRIBE user_events;
|
|
||||||
```
|
|
||||||
|
|
||||||
**Scenario 2: Data Querying**
|
|
||||||
```sql
|
|
||||||
-- Basic filtering and projection
|
|
||||||
SELECT user_id, event_type, timestamp
|
|
||||||
FROM user_events
|
|
||||||
WHERE timestamp > 1640995200000
|
|
||||||
LIMIT 100;
|
|
||||||
|
|
||||||
-- Aggregation queries
|
|
||||||
SELECT COUNT(*) as event_count
|
|
||||||
FROM user_events
|
|
||||||
WHERE timestamp >= 1640995200000;
|
|
||||||
|
|
||||||
-- More aggregation examples
|
|
||||||
SELECT MAX(timestamp), MIN(timestamp)
|
|
||||||
FROM user_events;
|
|
||||||
```
|
|
||||||
|
|
||||||
**Scenario 3: Analytics & Monitoring**
|
|
||||||
```sql
|
|
||||||
-- Basic analytics
|
|
||||||
SELECT COUNT(*) as total_events
|
|
||||||
FROM user_events
|
|
||||||
WHERE timestamp >= 1640995200000;
|
|
||||||
|
|
||||||
-- Simple monitoring
|
|
||||||
SELECT AVG(response_time) as avg_response
|
|
||||||
FROM api_logs
|
|
||||||
WHERE timestamp >= 1640995200000;
|
|
||||||
|
|
||||||
## Architecture Overview
|
|
||||||
|
|
||||||
```
|
|
||||||
SQL Query Flow:
|
|
||||||
1. Parse SQL 2. Plan & Optimize 3. Execute Query
|
|
||||||
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐ ┌──────────────┐
|
|
||||||
│ Client │ │ SQL Parser │ │ Query Planner │ │ Execution │
|
|
||||||
│ (CLI) │──→ │ PostgreSQL │──→ │ & Optimizer │──→ │ Engine │
|
|
||||||
│ │ │ (Custom) │ │ │ │ │
|
|
||||||
└─────────────┘ └──────────────┘ └─────────────────┘ └──────────────┘
|
|
||||||
│ │
|
|
||||||
│ Schema Lookup │ Data Access
|
|
||||||
▼ ▼
|
|
||||||
┌─────────────────────────────────────────────────────────────┐
|
|
||||||
│ Schema Catalog │
|
|
||||||
│ • Namespace → Database mapping │
|
|
||||||
│ • Topic → Table mapping │
|
|
||||||
│ • Schema version management │
|
|
||||||
└─────────────────────────────────────────────────────────────┘
|
|
||||||
▲
|
|
||||||
│ Metadata
|
|
||||||
│
|
|
||||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
|
||||||
│ MQ Storage Layer │
|
|
||||||
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ▲ │
|
|
||||||
│ │ Topic A │ │ Topic B │ │ Topic C │ │ ... │ │ │
|
|
||||||
│ │ (Parquet) │ │ (Parquet) │ │ (Parquet) │ │ (Parquet) │ │ │
|
|
||||||
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
|
|
||||||
└──────────────────────────────────────────────────────────────────────────│──┘
|
|
||||||
│
|
|
||||||
Data Access
|
|
||||||
```
|
|
||||||
|
|
||||||
|
|
||||||
## Success Metrics
|
|
||||||
|
|
||||||
* **Feature Completeness:** Support for all specified SELECT operations and metadata commands
|
|
||||||
* **Performance:**
|
|
||||||
* **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records
|
|
||||||
* **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records
|
|
||||||
* **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows
|
|
||||||
* **Scalability:** Handle topics with millions of messages efficiently
|
|
||||||
Reference in New Issue
Block a user