* Fix concurrent map writes in SSE-S3 key manager

  Fixes issue #7352, where parallel uploads to SSE-S3 enabled buckets caused 'fatal error: concurrent map writes' crashes. The SSES3KeyManager struct had an unsynchronized map that was accessed from multiple goroutines during concurrent PUT operations.

  Changes:
  - Added a sync.RWMutex to the SSES3KeyManager struct
  - Protected StoreKey() with the write lock
  - Protected GetKey() with the read lock
  - Updated GetOrCreateKey() with a proper read/write locking pattern, including a double-check to prevent race conditions

  All existing SSE tests pass. Fixes #7352

* Improve SSE-S3 key manager with envelope encryption

  Replace in-memory key storage with envelope encryption using a super key (KEK). Instead of storing DEKs in a map, the key manager now:
  - Uses a randomly generated 256-bit super key (KEK)
  - Encrypts each DEK with the super key using AES-GCM
  - Stores the encrypted DEK in object metadata
  - Decrypts the DEK on demand when reading objects

  Benefits:
  - Eliminates unbounded memory growth from caching DEKs
  - Provides better security with authenticated encryption (AES-GCM)
  - Follows envelope encryption best practices (similar to AWS KMS)
  - No need for mutex-protected map lookups on reads
  - Each object's encrypted DEK is self-contained in its metadata

  This approach matches the design pattern used in the local KMS provider and is more suitable for production use.

* Persist SSE-S3 KEK in filer for multi-server support

  Store the SSE-S3 super key (KEK) in the filer at /.seaweedfs/s3/kek instead of generating it per server. This ensures:
  1. **Multi-server consistency**: All S3 API servers use the same KEK
  2. **Persistence across restarts**: The KEK survives server restarts
  3. **Centralized management**: The KEK is stored in the filer, accessible to all servers
  4.
**Automatic initialization**: The KEK is created on first startup if it doesn't exist

  The KEK is:
  - Stored as hex-encoded bytes in the filer
  - Protected with file mode 0600 (read/write for owner only)
  - Located in the /.seaweedfs/s3/ directory (mode 0700)
  - Loaded on S3 API server startup
  - Reused across all S3 API server instances

  This matches SeaweedFS's architecture of centralized configuration and enables proper SSE-S3 support in multi-server deployments.

* Change KEK storage location to /etc/s3/kek

  Move the SSE-S3 KEK from /.seaweedfs/s3/kek to /etc/s3/kek for better organization and consistency with other SeaweedFS configuration files. The /etc directory is the standard location for configuration files in SeaweedFS.

* Use the global SSE-S3 key manager when copying

* Update volume_growth_reservation_test.go

* Rename KEK file to sse_kek for clarity

  Changed /etc/s3/kek to /etc/s3/sse_kek to make it clear that this key is specifically for SSE-S3 encryption, not for other KMS purposes. This avoids potential confusion with the separate KMS provider system used for SSE-KMS.

* Use constants for SSE-S3 KEK directory and file name

  Refactored to use named constants instead of string literals:
  - SSES3KEKDirectory = "/etc/s3"
  - SSES3KEKParentDir = "/etc"
  - SSES3KEKDirName = "s3"
  - SSES3KEKFileName = "sse_kek"

  This improves maintainability and makes it easier to change the storage location in the future.

* Address PR review: improve error handling and robustness

  Addresses review comments from https://github.com/seaweedfs/seaweedfs/pull/7358#pullrequestreview-3367476264

  Critical fixes:
  1. Distinguish between 'not found' and other errors when loading the KEK
     - Only generate a new KEK on ErrNotFound
     - Fail fast on connectivity/permission errors
     - Prevents creating a new KEK that would make existing data undecryptable
  2.
Make SSE-S3 initialization failure fatal
     - Return an error instead of a warning when initialization fails
     - Prevents the server from running in a broken state
  3. Improve directory creation error handling
     - Only ignore 'file exists' errors
     - Fail on permission/connectivity errors

  These changes make the SSE-S3 key manager robust against transient errors and prevent accidental data loss.

* Fix KEK path conflict with /etc/s3 file

  Changed KEK storage from /etc/s3/sse_kek to /etc/seaweedfs/s3_sse_kek to avoid a conflict with the circuit breaker config at /etc/s3. The /etc/s3 path is used by CircuitBreakerConfigDir and may exist as a file (circuit_breaker.json), causing the error 'CreateEntry /etc/s3/sse_kek: /etc/s3 should be a directory'.

  New KEK location: /etc/seaweedfs/s3_sse_kek

  This uses the seaweedfs subdirectory, which is more appropriate for internal SeaweedFS configuration files. Fixes the startup failure when /etc/s3 exists as a file.

* Revert KEK path back to /etc/s3/sse_kek

  Changed back from /etc/seaweedfs/s3_sse_kek to /etc/s3/sse_kek as requested. The /etc/s3 directory is now created properly when it doesn't exist.

* Fix directory creation with proper ModeDir flag

  Set FileMode to uint32(0755 | os.ModeDir) when creating the /etc/s3 directory so that it is created as a directory, not a file. Without the os.ModeDir flag, the entry was created as a file, which caused the error 'CreateEntry: /etc/s3 is a file' when trying to create the KEK file inside it. Uses 0755 permissions (rwxr-xr-x) for the directory and adds an os import for the os.ModeDir constant.
285 lines · 9.0 KiB · Go
package topology

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/sequence"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
	"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
	"github.com/seaweedfs/seaweedfs/weed/storage/types"
)

// MockGrpcDialOption simulates grpc connection for testing
type MockGrpcDialOption struct{}

// simulateVolumeAllocation mocks the volume allocation process
func simulateVolumeAllocation(server *DataNode, vid needle.VolumeId, option *VolumeGrowOption) error {
	// Simulate some processing time
	time.Sleep(time.Millisecond * 10)
	return nil
}

func TestVolumeGrowth_ReservationBasedAllocation(t *testing.T) {
	// Create test topology with single server for predictable behavior
	topo := NewTopology("weedfs", sequence.NewMemorySequencer(), 32*1024, 5, false)

	// Create data center and rack
	dc := NewDataCenter("dc1")
	topo.LinkChildNode(dc)
	rack := NewRack("rack1")
	dc.LinkChildNode(rack)

	// Create single data node with limited capacity
	dn := NewDataNode("server1")
	rack.LinkChildNode(dn)

	// Set up disk with limited capacity (only 5 volumes)
	disk := NewDisk(types.HardDriveType.String())
	disk.diskUsages.getOrCreateDisk(types.HardDriveType).maxVolumeCount = 5
	dn.LinkChildNode(disk)

	// Test volume growth with reservation
	vg := NewDefaultVolumeGrowth()
	rp, _ := super_block.NewReplicaPlacementFromString("000") // Single copy (no replicas)

	option := &VolumeGrowOption{
		Collection:       "test",
		ReplicaPlacement: rp,
		DiskType:         types.HardDriveType,
	}

	// Try to create volumes and verify reservations work
	for i := 0; i < 5; i++ {
		servers, reservation, err := vg.findEmptySlotsForOneVolume(topo, option, true)
		if err != nil {
			t.Errorf("Failed to find slots with reservation on iteration %d: %v", i, err)
			continue
		}

		if len(servers) != 1 {
			t.Errorf("Expected 1 server for replica placement 000, got %d", len(servers))
		}

		if len(reservation.reservationIds) != 1 {
			t.Errorf("Expected 1 reservation ID, got %d", len(reservation.reservationIds))
		}

		// Verify the reservation is on our expected server
		server := servers[0]
		if server != dn {
			t.Errorf("Expected volume to be allocated on server1, got %s", server.Id())
		}

		// Check available space before and after reservation
		availableBeforeCreation := server.AvailableSpaceFor(option)
		expectedBefore := int64(5 - i)
		if availableBeforeCreation != expectedBefore {
			t.Errorf("Iteration %d: Expected %d base available space, got %d", i, expectedBefore, availableBeforeCreation)
		}

		// Simulate successful volume creation
		// Must acquire lock before accessing children map to prevent race condition
		dn.Lock()
		disk := dn.children[NodeId(types.HardDriveType.String())].(*Disk)
		deltaDiskUsage := &DiskUsageCounts{
			volumeCount: 1,
		}
		disk.UpAdjustDiskUsageDelta(types.HardDriveType, deltaDiskUsage)
		dn.Unlock()

		// Release reservation after successful creation
		reservation.releaseAllReservations()

		// Verify available space after creation
		availableAfterCreation := server.AvailableSpaceFor(option)
		expectedAfter := int64(5 - i - 1)
		if availableAfterCreation != expectedAfter {
			t.Errorf("Iteration %d: Expected %d available space after creation, got %d", i, expectedAfter, availableAfterCreation)
		}
	}

	// After 5 volumes, should have no more capacity
	_, _, err := vg.findEmptySlotsForOneVolume(topo, option, true)
	if err == nil {
		t.Error("Expected volume allocation to fail when server is at capacity")
	}
}

func TestVolumeGrowth_ConcurrentAllocationPreventsRaceCondition(t *testing.T) {
	// Create test topology with very limited capacity
	topo := NewTopology("weedfs", sequence.NewMemorySequencer(), 32*1024, 5, false)

	dc := NewDataCenter("dc1")
	topo.LinkChildNode(dc)
	rack := NewRack("rack1")
	dc.LinkChildNode(rack)

	// Single data node with capacity for only 5 volumes
	dn := NewDataNode("server1")
	rack.LinkChildNode(dn)

	disk := NewDisk(types.HardDriveType.String())
	disk.diskUsages.getOrCreateDisk(types.HardDriveType).maxVolumeCount = 5
	dn.LinkChildNode(disk)

	vg := NewDefaultVolumeGrowth()
	rp, _ := super_block.NewReplicaPlacementFromString("000") // Single copy (no replicas)

	option := &VolumeGrowOption{
		Collection:       "test",
		ReplicaPlacement: rp,
		DiskType:         types.HardDriveType,
	}

	// Simulate concurrent volume creation attempts
	const concurrentRequests = 10
	var wg sync.WaitGroup
	var successCount, failureCount atomic.Int32

	for i := 0; i < concurrentRequests; i++ {
		wg.Add(1)
		go func(requestId int) {
			defer wg.Done()

			_, reservation, err := vg.findEmptySlotsForOneVolume(topo, option, true)

			if err != nil {
				failureCount.Add(1)
				t.Logf("Request %d failed as expected: %v", requestId, err)
			} else {
				successCount.Add(1)
				t.Logf("Request %d succeeded, got reservation", requestId)

				// Simulate completion: increment volume count BEFORE releasing reservation
				if reservation != nil {
					// First, increment the volume count to reflect the created volume
					// Must acquire lock before accessing children map to prevent race condition
					dn.Lock()
					disk := dn.children[NodeId(types.HardDriveType.String())].(*Disk)
					deltaDiskUsage := &DiskUsageCounts{
						volumeCount: 1,
					}
					disk.UpAdjustDiskUsageDelta(types.HardDriveType, deltaDiskUsage)
					dn.Unlock()

					// Then release the reservation
					reservation.releaseAllReservations()
				}
			}
		}(i)
	}

	wg.Wait()

	// With reservation system, only 5 requests should succeed (capacity limit)
	// The rest should fail due to insufficient capacity
	if successCount.Load() != 5 {
		t.Errorf("Expected exactly 5 successful reservations, got %d", successCount.Load())
	}

	if failureCount.Load() != 5 {
		t.Errorf("Expected exactly 5 failed reservations, got %d", failureCount.Load())
	}

	// Verify final state
	finalAvailable := dn.AvailableSpaceFor(option)
	if finalAvailable != 0 {
		t.Errorf("Expected 0 available space after all allocations, got %d", finalAvailable)
	}

	t.Logf("Concurrent test completed: %d successes, %d failures", successCount.Load(), failureCount.Load())
}

func TestVolumeGrowth_ReservationFailureRollback(t *testing.T) {
	// Create topology with multiple servers, but limited total capacity
	topo := NewTopology("weedfs", sequence.NewMemorySequencer(), 32*1024, 5, false)

	dc := NewDataCenter("dc1")
	topo.LinkChildNode(dc)
	rack := NewRack("rack1")
	dc.LinkChildNode(rack)

	// Create two servers with different available capacity
	dn1 := NewDataNode("server1")
	dn2 := NewDataNode("server2")
	rack.LinkChildNode(dn1)
	rack.LinkChildNode(dn2)

	// Server 1: 5 available slots
	disk1 := NewDisk(types.HardDriveType.String())
	disk1.diskUsages.getOrCreateDisk(types.HardDriveType).maxVolumeCount = 5
	dn1.LinkChildNode(disk1)

	// Server 2: 0 available slots (full)
	disk2 := NewDisk(types.HardDriveType.String())
	diskUsage2 := disk2.diskUsages.getOrCreateDisk(types.HardDriveType)
	diskUsage2.maxVolumeCount = 5
	diskUsage2.volumeCount = 5
	dn2.LinkChildNode(disk2)

	vg := NewDefaultVolumeGrowth()
	rp, _ := super_block.NewReplicaPlacementFromString("010") // requires 2 replicas

	option := &VolumeGrowOption{
		Collection:       "test",
		ReplicaPlacement: rp,
		DiskType:         types.HardDriveType,
	}

	// This should fail because we can't satisfy replica requirements
	// (need 2 servers but only 1 has space)
	_, _, err := vg.findEmptySlotsForOneVolume(topo, option, true)
	if err == nil {
		t.Error("Expected reservation to fail due to insufficient replica capacity")
	}

	// Verify no reservations are left hanging
	available1 := dn1.AvailableSpaceForReservation(option)
	if available1 != 5 {
		t.Errorf("Expected server1 to have all capacity available after failed reservation, got %d", available1)
	}

	available2 := dn2.AvailableSpaceForReservation(option)
	if available2 != 0 {
		t.Errorf("Expected server2 to have no capacity available, got %d", available2)
	}
}

func TestVolumeGrowth_ReservationTimeout(t *testing.T) {
	dn := NewDataNode("server1")
	diskType := types.HardDriveType

	// Set up capacity
	diskUsage := dn.diskUsages.getOrCreateDisk(diskType)
	diskUsage.maxVolumeCount = 5

	// Create a reservation
	reservationId, success := dn.TryReserveCapacity(diskType, 2)
	if !success {
		t.Fatal("Expected successful reservation")
	}

	// Manually set the reservation time to simulate old reservation
	dn.capacityReservations.Lock()
	if reservation, exists := dn.capacityReservations.reservations[reservationId]; exists {
		reservation.createdAt = time.Now().Add(-10 * time.Minute)
	}
	dn.capacityReservations.Unlock()

	// Try another reservation - this should trigger cleanup and succeed
	_, success = dn.TryReserveCapacity(diskType, 3)
	if !success {
		t.Error("Expected reservation to succeed after cleanup of expired reservation")
	}

	// Original reservation should be cleaned up
	option := &VolumeGrowOption{DiskType: diskType}
	available := dn.AvailableSpaceForReservation(option)
	if available != 2 { // 5 - 3 = 2
		t.Errorf("Expected 2 available slots after cleanup and new reservation, got %d", available)
	}
}