seaweedFS/weed/topology/volume_growth_reservation_test.go
Chris Lu f7bd75ef3b S3: Avoid in-memory map concurrent writes in SSE-S3 key manager (#7358)
* Fix concurrent map writes in SSE-S3 key manager

This commit fixes issue #7352 where parallel uploads to SSE-S3 enabled
buckets were causing 'fatal error: concurrent map writes' crashes.

The SSES3KeyManager struct had an unsynchronized map that was being
accessed from multiple goroutines during concurrent PUT operations.

Changes:
- Added sync.RWMutex to SSES3KeyManager struct
- Protected StoreKey() with write lock
- Protected GetKey() with read lock
- Updated GetOrCreateKey() with a proper read/write locking pattern,
  including a double-check to prevent race conditions (sketched below)

All existing SSE tests pass successfully.

Fixes #7352
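
For illustration, a minimal sketch of this locking pattern, with simplified
names (the map-based design itself is replaced by envelope encryption in the
next commit; the type and helper below are illustrative, not the exact
SeaweedFS code):

package sketch

import "sync"

// SSES3Key stands in for the real key type.
type SSES3Key struct{ material []byte }

type keyManagerSketch struct {
	mu   sync.RWMutex
	keys map[string]*SSES3Key
}

func newKeyManagerSketch() *keyManagerSketch {
	return &keyManagerSketch{keys: make(map[string]*SSES3Key)}
}

func (m *keyManagerSketch) GetKey(id string) (*SSES3Key, bool) {
	m.mu.RLock() // read lock: concurrent readers may proceed
	defer m.mu.RUnlock()
	k, ok := m.keys[id]
	return k, ok
}

func (m *keyManagerSketch) GetOrCreateKey(id string, create func() *SSES3Key) *SSES3Key {
	// Fast path under the shared read lock.
	m.mu.RLock()
	if k, ok := m.keys[id]; ok {
		m.mu.RUnlock()
		return k
	}
	m.mu.RUnlock()

	// Slow path under the exclusive write lock, with a double-check:
	// another goroutine may have inserted the key between the two locks.
	m.mu.Lock()
	defer m.mu.Unlock()
	if k, ok := m.keys[id]; ok {
		return k
	}
	k := create()
	m.keys[id] = k
	return k
}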

* Improve SSE-S3 key manager with envelope encryption

Replace in-memory key storage with envelope encryption using a super key (KEK).

Instead of storing DEKs in a map, the key manager now:
- Uses a randomly generated 256-bit super key (KEK)
- Encrypts each DEK with the super key using AES-GCM
- Stores the encrypted DEK in object metadata
- Decrypts the DEK on-demand when reading objects

Benefits:
- Eliminates unbounded memory growth from caching DEKs
- Provides better security with authenticated encryption (AES-GCM)
- Follows envelope encryption best practices (similar to AWS KMS)
- No need for mutex-protected map lookups on reads
- Each object's encrypted DEK is self-contained in its metadata

This approach matches the design pattern used in the local KMS provider
and is more suitable for production use.
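
A minimal sketch of the seal/unseal step, assuming AES-GCM from the Go
standard library with a 32-byte KEK; the function names are illustrative,
not the exact SeaweedFS API:

package sketch

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// encryptDEK seals a data encryption key under the KEK with AES-GCM.
func encryptDEK(kek, dek []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek) // a 32-byte kek selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so the sealed blob is self-contained in metadata.
	return gcm.Seal(nonce, nonce, dek, nil), nil
}

// decryptDEK recovers the DEK; AES-GCM authenticates the ciphertext,
// so tampering is detected.
func decryptDEK(kek, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed DEK too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}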

* Persist SSE-S3 KEK in filer for multi-server support

Store the SSE-S3 super key (KEK) in the filer at /.seaweedfs/s3/kek
instead of generating it per-server. This ensures:

1. **Multi-server consistency**: All S3 API servers use the same KEK
2. **Persistence across restarts**: KEK survives server restarts
3. **Centralized management**: KEK stored in filer, accessible to all servers
4. **Automatic initialization**: KEK is created on first startup if it doesn't exist

The KEK is:
- Stored as hex-encoded bytes in filer
- Protected with file mode 0600 (read/write for owner only)
- Located in /.seaweedfs/s3/ directory (mode 0700)
- Loaded on S3 API server startup
- Reused across all S3 API server instances

This matches the architecture of centralized configuration in SeaweedFS
and enables proper SSE-S3 support in multi-server deployments.
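
Roughly, the bootstrap flow looks like this, with the filer client calls
abstracted as parameters (the load/save helpers are assumptions, and the
stricter error classification comes in a later commit below):

package sketch

import (
	"crypto/rand"
	"encoding/hex"
	"os"
)

// loadOrCreateKEK reuses the persisted KEK if present, otherwise generates
// a 256-bit key and stores it hex-encoded with mode 0600.
func loadOrCreateKEK(
	load func(path string) ([]byte, error),
	save func(path string, data []byte, mode os.FileMode) error,
) ([]byte, error) {
	const kekPath = "/.seaweedfs/s3/kek" // path as of this commit; moved later

	if data, err := load(kekPath); err == nil {
		return hex.DecodeString(string(data)) // KEK is stored hex-encoded
	}
	kek := make([]byte, 32) // 256-bit super key
	if _, err := rand.Read(kek); err != nil {
		return nil, err
	}
	if err := save(kekPath, []byte(hex.EncodeToString(kek)), 0600); err != nil {
		return nil, err
	}
	return kek, nil
}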

* Change KEK storage location to /etc/s3/kek

Move SSE-S3 KEK from /.seaweedfs/s3/kek to /etc/s3/kek for better
organization and consistency with other SeaweedFS configuration files.

The /etc directory is the standard location for configuration files
in SeaweedFS.

* use global sse-s3 key manager when copying

* Update volume_growth_reservation_test.go

* Rename KEK file to sse_kek for clarity

Changed /etc/s3/kek to /etc/s3/sse_kek to make it clear this key
is specifically for SSE-S3 encryption, not for other KMS purposes.

This improves clarity and avoids potential confusion with the
separate KMS provider system used for SSE-KMS.

* Use constants for SSE-S3 KEK directory and file name

Refactored to use named constants instead of string literals:
- SSES3KEKDirectory = "/etc/s3"
- SSES3KEKParentDir = "/etc"
- SSES3KEKDirName = "s3"
- SSES3KEKFileName = "sse_kek"

This improves maintainability and makes it easier to change
the storage location if needed in the future.
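
In Go form (the path composition shown is an assumption about how the
constants are combined):

const (
	SSES3KEKDirectory = "/etc/s3"
	SSES3KEKParentDir = "/etc"
	SSES3KEKDirName   = "s3"
	SSES3KEKFileName  = "sse_kek"
)

// e.g. SSES3KEKDirectory + "/" + SSES3KEKFileName == "/etc/s3/sse_kek"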

* Address PR review: Improve error handling and robustness

Addresses review comments from https://github.com/seaweedfs/seaweedfs/pull/7358#pullrequestreview-3367476264

Critical fixes:
1. Distinguish between 'not found' and other errors when loading KEK
   - Only generate a new KEK when the error is ErrNotFound
   - Fail fast on connectivity/permission errors to prevent data loss
   - Prevents creating a new KEK that would make existing data undecryptable

2. Make SSE-S3 initialization failure fatal
   - Return error instead of warning when initialization fails
   - Prevents server from running in broken state

3. Improve directory creation error handling
   - Only ignore 'file exists' errors
   - Fail on permission/connectivity errors

These changes ensure the SSE-S3 key manager is robust against
transient errors and prevents accidental data loss.
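
A sketch of the 'not found' vs. hard-failure distinction; ErrNotFound and
the load/generate hooks here are stand-ins for the real filer client API:

package sketch

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// ErrNotFound stands in for the filer's "entry not found" sentinel error.
var ErrNotFound = errors.New("entry not found")

func loadKEK(
	load func(path string) ([]byte, error),
	generate func() ([]byte, error),
) ([]byte, error) {
	data, err := load("/etc/s3/sse_kek")
	switch {
	case err == nil:
		return hex.DecodeString(string(data))
	case errors.Is(err, ErrNotFound):
		// First startup: safe to mint and persist a fresh KEK.
		return generate()
	default:
		// Connectivity or permission failure: fail fast. Minting a new KEK
		// here would leave existing objects undecryptable.
		return nil, fmt.Errorf("load SSE-S3 KEK: %w", err)
	}
}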

* Fix KEK path conflict with /etc/s3 file

Changed KEK storage from /etc/s3/sse_kek to /etc/seaweedfs/s3_sse_kek
to avoid conflict with the circuit breaker config at /etc/s3.

The /etc/s3 path is used by CircuitBreakerConfigDir and may exist as
a file (circuit_breaker.json), causing the error:
  'CreateEntry /etc/s3/sse_kek: /etc/s3 should be a directory'

New KEK location: /etc/seaweedfs/s3_sse_kek
This uses the seaweedfs subdirectory which is more appropriate
for internal SeaweedFS configuration files.

Fixes startup failure when /etc/s3 exists as a file.

* Revert KEK path back to /etc/s3/sse_kek

Changed back from /etc/seaweedfs/s3_sse_kek to /etc/s3/sse_kek
as requested. The /etc/s3 directory will be created properly
when it doesn't exist.

* Fix directory creation with proper ModeDir flag

Set FileMode to uint32(0755 | os.ModeDir) when creating /etc/s3 directory
to ensure it's created as a directory, not a file.

Without the os.ModeDir flag, the entry was being created as a file,
which caused the error 'CreateEntry: /etc/s3 is a file' when trying
to create the KEK file inside it.

Uses 0755 permissions (rwxr-xr-x) for the directory and adds the os import
for the os.ModeDir constant.
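
As a sketch, assuming the filer_pb entry shape (fields abbreviated to the
ones that matter here):

package sketch

import (
	"os"

	"github.com/seaweedfs/seaweedfs/weed/pb/filer_pb"
)

// newS3DirEntry builds the /etc/s3 directory entry. Without os.ModeDir in
// FileMode the entry is recorded as a plain file, which is what produced
// the "CreateEntry: /etc/s3 is a file" failure.
func newS3DirEntry() *filer_pb.Entry {
	return &filer_pb.Entry{
		Name:        "s3",
		IsDirectory: true,
		Attributes: &filer_pb.FuseAttributes{
			FileMode: uint32(0755 | os.ModeDir), // rwxr-xr-x + directory bit
		},
	}
}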
2025-10-22 14:12:31 -07:00


package topology

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/sequence"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
	"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
	"github.com/seaweedfs/seaweedfs/weed/storage/types"
)

// MockGrpcDialOption simulates grpc connection for testing
type MockGrpcDialOption struct{}

// simulateVolumeAllocation mocks the volume allocation process
func simulateVolumeAllocation(server *DataNode, vid needle.VolumeId, option *VolumeGrowOption) error {
	// Simulate some processing time
	time.Sleep(time.Millisecond * 10)
	return nil
}

func TestVolumeGrowth_ReservationBasedAllocation(t *testing.T) {
	// Create test topology with single server for predictable behavior
	topo := NewTopology("weedfs", sequence.NewMemorySequencer(), 32*1024, 5, false)

	// Create data center and rack
	dc := NewDataCenter("dc1")
	topo.LinkChildNode(dc)
	rack := NewRack("rack1")
	dc.LinkChildNode(rack)

	// Create single data node with limited capacity
	dn := NewDataNode("server1")
	rack.LinkChildNode(dn)

	// Set up disk with limited capacity (only 5 volumes)
	disk := NewDisk(types.HardDriveType.String())
	disk.diskUsages.getOrCreateDisk(types.HardDriveType).maxVolumeCount = 5
	dn.LinkChildNode(disk)

	// Test volume growth with reservation
	vg := NewDefaultVolumeGrowth()
	rp, _ := super_block.NewReplicaPlacementFromString("000") // single copy (no replicas)
	option := &VolumeGrowOption{
		Collection:       "test",
		ReplicaPlacement: rp,
		DiskType:         types.HardDriveType,
	}

	// Try to create volumes and verify reservations work
	for i := 0; i < 5; i++ {
		servers, reservation, err := vg.findEmptySlotsForOneVolume(topo, option, true)
		if err != nil {
			t.Errorf("Failed to find slots with reservation on iteration %d: %v", i, err)
			continue
		}
		if len(servers) != 1 {
			t.Errorf("Expected 1 server for replica placement 000, got %d", len(servers))
		}
		if len(reservation.reservationIds) != 1 {
			t.Errorf("Expected 1 reservation ID, got %d", len(reservation.reservationIds))
		}

		// Verify the reservation is on our expected server
		server := servers[0]
		if server != dn {
			t.Errorf("Expected volume to be allocated on server1, got %s", server.Id())
		}

		// Check available space before and after reservation
		availableBeforeCreation := server.AvailableSpaceFor(option)
		expectedBefore := int64(5 - i)
		if availableBeforeCreation != expectedBefore {
			t.Errorf("Iteration %d: Expected %d base available space, got %d", i, expectedBefore, availableBeforeCreation)
		}

		// Simulate successful volume creation.
		// Must acquire lock before accessing children map to prevent race condition.
		dn.Lock()
		disk := dn.children[NodeId(types.HardDriveType.String())].(*Disk)
		deltaDiskUsage := &DiskUsageCounts{
			volumeCount: 1,
		}
		disk.UpAdjustDiskUsageDelta(types.HardDriveType, deltaDiskUsage)
		dn.Unlock()

		// Release reservation after successful creation
		reservation.releaseAllReservations()

		// Verify available space after creation
		availableAfterCreation := server.AvailableSpaceFor(option)
		expectedAfter := int64(5 - i - 1)
		if availableAfterCreation != expectedAfter {
			t.Errorf("Iteration %d: Expected %d available space after creation, got %d", i, expectedAfter, availableAfterCreation)
		}
	}

	// After 5 volumes, should have no more capacity
	_, _, err := vg.findEmptySlotsForOneVolume(topo, option, true)
	if err == nil {
		t.Error("Expected volume allocation to fail when server is at capacity")
	}
}

func TestVolumeGrowth_ConcurrentAllocationPreventsRaceCondition(t *testing.T) {
	// Create test topology with very limited capacity
	topo := NewTopology("weedfs", sequence.NewMemorySequencer(), 32*1024, 5, false)
	dc := NewDataCenter("dc1")
	topo.LinkChildNode(dc)
	rack := NewRack("rack1")
	dc.LinkChildNode(rack)

	// Single data node with capacity for only 5 volumes
	dn := NewDataNode("server1")
	rack.LinkChildNode(dn)
	disk := NewDisk(types.HardDriveType.String())
	disk.diskUsages.getOrCreateDisk(types.HardDriveType).maxVolumeCount = 5
	dn.LinkChildNode(disk)

	vg := NewDefaultVolumeGrowth()
	rp, _ := super_block.NewReplicaPlacementFromString("000") // single copy (no replicas)
	option := &VolumeGrowOption{
		Collection:       "test",
		ReplicaPlacement: rp,
		DiskType:         types.HardDriveType,
	}

	// Simulate concurrent volume creation attempts
	const concurrentRequests = 10
	var wg sync.WaitGroup
	var successCount, failureCount atomic.Int32

	for i := 0; i < concurrentRequests; i++ {
		wg.Add(1)
		go func(requestId int) {
			defer wg.Done()
			_, reservation, err := vg.findEmptySlotsForOneVolume(topo, option, true)
			if err != nil {
				failureCount.Add(1)
				t.Logf("Request %d failed as expected: %v", requestId, err)
			} else {
				successCount.Add(1)
				t.Logf("Request %d succeeded, got reservation", requestId)
				// Simulate completion: increment volume count BEFORE releasing reservation
				if reservation != nil {
					// First, increment the volume count to reflect the created volume.
					// Must acquire lock before accessing children map to prevent race condition.
					dn.Lock()
					disk := dn.children[NodeId(types.HardDriveType.String())].(*Disk)
					deltaDiskUsage := &DiskUsageCounts{
						volumeCount: 1,
					}
					disk.UpAdjustDiskUsageDelta(types.HardDriveType, deltaDiskUsage)
					dn.Unlock()
					// Then release the reservation
					reservation.releaseAllReservations()
				}
			}
		}(i)
	}
	wg.Wait()

	// With the reservation system, only 5 requests should succeed (capacity limit);
	// the rest should fail due to insufficient capacity.
	if successCount.Load() != 5 {
		t.Errorf("Expected exactly 5 successful reservations, got %d", successCount.Load())
	}
	if failureCount.Load() != 5 {
		t.Errorf("Expected exactly 5 failed reservations, got %d", failureCount.Load())
	}

	// Verify final state
	finalAvailable := dn.AvailableSpaceFor(option)
	if finalAvailable != 0 {
		t.Errorf("Expected 0 available space after all allocations, got %d", finalAvailable)
	}
	t.Logf("Concurrent test completed: %d successes, %d failures", successCount.Load(), failureCount.Load())
}

func TestVolumeGrowth_ReservationFailureRollback(t *testing.T) {
	// Create topology with multiple servers, but limited total capacity
	topo := NewTopology("weedfs", sequence.NewMemorySequencer(), 32*1024, 5, false)
	dc := NewDataCenter("dc1")
	topo.LinkChildNode(dc)
	rack := NewRack("rack1")
	dc.LinkChildNode(rack)

	// Create two servers with different available capacity
	dn1 := NewDataNode("server1")
	dn2 := NewDataNode("server2")
	rack.LinkChildNode(dn1)
	rack.LinkChildNode(dn2)

	// Server 1: 5 available slots
	disk1 := NewDisk(types.HardDriveType.String())
	disk1.diskUsages.getOrCreateDisk(types.HardDriveType).maxVolumeCount = 5
	dn1.LinkChildNode(disk1)

	// Server 2: 0 available slots (full)
	disk2 := NewDisk(types.HardDriveType.String())
	diskUsage2 := disk2.diskUsages.getOrCreateDisk(types.HardDriveType)
	diskUsage2.maxVolumeCount = 5
	diskUsage2.volumeCount = 5
	dn2.LinkChildNode(disk2)

	vg := NewDefaultVolumeGrowth()
	rp, _ := super_block.NewReplicaPlacementFromString("010") // requires 2 replicas
	option := &VolumeGrowOption{
		Collection:       "test",
		ReplicaPlacement: rp,
		DiskType:         types.HardDriveType,
	}

	// This should fail because we can't satisfy replica requirements
	// (need 2 servers but only 1 has space).
	_, _, err := vg.findEmptySlotsForOneVolume(topo, option, true)
	if err == nil {
		t.Error("Expected reservation to fail due to insufficient replica capacity")
	}

	// Verify no reservations are left hanging
	available1 := dn1.AvailableSpaceForReservation(option)
	if available1 != 5 {
		t.Errorf("Expected server1 to have all capacity available after failed reservation, got %d", available1)
	}
	available2 := dn2.AvailableSpaceForReservation(option)
	if available2 != 0 {
		t.Errorf("Expected server2 to have no capacity available, got %d", available2)
	}
}

func TestVolumeGrowth_ReservationTimeout(t *testing.T) {
	dn := NewDataNode("server1")
	diskType := types.HardDriveType

	// Set up capacity
	diskUsage := dn.diskUsages.getOrCreateDisk(diskType)
	diskUsage.maxVolumeCount = 5

	// Create a reservation
	reservationId, success := dn.TryReserveCapacity(diskType, 2)
	if !success {
		t.Fatal("Expected successful reservation")
	}

	// Manually backdate the reservation to simulate an old, expired reservation
	dn.capacityReservations.Lock()
	if reservation, exists := dn.capacityReservations.reservations[reservationId]; exists {
		reservation.createdAt = time.Now().Add(-10 * time.Minute)
	}
	dn.capacityReservations.Unlock()

	// Try another reservation - this should trigger cleanup and succeed
	_, success = dn.TryReserveCapacity(diskType, 3)
	if !success {
		t.Error("Expected reservation to succeed after cleanup of expired reservation")
	}

	// The original reservation should be cleaned up
	option := &VolumeGrowOption{DiskType: diskType}
	available := dn.AvailableSpaceForReservation(option)
	if available != 2 { // 5 - 3 = 2
		t.Errorf("Expected 2 available slots after cleanup and new reservation, got %d", available)
	}
}