Files
seaweedFS/weed/s3api/s3api_object_handlers_delete.go
Chris Lu 39ba19eea6 filer: async empty folder cleanup via metadata events (#7614)
* filer: async empty folder cleanup via metadata events

Implements asynchronous empty folder cleanup when files are deleted in S3.

Key changes:

1. EmptyFolderCleaner - New component that handles folder cleanup:
   - Uses consistent hashing (LockRing) to determine folder ownership
   - Each filer owns specific folders, avoiding duplicate cleanup work
   - Debounces delete events (10s delay) to batch multiple deletes
   - Caches rough folder counts to skip unnecessary checks
   - Cancels pending cleanup when new files are created
   - Handles both file and subdirectory deletions

2. Integration with metadata events:
   - Listens to both local and remote filer metadata events
   - Processes create/delete/rename events to track folder state
   - Only processes folders under /buckets/<bucket>/...

3. Removed synchronous empty folder cleanup from S3 handlers:
   - DeleteObjectHandler no longer calls DoDeleteEmptyParentDirectories
   - DeleteMultipleObjectsHandler no longer tracks/cleans directories
   - Cleanup now happens asynchronously via metadata events

Benefits:
- Non-blocking: S3 delete requests return immediately
- Coordinated: Only one filer (the owner) cleans each folder
- Efficient: Batching and caching reduce unnecessary checks
- Event-driven: Folder deletion triggers parent folder check automatically

* filer: add CleanupQueue data structure for deduplicated folder cleanup

CleanupQueue uses a linked list for FIFO ordering and a hashmap for O(1)
deduplication. Processing is triggered when:
- Queue size reaches maxSize (default 1000), OR
- Oldest item exceeds maxAge (default 10 minutes)

Key features:
- O(1) Add, Remove, Pop, Contains operations
- Duplicate folders are ignored (keeps original position/time)
- Testable with injectable time function
- Thread-safe with mutex protection

* filer: use CleanupQueue for empty folder cleanup

Replace timer-per-folder approach with queue-based processing:
- Use CleanupQueue for deduplication and ordered processing
- Process queue when full (1000 items) or oldest item exceeds 10 minutes
- Background processor checks queue every 10 seconds
- Remove from queue on create events to cancel pending cleanup

Benefits:
- Bounded memory: queue has max size, not unlimited timers
- Efficient: O(1) add/remove/contains operations
- Batch processing: handle many folders efficiently
- Better for high-volume delete scenarios

* filer: CleanupQueue.Add moves duplicate to back with updated time

When adding a folder that already exists in the queue:
- Remove it from its current position
- Add it to the back of the queue
- Update the queue time to current time

This ensures that folders with recent delete activity are processed
later, giving more time for additional deletes to occur.

* filer: CleanupQueue uses event time and inserts in sorted order

Changes:
- Add() now takes eventTime parameter instead of using current time
- Insert items in time-sorted order (oldest at front) to handle out-of-order events
- When updating duplicate with newer time, reposition to maintain sort order
- Ignore updates with older time (keep existing later time)

This ensures proper ordering when processing events from distributed filers
where event arrival order may not match event occurrence order.

* filer: remove unused CleanupQueue functions (SetNowFunc, GetAll)

Removed test-only functions:
- SetNowFunc: tests now use real time with past event times
- GetAll: tests now use Pop() to verify order

Kept functions used in production:
- Peek: used in filer_notify_read.go
- OldestAge: used in empty_folder_cleaner.go logging

* filer: initialize cache entry on first delete/create event

Previously, roughCount was only updated if the cache entry already
existed, but entries were only created during executeCleanup. This
meant delete/create events before the first cleanup didn't track
the count.

Now create the cache entry on first event, so roughCount properly
tracks all changes from the start.

* filer: skip adding to cleanup queue if roughCount > 0

If the cached roughCount indicates there are still items in the
folder, don't bother adding it to the cleanup queue. This avoids
unnecessary queue entries and reduces wasted cleanup checks.

* filer: don't create cache entry on create event

Only update roughCount if the folder is already being tracked.
New folders don't need tracking until we see a delete event.

* filer: move empty folder cleanup to its own package

- Created weed/filer/empty_folder_cleanup package
- Defined FilerOperations interface to break circular dependency
- Added CountDirectoryEntries method to Filer
- Exported IsUnderPath and IsUnderBucketPath helper functions

* filer: make isUnderPath and isUnderBucketPath private

These helpers are only used within the empty_folder_cleanup package.
2025-12-03 21:12:19 -08:00

375 lines
13 KiB
Go

package s3api
import (
"encoding/xml"
"fmt"
"io"
"net/http"
"strings"
"github.com/seaweedfs/seaweedfs/weed/filer"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/filer_pb"
"github.com/seaweedfs/seaweedfs/weed/s3api/s3_constants"
"github.com/seaweedfs/seaweedfs/weed/s3api/s3err"
stats_collect "github.com/seaweedfs/seaweedfs/weed/stats"
"github.com/seaweedfs/seaweedfs/weed/util"
)
const (
deleteMultipleObjectsLimit = 1000
)
func (s3a *S3ApiServer) DeleteObjectHandler(w http.ResponseWriter, r *http.Request) {
bucket, object := s3_constants.GetBucketAndObject(r)
glog.V(3).Infof("DeleteObjectHandler %s %s", bucket, object)
// Check for specific version ID in query parameters
versionId := r.URL.Query().Get("versionId")
// Get detailed versioning state for proper handling of suspended vs enabled versioning
versioningState, err := s3a.getVersioningState(bucket)
if err != nil {
if err == filer_pb.ErrNotFound {
s3err.WriteErrorResponse(w, r, s3err.ErrNoSuchBucket)
return
}
glog.Errorf("Error checking versioning status for bucket %s: %v", bucket, err)
s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
return
}
versioningEnabled := (versioningState == s3_constants.VersioningEnabled)
versioningSuspended := (versioningState == s3_constants.VersioningSuspended)
versioningConfigured := (versioningState != "")
var auditLog *s3err.AccessLog
if s3err.Logger != nil {
auditLog = s3err.GetAccessLog(r, http.StatusNoContent, s3err.ErrNone)
}
if versioningConfigured {
// Handle versioned delete based on specific versioning state
if versionId != "" {
// Delete specific version (same for both enabled and suspended)
// Check object lock permissions before deleting specific version
governanceBypassAllowed := s3a.evaluateGovernanceBypassRequest(r, bucket, object)
if err := s3a.enforceObjectLockProtections(r, bucket, object, versionId, governanceBypassAllowed); err != nil {
glog.V(2).Infof("DeleteObjectHandler: object lock check failed for %s/%s: %v", bucket, object, err)
s3err.WriteErrorResponse(w, r, s3err.ErrAccessDenied)
return
}
// Delete specific version
err := s3a.deleteSpecificObjectVersion(bucket, object, versionId)
if err != nil {
glog.Errorf("Failed to delete specific version %s: %v", versionId, err)
s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
return
}
// Set version ID in response header
w.Header().Set("x-amz-version-id", versionId)
} else {
// Delete without version ID - behavior depends on versioning state
if versioningEnabled {
// Enabled versioning: Create delete marker (logical delete)
// AWS S3 behavior: Delete marker creation is NOT blocked by object retention
// because it's a logical delete that doesn't actually remove the retained version
deleteMarkerVersionId, err := s3a.createDeleteMarker(bucket, object)
if err != nil {
glog.Errorf("Failed to create delete marker: %v", err)
s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
return
}
// Set delete marker version ID in response header
w.Header().Set("x-amz-version-id", deleteMarkerVersionId)
w.Header().Set("x-amz-delete-marker", "true")
} else if versioningSuspended {
// Suspended versioning: Actually delete the "null" version object
glog.V(2).Infof("DeleteObjectHandler: deleting null version for suspended versioning %s/%s", bucket, object)
// Check object lock permissions before deleting "null" version
governanceBypassAllowed := s3a.evaluateGovernanceBypassRequest(r, bucket, object)
if err := s3a.enforceObjectLockProtections(r, bucket, object, "null", governanceBypassAllowed); err != nil {
glog.V(2).Infof("DeleteObjectHandler: object lock check failed for %s/%s: %v", bucket, object, err)
s3err.WriteErrorResponse(w, r, s3err.ErrAccessDenied)
return
}
// Delete the "null" version (the regular file)
err := s3a.deleteSpecificObjectVersion(bucket, object, "null")
if err != nil {
glog.Errorf("Failed to delete null version: %v", err)
s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
return
}
// Note: According to AWS S3 spec, suspended versioning should NOT return version ID headers
// The object is deleted but no version information is returned
}
}
} else {
// Handle regular delete (non-versioned)
// Check object lock permissions before deleting object
governanceBypassAllowed := s3a.evaluateGovernanceBypassRequest(r, bucket, object)
if err := s3a.enforceObjectLockProtections(r, bucket, object, "", governanceBypassAllowed); err != nil {
glog.V(2).Infof("DeleteObjectHandler: object lock check failed for %s/%s: %v", bucket, object, err)
s3err.WriteErrorResponse(w, r, s3err.ErrAccessDenied)
return
}
target := util.FullPath(fmt.Sprintf("%s/%s%s", s3a.option.BucketsPath, bucket, object))
dir, name := target.DirAndName()
err := s3a.WithFilerClient(false, func(client filer_pb.SeaweedFilerClient) error {
return doDeleteEntry(client, dir, name, true, false)
// Note: Empty folder cleanup is now handled asynchronously by EmptyFolderCleaner
// which listens to metadata events and uses consistent hashing for coordination
})
if err != nil {
s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
return
}
}
if auditLog != nil {
auditLog.Key = strings.TrimPrefix(object, "/")
s3err.PostAccessLog(*auditLog)
}
stats_collect.RecordBucketActiveTime(bucket)
stats_collect.S3DeletedObjectsCounter.WithLabelValues(bucket).Inc()
w.WriteHeader(http.StatusNoContent)
}
// ObjectIdentifier represents an object to be deleted with its key name and optional version ID.
type ObjectIdentifier struct {
Key string `xml:"Key"`
VersionId string `xml:"VersionId,omitempty"`
DeleteMarker bool `xml:"DeleteMarker,omitempty"`
DeleteMarkerVersionId string `xml:"DeleteMarkerVersionId,omitempty"`
}
// DeleteObjectsRequest - xml carrying the object key names which needs to be deleted.
type DeleteObjectsRequest struct {
// Element to enable quiet mode for the request
Quiet bool
// List of objects to be deleted
Objects []ObjectIdentifier `xml:"Object"`
}
// DeleteError structure.
type DeleteError struct {
Code string `xml:"Code"`
Message string `xml:"Message"`
Key string `xml:"Key"`
VersionId string `xml:"VersionId,omitempty"`
}
// DeleteObjectsResponse container for multiple object deletes.
type DeleteObjectsResponse struct {
XMLName xml.Name `xml:"http://s3.amazonaws.com/doc/2006-03-01/ DeleteResult" json:"-"`
// Collection of all deleted objects
DeletedObjects []ObjectIdentifier `xml:"Deleted,omitempty"`
// Collection of errors deleting certain objects.
Errors []DeleteError `xml:"Error,omitempty"`
}
// DeleteMultipleObjectsHandler - Delete multiple objects
func (s3a *S3ApiServer) DeleteMultipleObjectsHandler(w http.ResponseWriter, r *http.Request) {
bucket, _ := s3_constants.GetBucketAndObject(r)
glog.V(3).Infof("DeleteMultipleObjectsHandler %s", bucket)
deleteXMLBytes, err := io.ReadAll(r.Body)
if err != nil {
s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
return
}
deleteObjects := &DeleteObjectsRequest{}
if err := xml.Unmarshal(deleteXMLBytes, deleteObjects); err != nil {
s3err.WriteErrorResponse(w, r, s3err.ErrMalformedXML)
return
}
if len(deleteObjects.Objects) > deleteMultipleObjectsLimit {
s3err.WriteErrorResponse(w, r, s3err.ErrInvalidMaxDeleteObjects)
return
}
var deletedObjects []ObjectIdentifier
var deleteErrors []DeleteError
var auditLog *s3err.AccessLog
if s3err.Logger != nil {
auditLog = s3err.GetAccessLog(r, http.StatusNoContent, s3err.ErrNone)
}
// Get detailed versioning state for proper handling of suspended vs enabled versioning
versioningState, err := s3a.getVersioningState(bucket)
if err != nil {
if err == filer_pb.ErrNotFound {
s3err.WriteErrorResponse(w, r, s3err.ErrNoSuchBucket)
return
}
glog.Errorf("Error checking versioning status for bucket %s: %v", bucket, err)
s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
return
}
versioningEnabled := (versioningState == s3_constants.VersioningEnabled)
versioningSuspended := (versioningState == s3_constants.VersioningSuspended)
versioningConfigured := (versioningState != "")
s3a.WithFilerClient(false, func(client filer_pb.SeaweedFilerClient) error {
// delete file entries
for _, object := range deleteObjects.Objects {
if object.Key == "" {
continue
}
// Check object lock permissions before deletion (only for versioned buckets)
if versioningConfigured {
// Validate governance bypass for this specific object
governanceBypassAllowed := s3a.evaluateGovernanceBypassRequest(r, bucket, object.Key)
if err := s3a.enforceObjectLockProtections(r, bucket, object.Key, object.VersionId, governanceBypassAllowed); err != nil {
glog.V(2).Infof("DeleteMultipleObjectsHandler: object lock check failed for %s/%s (version: %s): %v", bucket, object.Key, object.VersionId, err)
deleteErrors = append(deleteErrors, DeleteError{
Code: s3err.GetAPIError(s3err.ErrAccessDenied).Code,
Message: s3err.GetAPIError(s3err.ErrAccessDenied).Description,
Key: object.Key,
VersionId: object.VersionId,
})
continue
}
}
var deleteVersionId string
var isDeleteMarker bool
if versioningConfigured {
// Handle versioned delete based on specific versioning state
if object.VersionId != "" {
// Delete specific version (same for both enabled and suspended)
err := s3a.deleteSpecificObjectVersion(bucket, object.Key, object.VersionId)
if err != nil {
deleteErrors = append(deleteErrors, DeleteError{
Code: "",
Message: err.Error(),
Key: object.Key,
VersionId: object.VersionId,
})
continue
}
deleteVersionId = object.VersionId
} else {
// Delete without version ID - behavior depends on versioning state
if versioningEnabled {
// Enabled versioning: Create delete marker (logical delete)
deleteMarkerVersionId, err := s3a.createDeleteMarker(bucket, object.Key)
if err != nil {
deleteErrors = append(deleteErrors, DeleteError{
Code: "",
Message: err.Error(),
Key: object.Key,
VersionId: object.VersionId,
})
continue
}
deleteVersionId = deleteMarkerVersionId
isDeleteMarker = true
} else if versioningSuspended {
// Suspended versioning: Actually delete the "null" version object
glog.V(2).Infof("DeleteMultipleObjectsHandler: deleting null version for suspended versioning %s/%s", bucket, object.Key)
err := s3a.deleteSpecificObjectVersion(bucket, object.Key, "null")
if err != nil {
deleteErrors = append(deleteErrors, DeleteError{
Code: "",
Message: err.Error(),
Key: object.Key,
VersionId: "null",
})
continue
}
deleteVersionId = "null"
// Note: For suspended versioning, we don't set isDeleteMarker=true
// because we actually deleted the object, not created a delete marker
}
}
// Add to successful deletions with version info
deletedObject := ObjectIdentifier{
Key: object.Key,
VersionId: deleteVersionId,
DeleteMarker: isDeleteMarker,
}
// For delete markers, also set DeleteMarkerVersionId field
if isDeleteMarker {
deletedObject.DeleteMarkerVersionId = deleteVersionId
// Don't set VersionId for delete markers, use DeleteMarkerVersionId instead
deletedObject.VersionId = ""
}
if !deleteObjects.Quiet {
deletedObjects = append(deletedObjects, deletedObject)
}
if isDeleteMarker {
// For delete markers, we don't need to track directories for cleanup
continue
}
} else {
// Handle non-versioned delete (original logic)
lastSeparator := strings.LastIndex(object.Key, "/")
parentDirectoryPath, entryName, isDeleteData, isRecursive := "", object.Key, true, false
if lastSeparator > 0 && lastSeparator+1 < len(object.Key) {
entryName = object.Key[lastSeparator+1:]
parentDirectoryPath = "/" + object.Key[:lastSeparator]
}
parentDirectoryPath = fmt.Sprintf("%s/%s%s", s3a.option.BucketsPath, bucket, parentDirectoryPath)
err := doDeleteEntry(client, parentDirectoryPath, entryName, isDeleteData, isRecursive)
if err == nil {
deletedObjects = append(deletedObjects, object)
} else if strings.Contains(err.Error(), filer.MsgFailDelNonEmptyFolder) {
deletedObjects = append(deletedObjects, object)
} else {
deleteErrors = append(deleteErrors, DeleteError{
Code: "",
Message: err.Error(),
Key: object.Key,
VersionId: object.VersionId,
})
}
}
if auditLog != nil {
auditLog.Key = object.Key
s3err.PostAccessLog(*auditLog)
}
}
// Note: Empty folder cleanup is now handled asynchronously by EmptyFolderCleaner
// which listens to metadata events and uses consistent hashing for coordination
return nil
})
deleteResp := DeleteObjectsResponse{}
if !deleteObjects.Quiet {
deleteResp.DeletedObjects = deletedObjects
}
deleteResp.Errors = deleteErrors
stats_collect.RecordBucketActiveTime(bucket)
stats_collect.S3DeletedObjectsCounter.WithLabelValues(bucket).Add(float64(len(deletedObjects)))
writeSuccessResponseXML(w, r, deleteResp)
}