fix: reduce N+1 queries in S3 versioned object list operations (#7814)

* fix: achieve single-scan efficiency for S3 versioned object listing

When listing objects in a versioning-enabled bucket, the original code
triggered multiple getEntry calls per versioned object (up to 12 with
retries), causing excessive 'find' operations visible in Grafana and
leading to high memory usage.

This fix achieves single-scan efficiency by caching list metadata
(size, ETag, mtime, owner) directly in the .versions directory:

1. Add new Extended keys for caching list metadata in .versions dir
2. Update upload/copy/multipart paths to cache metadata when creating versions
3. Update getLatestVersionEntryFromDirectoryEntry to use cached metadata
   (zero getEntry calls when cache is available)
4. Update updateLatestVersionAfterDeletion to maintain cache consistency

Performance improvement for N versioned objects:
- Before: N×1 to N×12 find operations per list request
- After: 0 extra find operations (all metadata from single scan)

This matches the efficiency of normal (non-versioned) object listing.
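
The caching idea above can be sketched as follows. This is a minimal illustration, not the code from this commit: the key names, the listMeta type, and both helpers are hypothetical stand-ins for the new Extended keys and the real entry types.

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical key names; the real Extended keys added by this commit
// are not spelled out in the message.
const (
	keyCachedSize  = "example-latest-size"
	keyCachedETag  = "example-latest-etag"
	keyCachedMtime = "example-latest-mtime"
	keyCachedOwner = "example-latest-owner"
)

// listMeta is the subset of fields a list response needs per object.
type listMeta struct {
	Size  uint64
	ETag  string
	Mtime int64
	Owner string
}

// cacheListMeta stores m in the extended attributes of a .versions
// directory entry (modelled here as a plain map).
func cacheListMeta(ext map[string][]byte, m listMeta) {
	ext[keyCachedSize] = []byte(strconv.FormatUint(m.Size, 10))
	ext[keyCachedETag] = []byte(m.ETag)
	ext[keyCachedMtime] = []byte(strconv.FormatInt(m.Mtime, 10))
	if m.Owner != "" {
		ext[keyCachedOwner] = []byte(m.Owner)
	}
}

// readListMeta rebuilds listMeta from the directory entry alone. ok is
// false when the cache is missing or malformed, in which case a caller
// would fall back to fetching the version entry.
func readListMeta(ext map[string][]byte) (m listMeta, ok bool) {
	sizeRaw, hasSize := ext[keyCachedSize]
	etagRaw, hasETag := ext[keyCachedETag]
	mtimeRaw, hasMtime := ext[keyCachedMtime]
	if !hasSize || !hasETag || !hasMtime {
		return listMeta{}, false
	}
	size, err := strconv.ParseUint(string(sizeRaw), 10, 64)
	if err != nil {
		return listMeta{}, false
	}
	mtime, err := strconv.ParseInt(string(mtimeRaw), 10, 64)
	if err != nil {
		return listMeta{}, false
	}
	return listMeta{
		Size:  size,
		ETag:  string(etagRaw),
		Mtime: mtime,
		Owner: string(ext[keyCachedOwner]), // empty when no owner was cached
	}, true
}

func main() {
	ext := map[string][]byte{}
	cacheListMeta(ext, listMeta{Size: 1024, ETag: "\"abc-2\"", Mtime: 1766108667, Owner: "acct"})
	m, ok := readListMeta(ext)
	fmt.Println(m.Size, m.ETag, m.Mtime, m.Owner, ok)
}
```

Because the directory entry is already returned by the parent scan, a cache hit needs no further lookup; only a miss would pay for an extra fetch.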

* Update s3api_object_versioning.go

* s3api: fix ETag handling for versioned objects and simplify delete marker creation

- Add Md5 attribute to synthetic logicalEntry for single-part uploads to ensure
  filer.ETag() returns correct value in ListObjects response
- Simplify delete marker creation by initializing entry directly in mkFile callback
- Add bytes and encoding/hex imports for ETag parsing
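
One way to derive an Md5 attribute from a cached ETag is sketched below. The helper name md5FromETag is hypothetical, and the exact parsing used in the commit is not shown in this message, so treat this as an assumption about the approach rather than the actual implementation.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// md5FromETag returns the raw MD5 bytes for a single-part ETag. It
// returns nil for multipart-style ETags ("<hex>-<parts>") and for
// anything that is not a 16-byte hex digest.
func md5FromETag(etag string) []byte {
	etag = strings.Trim(etag, "\"") // ETags are usually stored quoted
	if strings.Contains(etag, "-") {
		return nil // multipart ETag: not a plain MD5
	}
	sum, err := hex.DecodeString(etag)
	if err != nil || len(sum) != 16 {
		return nil
	}
	return sum
}

func main() {
	single := md5FromETag("\"d41d8cd98f00b204e9800998ecf8427e\"")
	multi := md5FromETag("\"d41d8cd98f00b204e9800998ecf8427e-2\"")
	fmt.Println(hex.EncodeToString(single), multi == nil)
}
```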

* s3api: preserve default attributes in delete marker mkFile callback

Only modify Mtime field instead of replacing the entire Attributes struct,
preserving default values like Crtime, FileMode, Uid, and Gid that mkFile
initializes.

* s3api: fix ETag handling in newListEntry for multipart uploads

Prioritize ExtETagKey from Extended attributes before falling back to
filer.ETag(). This properly handles multipart upload ETags (format: md5-parts)
for versioned objects, where the synthetic entry has cached ETag metadata
but no chunks to calculate from.
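
The lookup order described above might look like the following sketch. The entry type, the extETagKey constant, and the compute callback are hypothetical stand-ins for filer_pb.Entry, s3_constants.ExtETagKey, and filer.ETag().

```go
package main

import "fmt"

// extETagKey is a hypothetical stand-in for s3_constants.ExtETagKey.
const extETagKey = "example-etag"

// entry models only the parts of a filer entry that matter here.
type entry struct {
	Extended map[string][]byte
}

// listETag prefers an ETag stored in Extended and only falls back to
// computing one when nothing is cached. A synthetic entry has no chunks,
// so the fallback alone could not reproduce a multipart ETag.
func listETag(e entry, compute func(entry) string) string {
	if cached, ok := e.Extended[extETagKey]; ok && len(cached) > 0 {
		return string(cached)
	}
	return compute(e)
}

func main() {
	fallback := func(entry) string { return "computed" }
	cached := entry{Extended: map[string][]byte{extETagKey: []byte("abc-2")}}
	fmt.Println(listETag(cached, fallback), listETag(entry{}, fallback))
}
```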

* s3api: reduce code duplication in delete marker creation

Extract deleteMarkerExtended map to be reused in both mkFile callback
and deleteMarkerEntry construction.

* test: add multipart upload versioning tests for ETag verification

Add tests to verify that multipart uploaded objects in versioned buckets
have correct ETags when listed:

- TestMultipartUploadVersioningListETag: Basic multipart upload with 2 parts
- TestMultipartUploadMultipleVersionsListETag: Multiple multipart versions
- TestMixedSingleAndMultipartVersionsListETag: Mix of single-part and multipart

These tests cover a bug where synthetic entries for versioned objects
didn't include proper ETag handling for multipart uploads.

* test: add delete marker test for multipart uploaded versioned objects

TestMultipartUploadDeleteMarkerListBehavior verifies:
- Delete marker creation hides object from ListObjectsV2
- ListObjectVersions shows both version and delete marker
- Version ETag (multipart format) is preserved after delete marker
- Object can be accessed by version ID after delete marker
- Removing delete marker restores object visibility

* refactor: address code review feedback

- test: use assert.ElementsMatch for ETag verification (more idiomatic)
- s3api: optimize newListEntry ETag logic (check ExtETagKey first)
- s3api: fix edge case in ETag parsing (>= 2 instead of > 2)

* s3api: prevent stale cached metadata and preserve existing extended attrs

- setCachedListMetadata: clear old cached keys before setting new values
  to prevent stale data when new version lacks certain fields (e.g., owner)
- createDeleteMarker: merge extended attributes instead of overwriting
  to preserve any existing metadata on the entry
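
Both behaviours can be illustrated with plain maps. The key names and helper names below are hypothetical; the sketch only shows the clear-then-set and merge patterns, not the actual functions in this commit.

```go
package main

import "fmt"

// cachedKeys lists hypothetical cache key names owned by the list cache.
var cachedKeys = []string{"example-size", "example-mtime", "example-etag", "example-owner"}

// setCachedMetadata removes every cache-owned key before writing the new
// values, so a field absent from fresh (such as the owner) cannot survive
// from an older version.
func setCachedMetadata(ext, fresh map[string][]byte) {
	for _, k := range cachedKeys {
		delete(ext, k)
	}
	for k, v := range fresh {
		ext[k] = v
	}
}

// mergeExtended copies src into dst without dropping keys that only
// exist in dst. It allocates dst when it is nil.
func mergeExtended(dst, src map[string][]byte) map[string][]byte {
	if dst == nil {
		dst = make(map[string][]byte, len(src))
	}
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

func main() {
	dir := map[string][]byte{"example-owner": []byte("alice"), "example-size": []byte("10")}
	setCachedMetadata(dir, map[string][]byte{"example-size": []byte("20")})
	_, stale := dir["example-owner"]
	fmt.Println(string(dir["example-size"]), stale)

	merged := mergeExtended(map[string][]byte{"user-meta": []byte("x")}, map[string][]byte{"marker": []byte("true")})
	fmt.Println(len(merged))
}
```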

* s3api: extract clearCachedVersionMetadata to reduce code duplication

- clearCachedVersionMetadata: clears only metadata fields (size, mtime, etag, owner, deleteMarker)
- clearCachedListMetadata: now reuses clearCachedVersionMetadata + clears ID/filename
- setCachedListMetadata: uses clearCachedVersionMetadata (not clearCachedListMetadata
  because caller has already set ID/filename)

* s3api: share timestamp between version entry and cache entry

Capture versionMtime once before mkFile and reuse for both:
- versionEntry.Attributes.Mtime in the mkFile callback
- versionEntryForCache.Attributes.Mtime for list caching

This keeps list vs. HEAD LastModified timestamps aligned.

* s3api: remove amzAccountId variable shadowing in multipart upload

Extract amzAccountId before mkFile callback and reuse in both places,
similar to how versionMtime is handled. Avoids confusion from
redeclaring the same variable.

Author: Chris Lu
Date: 2025-12-18 17:44:27 -08:00
Committed by GitHub
Parent: 414cda4215
Commit: bccef78082
8 changed files with 794 additions and 55 deletions

@@ -204,7 +204,7 @@ func (s3a *S3ApiServer) completeMultipartUpload(r *http.Request, input *s3.Compl
     // Location uses the S3 endpoint that the client connected to
     // Format: scheme://s3-endpoint/bucket/object (following AWS S3 API)
     return &CompleteMultipartUploadResult{
-        Location: aws.String(fmt.Sprintf("%s://%s/%s/%s", getRequestScheme(r), r.Host, url.PathEscape(*input.Bucket), urlPathEscape(*input.Key))),
+        Location: aws.String(fmt.Sprintf("%s://%s/%s/%s", getRequestScheme(r), r.Host, url.PathEscape(*input.Bucket), urlPathEscape(*input.Key))),
         Bucket: input.Bucket,
         ETag: aws.String("\"" + filer.ETagChunks(entry.GetChunks()) + "\""),
         Key: objectKey(input.Key),
@@ -367,6 +367,10 @@ func (s3a *S3ApiServer) completeMultipartUpload(r *http.Request, input *s3.Compl
 versionFileName := s3a.getVersionFileName(versionId)
 versionDir := dirName + "/" + entryName + s3_constants.VersionsFolder
+// Capture timestamp and owner once for consistency between version entry and cache entry
+versionMtime := time.Now().Unix()
+amzAccountId := r.Header.Get(s3_constants.AmzAccountId)
 // Create the version file in the .versions directory
 err = s3a.mkFile(versionDir, versionFileName, finalParts, func(versionEntry *filer_pb.Entry) {
     if versionEntry.Extended == nil {
@@ -382,7 +386,6 @@ func (s3a *S3ApiServer) completeMultipartUpload(r *http.Request, input *s3.Compl
     }
     // Set object owner for versioned multipart objects
-    amzAccountId := r.Header.Get(s3_constants.AmzAccountId)
     if amzAccountId != "" {
         versionEntry.Extended[s3_constants.ExtAmzOwnerKey] = []byte(amzAccountId)
     }
@@ -405,6 +408,7 @@ func (s3a *S3ApiServer) completeMultipartUpload(r *http.Request, input *s3.Compl
         versionEntry.Attributes.Mime = mime
     }
     versionEntry.Attributes.FileSize = uint64(offset)
+    versionEntry.Attributes.Mtime = versionMtime
 })
 if err != nil {
@@ -412,8 +416,25 @@ func (s3a *S3ApiServer) completeMultipartUpload(r *http.Request, input *s3.Compl
     return nil, s3err.ErrInternalError
 }
+// Construct entry with metadata for caching in .versions directory
+// Reuse versionMtime to keep list vs. HEAD timestamps aligned
+etag := "\"" + filer.ETagChunks(finalParts) + "\""
+versionEntryForCache := &filer_pb.Entry{
+    Attributes: &filer_pb.FuseAttributes{
+        FileSize: uint64(offset),
+        Mtime:    versionMtime,
+    },
+    Extended: map[string][]byte{
+        s3_constants.ExtETagKey: []byte(etag),
+    },
+}
+if amzAccountId != "" {
+    versionEntryForCache.Extended[s3_constants.ExtAmzOwnerKey] = []byte(amzAccountId)
+}
 // Update the .versions directory metadata to indicate this is the latest version
-err = s3a.updateLatestVersionInDirectory(*input.Bucket, *input.Key, versionId, versionFileName)
+// Pass entry to cache its metadata for single-scan list efficiency
+err = s3a.updateLatestVersionInDirectory(*input.Bucket, *input.Key, versionId, versionFileName, versionEntryForCache)
 if err != nil {
     glog.Errorf("completeMultipartUpload: failed to update latest version in directory: %v", err)
     return nil, s3err.ErrInternalError