seaweedFS/weed/s3api/s3api_sosapi.go
Chris Lu e6ee293c17 Add table operations test (#8241)
* Add Trino blog operations test

* Update test/s3tables/catalog_trino/trino_blog_operations_test.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* feat: add table bucket path helpers and filer operations

- Add table object root and table location mapping directories
- Implement ensureDirectory, upsertFile, deleteEntryIfExists helpers
- Support table location bucket mapping for S3 access

* feat: manage table bucket object roots on creation/deletion

- Create .objects directory for table buckets on creation
- Clean up table object bucket paths on deletion
- Enable S3 operations on table bucket object roots

* feat: add table location mapping for Iceberg REST

- Track table location bucket mappings when tables are created/updated/deleted
- Enable location-based routing for S3 operations on table data

* feat: route S3 operations to table bucket object roots

- Route table-s3 bucket names to mapped table paths
- Route table buckets to object root directories
- Support table location bucket mapping lookup

* feat: emit table-s3 locations from Iceberg REST

- Generate unique table-s3 bucket names with UUID suffix
- Store table metadata under table bucket paths
- Return table-s3 locations for Trino compatibility

* fix: handle missing directories in S3 list operations

- Propagate ErrNotFound from ListEntries for non-existent directories
- Treat missing directories as empty results for list operations
- Fixes Trino non-empty location checks on table creation

* test: improve Trino CSV parsing for single-value results

- Sanitize Trino output to skip jline warnings
- Handle single-value CSV results without header rows
- Strip quotes from numeric values in tests

* refactor: use bucket path helpers throughout S3 API

- Replace direct bucket path operations with helper functions
- Leverage centralized table bucket routing logic
- Improve maintainability with consistent path resolution

* fix: add table bucket cache and improve filer error handling

- Cache table bucket lookups to reduce filer overhead on repeated checks
- Use filer_pb.CreateEntry and filer_pb.UpdateEntry helpers to check resp.Error
- Fix delete order in handler_bucket_get_list_delete: delete table object before directory
- Make location mapping errors best-effort: log and continue, don't fail API
- Update table location mappings to delete stale prior bucket mappings on update
- Add 1-second sleep before timestamp time travel query to ensure timestamps are in past
- Fix CSV parsing: examine all lines, not skip first; handle single-value rows

* fix: properly handle stale metadata location mapping cleanup

- Capture oldMetadataLocation before mutation in handleUpdateTable
- Update updateTableLocationMapping to accept both old and new locations
- Use passed-in oldMetadataLocation to detect location changes
- Delete stale mapping only when location actually changes
- Pass empty string for oldLocation in handleCreateTable (new tables have no prior mapping)
- Improve logging to show old -> new location transitions

* refactor: cleanup imports and cache design

- Remove unused 'sync' import from bucket_paths.go
- Use filer_pb.UpdateEntry helper in setExtendedAttribute and deleteExtendedAttribute for consistent error handling
- Add dedicated tableBucketCache map[string]bool to BucketRegistry instead of mixing concerns with metadataCache
- Improve cache separation: table buckets cache is now separate from bucket metadata cache

* fix: improve cache invalidation and add transient error handling

Cache invalidation (critical fix):
- Add tableLocationCache to BucketRegistry for location mapping lookups
- Clear tableBucketCache and tableLocationCache in RemoveBucketMetadata
- Prevents stale cache entries when buckets are deleted/recreated

Transient error handling:
- Only cache table bucket lookups when conclusive (found or ErrNotFound)
- Skip caching on transient errors (network, permission, etc)
- Prevents marking real table buckets as non-table due to transient failures

Performance optimization:
- Cache tableLocationDir results to avoid repeated filer RPCs on hot paths
- tableLocationDir now checks cache before making expensive filer lookups
- Cache stores empty string for 'not found' to avoid redundant lookups

Code clarity:
- Add comment to deleteDirectory explaining DeleteEntry response lacks Error field

* go fmt

* fix: mirror transient error handling in tableLocationDir and optimize bucketDir

Transient error handling:
- tableLocationDir now only caches definitive results
- Mirrors isTableBucket behavior to prevent treating transient errors as permanent misses
- Improves reliability on flaky systems or during recovery

Performance optimization:
- bucketDir avoids redundant isTableBucket call via bucketRoot
- Directly use s3a.option.BucketsPath for regular buckets
- Saves one cache lookup for every non-table bucket operation

* fix: revert bucketDir optimization to preserve bucketRoot logic

The optimization to directly use BucketsPath bypassed bucketRoot's logic
and caused issues with S3 list operations on delimiter+prefix cases.

Revert to using path.Join(s3a.bucketRoot(bucket), bucket) which properly
handles all bucket types and ensures consistent path resolution across
the codebase.

The slight performance cost of an extra cache lookup is worth the correctness
and consistency benefits.

* feat: move table buckets under /buckets

Add a table-bucket marker attribute, reuse bucket metadata cache for table bucket detection, and update list/validation/UI/test paths to treat table buckets as /buckets entries.

* Fix S3 Tables code review issues

- handler_bucket_create.go: Fix bucket existence check to properly validate
  entryResp.Entry before setting s3BucketExists flag (nil Entry should not
  indicate existing bucket)
- bucket_paths.go: Add clarifying comment to bucketRoot() explaining unified
  buckets root path for all bucket types
- file_browser_data.go: Optimize by extracting table bucket check early to
  avoid redundant WithFilerClient call

* Fix list prefix delimiter handling

* Handle list errors conservatively

* Fix Trino FOR TIMESTAMP query - use past timestamp

Iceberg requires the timestamp to be strictly in the past.
Use current_timestamp - interval '1' second instead of current_timestamp.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-07 13:27:47 -08:00


// Package s3api implements the S3 API for SeaweedFS.
// This file implements the Smart Object Storage API (SOSAPI) which enables
// enterprise backup software to automatically discover storage system
// capabilities and capacity information.
package s3api

import (
	"bytes"
	"context"
	"crypto/md5"
	"encoding/hex"
	"encoding/xml"
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/glog"
	"github.com/seaweedfs/seaweedfs/weed/pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/filer_pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/s3api/s3err"
	"github.com/seaweedfs/seaweedfs/weed/util/version"
)

// SOSAPI constants
const (
	// sosAPISystemFolder is the well-known folder path for SOSAPI system files.
	// This UUID-based path is part of the SOSAPI specification.
	sosAPISystemFolder = ".system-d26a9498-cb7c-4a87-a44a-8ae204f5ba6c"

	// sosAPISystemXML is the path to the system capabilities XML file.
	sosAPISystemXML = sosAPISystemFolder + "/system.xml"

	// sosAPICapacityXML is the path to the capacity information XML file.
	sosAPICapacityXML = sosAPISystemFolder + "/capacity.xml"

	// sosAPIClientUserAgent is a substring to detect SOSAPI-compatible backup clients.
	sosAPIClientUserAgent = "APN/1.0 Veeam/1.0"

	// sosAPIProtocolVersion is the SOSAPI protocol version supported.
	sosAPIProtocolVersion = `"1.0"`

	// sosAPIDefaultBlockSizeKB is the recommended block size in KB.
	// 4096 KB (4 MB) is optimal for object storage workloads.
	sosAPIDefaultBlockSizeKB = 4096
)

// SystemInfo represents the system.xml response structure for SOSAPI.
// It describes the storage system's capabilities and recommendations.
type SystemInfo struct {
	XMLName              xml.Name `xml:"SystemInfo"`
	ProtocolVersion      string   `xml:"ProtocolVersion"`
	ModelName            string   `xml:"ModelName"`
	ProtocolCapabilities struct {
		CapacityInfo   bool `xml:"CapacityInfo"`
		UploadSessions bool `xml:"UploadSessions"`
		IAMSTS         bool `xml:"IAMSTS"`
	} `xml:"ProtocolCapabilities"`
	APIEndpoints          *APIEndpoints          `xml:"APIEndpoints,omitempty"`
	SystemRecommendations *SystemRecommendations `xml:"SystemRecommendations,omitempty"`
}

// APIEndpoints contains optional IAM and STS endpoint information.
type APIEndpoints struct {
	IAMEndpoint string `xml:"IAMEndpoint,omitempty"`
	STSEndpoint string `xml:"STSEndpoint,omitempty"`
}

// SystemRecommendations contains storage system performance recommendations.
type SystemRecommendations struct {
	S3ConcurrentTaskLimit    int `xml:"S3ConcurrentTaskLimit,omitempty"`
	S3MultiObjectDeleteLimit int `xml:"S3MultiObjectDeleteLimit,omitempty"`
	StorageCurrentTaskLimit  int `xml:"StorageCurrentTaskLimit,omitempty"`
	KBBlockSize              int `xml:"KbBlockSize"`
}

// CapacityInfo represents the capacity.xml response structure for SOSAPI.
// It provides real-time storage capacity information.
type CapacityInfo struct {
	XMLName   xml.Name `xml:"CapacityInfo"`
	Capacity  int64    `xml:"Capacity"`
	Available int64    `xml:"Available"`
	Used      int64    `xml:"Used"`
}

// isSOSAPIObject checks if the given object path is a SOSAPI virtual object.
// These objects don't physically exist but are generated on-demand.
func isSOSAPIObject(object string) bool {
	switch object {
	case sosAPISystemXML, sosAPICapacityXML:
		return true
	default:
		return false
	}
}

// isSOSAPIClient checks if the request comes from a SOSAPI-compatible client
// by examining the User-Agent header.
func isSOSAPIClient(r *http.Request) bool {
	userAgent := r.Header.Get("User-Agent")
	return strings.Contains(userAgent, sosAPIClientUserAgent)
}

// generateSystemXML creates the system.xml response containing storage system
// capabilities and recommendations.
func generateSystemXML() ([]byte, error) {
	si := SystemInfo{
		ProtocolVersion: sosAPIProtocolVersion,
		ModelName:       "\"SeaweedFS " + version.VERSION_NUMBER + "\"",
	}

	// Enable capacity reporting capability
	si.ProtocolCapabilities.CapacityInfo = true
	si.ProtocolCapabilities.UploadSessions = false
	si.ProtocolCapabilities.IAMSTS = false

	// Set recommended block size for optimal performance
	si.SystemRecommendations = &SystemRecommendations{
		KBBlockSize: sosAPIDefaultBlockSizeKB,
	}

	return xml.Marshal(&si)
}

// generateCapacityXML creates the capacity.xml response containing real-time
// storage capacity information.
func (s3a *S3ApiServer) generateCapacityXML(ctx context.Context, bucket string) ([]byte, error) {
	total, available, used, err := s3a.getCapacityInfo(ctx, bucket)
	if err != nil {
		glog.Warningf("SOSAPI: failed to get capacity info for bucket %s: %v, using defaults", bucket, err)
		// Return zero capacity on error
		total, available, used = 0, 0, 0
	}

	ci := CapacityInfo{
		Capacity:  total,
		Available: available,
		Used:      used,
	}

	return xml.Marshal(&ci)
}

// getCapacityInfo retrieves capacity information for the specific bucket.
// It checks bucket quota first, then falls back to cluster topology information.
// Returns capacity, available, and used bytes.
func (s3a *S3ApiServer) getCapacityInfo(ctx context.Context, bucket string) (capacity, available, used int64, err error) {
	// 1. Check whether the bucket has a quota.
	var quota int64
	// getBucketEntry talks to the filer, so an error here can mean a filer
	// connectivity issue or a missing bucket. The bucket is expected to have
	// been checked before this point, so the error is safe to ignore; the
	// quota then stays 0 and cluster capacity is used instead.
	if entry, getErr := s3a.getBucketEntry(bucket); getErr == nil && entry != nil {
		quota = entry.Quota
	}

	// 2. Get cluster topology from master
	if len(s3a.option.Masters) == 0 {
		return 0, 0, 0, fmt.Errorf("no master servers configured")
	}

	masterMap := make(map[string]pb.ServerAddress)
	for _, master := range s3a.option.Masters {
		masterMap[string(master)] = master
	}

	// Connect to any available master and get volume list (topology)
	err = pb.WithOneOfGrpcMasterClients(false, masterMap, s3a.option.GrpcDialOption, func(client master_pb.SeaweedClient) error {
		resp, vErr := client.VolumeList(ctx, &master_pb.VolumeListRequest{})
		if vErr != nil {
			return vErr
		}
		if resp.TopologyInfo == nil {
			return nil
		}

		// Calculate used size for the bucket by summing up volumes in the collection
		used = collectBucketUsageFromTopology(resp.TopologyInfo, s3a.getCollectionName(bucket))

		// Use the bucket quota as capacity when one is set; otherwise fall
		// back to the capacity of the whole cluster.
		if quota > 0 {
			capacity = quota
			available = quota - used
			if available < 0 {
				available = 0
			}
		} else {
			// No quota - use cluster capacity
			clusterTotal, clusterAvailable := calculateClusterCapacity(resp.TopologyInfo, resp.VolumeSizeLimitMb)
			capacity = clusterTotal
			available = clusterAvailable
		}
		return nil
	})

	return capacity, available, used, err
}

// collectBucketUsageFromTopology sums up the size of all volumes belonging to the specified collection.
func collectBucketUsageFromTopology(t *master_pb.TopologyInfo, collectionName string) (used int64) {
	seenVolumes := make(map[uint32]bool)
	for _, dc := range t.DataCenterInfos {
		for _, r := range dc.RackInfos {
			for _, dn := range r.DataNodeInfos {
				for _, disk := range dn.DiskInfos {
					for _, vi := range disk.VolumeInfos {
						if vi.Collection == collectionName {
							if !seenVolumes[vi.Id] {
								used += int64(vi.Size)
								seenVolumes[vi.Id] = true
							}
						}
					}
				}
			}
		}
	}
	return
}

// calculateClusterCapacity sums up the total and available capacity of the entire cluster.
func calculateClusterCapacity(t *master_pb.TopologyInfo, volumeSizeLimitMb uint64) (total, available int64) {
	volumeSize := int64(volumeSizeLimitMb) * 1024 * 1024
	for _, dc := range t.DataCenterInfos {
		for _, r := range dc.RackInfos {
			for _, dn := range r.DataNodeInfos {
				for _, disk := range dn.DiskInfos {
					total += int64(disk.MaxVolumeCount) * volumeSize
					available += int64(disk.FreeVolumeCount) * volumeSize
				}
			}
		}
	}
	return
}

// handleSOSAPIGetObject handles GET requests for SOSAPI virtual objects.
// Returns true if the request was handled, false if it should proceed normally.
func (s3a *S3ApiServer) handleSOSAPIGetObject(w http.ResponseWriter, r *http.Request, bucket, object string) bool {
	if !isSOSAPIObject(object) {
		return false
	}

	xmlData, err := s3a.generateSOSAPIContent(r.Context(), bucket, object)
	if err != nil {
		if errors.Is(err, filer_pb.ErrNotFound) {
			s3err.WriteErrorResponse(w, r, s3err.ErrNoSuchBucket)
		} else {
			glog.Errorf("SOSAPI: failed to generate %s: %v", object, err)
			s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
		}
		return true
	}

	// Calculate ETag from content
	hash := md5.Sum(xmlData)
	etag := hex.EncodeToString(hash[:])

	// Set ETag header manually as ServeContent doesn't calculate it automatically
	w.Header().Set("ETag", "\""+etag+"\"")
	w.Header().Set("Content-Type", "application/xml")

	// Use http.ServeContent to handle Content-Length, Range, and Last-Modified
	http.ServeContent(w, r, object, time.Now().UTC(), bytes.NewReader(xmlData))

	return true
}

// handleSOSAPIHeadObject handles HEAD requests for SOSAPI virtual objects.
// Returns true if the request was handled, false if it should proceed normally.
func (s3a *S3ApiServer) handleSOSAPIHeadObject(w http.ResponseWriter, r *http.Request, bucket, object string) bool {
	if !isSOSAPIObject(object) {
		return false
	}

	xmlData, err := s3a.generateSOSAPIContent(r.Context(), bucket, object)
	if err != nil {
		if errors.Is(err, filer_pb.ErrNotFound) {
			s3err.WriteErrorResponse(w, r, s3err.ErrNoSuchBucket)
		} else {
			glog.Errorf("SOSAPI: failed to generate %s for HEAD: %v", object, err)
			s3err.WriteErrorResponse(w, r, s3err.ErrInternalError)
		}
		return true
	}

	// Calculate ETag from content
	hash := md5.Sum(xmlData)
	etag := hex.EncodeToString(hash[:])

	// Set response headers (no body for HEAD)
	w.Header().Set("Content-Type", "application/xml")
	w.Header().Set("ETag", "\""+etag+"\"")
	w.Header().Set("Content-Length", strconv.Itoa(len(xmlData)))
	w.Header().Set("Last-Modified", time.Now().UTC().Format(http.TimeFormat))
	w.WriteHeader(http.StatusOK)

	return true
}

// generateSOSAPIContent generates the XML content for SOSAPI virtual objects.
// Returns the complete XML with declaration prepended.
func (s3a *S3ApiServer) generateSOSAPIContent(ctx context.Context, bucket, object string) ([]byte, error) {
	// Verify bucket exists
	if _, errCode := s3a.getBucketConfig(bucket); errCode != s3err.ErrNone {
		if errCode == s3err.ErrNoSuchBucket {
			return nil, filer_pb.ErrNotFound
		}
		return nil, fmt.Errorf("bucket config error: %v", errCode)
	}

	var xmlData []byte
	var err error

	switch object {
	case sosAPISystemXML:
		xmlData, err = generateSystemXML()
		if err != nil {
			return nil, err
		}
		glog.V(4).Infof("SOSAPI: generated system.xml for bucket %s", bucket)
	case sosAPICapacityXML:
		xmlData, err = s3a.generateCapacityXML(ctx, bucket)
		if err != nil {
			return nil, err
		}
		glog.V(4).Infof("SOSAPI: generated capacity.xml for bucket %s", bucket)
	default:
		return nil, fmt.Errorf("unknown SOSAPI object: %s", object)
	}

	// Prepend XML declaration
	xmlData = append([]byte(xml.Header), xmlData...)
	return xmlData, nil
}