seaweedFS/weed/s3api/s3tables/iceberg_layout_test.go
Chris Lu 621834d96a s3tables: add Iceberg file layout validation for table buckets (#8176)
* s3tables: add Iceberg file layout validation for table buckets

This PR adds file layout validation for table buckets to enforce Apache
Iceberg table structure. Files uploaded to table buckets must conform
to the expected Iceberg layout:

- metadata/ directory: contains metadata files (*.json, *.avro)
  - v*.metadata.json (table metadata)
  - snap-*.avro (snapshot manifests)
  - *-m*.avro (manifest files)
  - version-hint.text

- data/ directory: contains data files (*.parquet, *.orc, *.avro)
  - Supports partition paths (e.g., year=2024/month=01/)
  - Supports bucket subdirectories
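The layout rules above can be sketched as Go regular expressions. This is a minimal illustration of the naming conventions described in this commit message; the pattern names and exact regexes are hypothetical and may differ from the PR's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns approximating the Iceberg layout described above.
var (
	metadataJSON = regexp.MustCompile(`^v\d+\.metadata\.json$`) // v*.metadata.json
	snapshotAvro = regexp.MustCompile(`^snap-.+\.avro$`)        // snap-*.avro
	manifestAvro = regexp.MustCompile(`^[0-9a-f-]+-m\d+\.avro$`) // *-m*.avro
	dataFile     = regexp.MustCompile(`\.(parquet|orc|avro)$`)  // data file extensions
	partitionDir = regexp.MustCompile(`^[^=/]+=[^/]*$`)         // e.g. year=2024
)

func main() {
	fmt.Println(metadataJSON.MatchString("v1.metadata.json"))    // true
	fmt.Println(snapshotAvro.MatchString("snap-123-1-abc.avro")) // true
	fmt.Println(dataFile.MatchString("file.parquet"))            // true
	fmt.Println(partitionDir.MatchString("year=2024"))           // true
	fmt.Println(dataFile.MatchString("file.csv"))                // false
}
```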

The validator exports functions for use by the S3 API:
- IsTableBucketPath: checks if a path is under /table-buckets/
- GetTableInfoFromPath: extracts bucket/namespace/table from path
- ValidateTableBucketUpload: validates file layout for table bucket uploads
- ValidateTableBucketUploadWithClient: validates with filer client access

Invalid uploads receive an InvalidIcebergLayout error response.
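To make the exported API concrete, here is a self-contained sketch of the path-parsing behavior, consistent with the expectations in the test file below. This is a re-implementation for illustration only, not the PR's actual `GetTableInfoFromPath`:

```go
package main

import (
	"fmt"
	"strings"
)

// getTableInfoFromPath extracts bucket/namespace/table from a
// /table-buckets/ path, returning empty strings for missing segments.
// Sketch consistent with TestGetTableInfoFromPath below.
func getTableInfoFromPath(p string) (bucket, namespace, table string) {
	const prefix = "/table-buckets/"
	if !strings.HasPrefix(p, prefix) {
		return
	}
	parts := strings.Split(strings.TrimPrefix(p, prefix), "/")
	if len(parts) > 0 {
		bucket = parts[0]
	}
	if len(parts) > 1 {
		namespace = parts[1]
	}
	if len(parts) > 2 {
		table = parts[2]
	}
	return
}

func main() {
	fmt.Println(getTableInfoFromPath("/table-buckets/mybucket/myns/mytable/data/file.parquet"))
	// mybucket myns mytable
}
```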

* Address review comments: regex performance, error handling, stricter patterns

* Fix validateMetadataFile and validateDataFile to handle subdirectories and directory creation

* Fix error handling, metadata validation, reduce code duplication

* Fix empty remainingPath handling for directory paths

* Refactor: unify validateMetadataFile and validateDataFile

* Refactor: extract UUID pattern constant

* fix: allow Iceberg partition and directory paths without trailing slashes

Modified validateFile to correctly handle directory paths that do not end with a trailing slash.
This ensures that paths like 'data/year=2024' are validated as directories if they match
partition or subdirectory patterns, rather than being incorrectly rejected as invalid files.
Added comprehensive test cases for various directory and partition path combinations.

* refactor: use standard path package and idiomatic returns

Simplified directory and filename extraction in validateFile by using the standard
path package (aliased as pathpkg). This improves readability and avoids manual string
manipulation. Also updated GetTableInfoFromPath to use naked returns for named
return values, aligning with Go conventions for short functions.

* feat: enforce strict Iceberg top-level directories and metadata restrictions

Implemented strict validation for Iceberg layout:
- Bare top-level keys like 'metadata' and 'data' are now rejected; they must have
  a trailing slash or a subpath.
- Subdirectories under 'metadata/' are now prohibited to enforce the flat structure
  required by Iceberg.
- Updated the test suite with negative test cases and ensured proper formatting.

* feat: allow table root directory markers in ValidateTableBucketUpload

Modified ValidateTableBucketUpload to short-circuit and return nil when the
relative path within a table is empty. This occurs when a trailing slash is
used on the table directory (e.g., /table-buckets/mybucket/myns/mytable/).
Added a test case 'table dir with slash' to verify this behavior.

* test: add regression cases for metadata subdirs and table markers

Enforced a strictly flat structure for the metadata directory by removing the
"directory without trailing slash" fallback in validateFile for metadata.
Added regression test cases:
- metadata/nested (must fail)
- /table-buckets/.../mytable/ (must pass)
Verified all tests pass.

* feat: reject double slashes in Iceberg table paths

Modified validateDirectoryPath to return an error when encountering empty path
segments, effectively rejecting double slashes like 'data//file.parquet'.
Updated validateFile to use manual path splitting instead of the 'path' package
for intermediate directories to ensure redundant slashes are not auto-cleaned
before validation. Added regression tests for various double slash scenarios.

* refactor: separate isMetadata logic in validateDirectoryPath

Following reviewer feedback, refactored validateDirectoryPath to explicitly
separate the handling of metadata and data paths. This improves readability
and clarifies the function's intent while maintaining the strict validation rules
and double-slash rejection previously implemented.

* feat: validate bucket, namespace, and table path segments

Updated ValidateTableBucketUpload to ensure that bucket, namespace, and table
segments in the path are non-empty. This prevents invalid paths like
'/table-buckets//myns/mytable/...' from being accepted during upload.
Added regression tests for various empty segment scenarios.
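The non-empty-segment rule for full file-path uploads can be sketched as below. This is a hypothetical helper mirroring only the check described above; it does not cover the directory-creation short-circuits (e.g. a bare `/table-buckets/mybucket`) that the real validator allows:

```go
package main

import (
	"fmt"
	"strings"
)

// validateSegments rejects upload paths whose bucket, namespace, or table
// segment is empty. Illustrative sketch, not the PR's implementation.
func validateSegments(p string) error {
	rel := strings.TrimPrefix(p, "/table-buckets/")
	parts := strings.SplitN(rel, "/", 4)
	for i, name := range []string{"bucket", "namespace", "table"} {
		if i >= len(parts) || parts[i] == "" {
			return fmt.Errorf("empty %s segment in %q", name, p)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateSegments("/table-buckets/mybucket/myns/mytable/data/file.parquet")) // <nil>
	fmt.Println(validateSegments("/table-buckets//myns/mytable/data/file.parquet"))
}
```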

* Update weed/s3api/s3tables/iceberg_layout.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* feat: block double-slash bypass in table relative paths

Added a guard in ValidateTableBucketUpload to reject tableRelativePath if it
starts with a '/' or contains '//'. This ensures that paths like
'/table-buckets/b/ns/t//data/file.parquet' are properly rejected and cannot
bypass the layout validation. Added regression tests to verify.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 10:05:04 -08:00


package s3tables

import (
	"testing"
)

func TestIcebergLayoutValidator_ValidateFilePath(t *testing.T) {
	v := NewIcebergLayoutValidator()
	tests := []struct {
		name    string
		path    string
		wantErr bool
	}{
		// Valid metadata files
		{"valid metadata v1", "metadata/v1.metadata.json", false},
		{"valid metadata v123", "metadata/v123.metadata.json", false},
		{"valid snapshot manifest", "metadata/snap-123-1-abc12345-1234-5678-9abc-def012345678.avro", false},
		{"valid manifest file", "metadata/abc12345-1234-5678-9abc-def012345678-m0.avro", false},
		{"valid general manifest", "metadata/abc12345-1234-5678-9abc-def012345678.avro", false},
		{"valid version hint", "metadata/version-hint.text", false},
		{"valid uuid metadata", "metadata/abc12345-1234-5678-9abc-def012345678.metadata.json", false},
		// Valid data files
		{"valid parquet file", "data/file.parquet", false},
		{"valid orc file", "data/file.orc", false},
		{"valid avro data file", "data/file.avro", false},
		{"valid parquet with path", "data/00000-0-abc12345.parquet", false},
		// Valid partitioned data
		{"valid partitioned parquet", "data/year=2024/file.parquet", false},
		{"valid multi-partition", "data/year=2024/month=01/file.parquet", false},
		{"valid bucket subdirectory", "data/bucket0/file.parquet", false},
		// Directories only
		{"metadata directory bare", "metadata", true},
		{"data directory bare", "data", true},
		{"metadata directory with slash", "metadata/", false},
		{"data directory with slash", "data/", false},
		// Invalid paths
		{"empty path", "", true},
		{"invalid top dir", "invalid/file.parquet", true},
		{"root file", "file.parquet", true},
		{"invalid metadata file", "metadata/random.txt", true},
		{"nested metadata directory", "metadata/nested/v1.metadata.json", true},
		{"nested metadata directory no file", "metadata/nested/", true},
		{"metadata subdir no slash", "metadata/nested", true},
		{"invalid data file", "data/file.csv", true},
		{"invalid data file json", "data/file.json", true},
		// Partition/subdirectory without trailing slashes
		{"partition directory no slash", "data/year=2024", false},
		{"data subdirectory no slash", "data/my_subdir", false},
		{"multi-level partition", "data/event_date=2025-01-01/hour=00/file.parquet", false},
		{"multi-level partition directory", "data/event_date=2025-01-01/hour=00/", false},
		{"multi-level partition directory no slash", "data/event_date=2025-01-01/hour=00", false},
		// Double slashes
		{"data double slash", "data//file.parquet", true},
		{"data redundant slash", "data/year=2024//file.parquet", true},
		{"metadata redundant slash", "metadata//v1.metadata.json", true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			err := v.ValidateFilePath(tt.path)
			if (err != nil) != tt.wantErr {
				t.Errorf("ValidateFilePath(%q) error = %v, wantErr %v", tt.path, err, tt.wantErr)
			}
		})
	}
}

func TestIcebergLayoutValidator_PartitionPaths(t *testing.T) {
	v := NewIcebergLayoutValidator()
	validPaths := []string{
		"data/year=2024/file.parquet",
		"data/date=2024-01-15/file.parquet",
		"data/category=electronics/file.parquet",
		"data/user_id=12345/file.parquet",
		"data/region=us-east-1/file.parquet",
		"data/year=2024/month=01/day=15/file.parquet",
	}
	for _, path := range validPaths {
		if err := v.ValidateFilePath(path); err != nil {
			t.Errorf("ValidateFilePath(%q) should be valid, got error: %v", path, err)
		}
	}
}

func TestTableBucketFileValidator_ValidateTableBucketUpload(t *testing.T) {
	v := NewTableBucketFileValidator()
	tests := []struct {
		name    string
		path    string
		wantErr bool
	}{
		// Non-table bucket paths should pass (no validation)
		{"regular bucket path", "/buckets/mybucket/file.txt", false},
		{"filer path", "/home/user/file.txt", false},
		// Table bucket structure paths (creating directories)
		{"table bucket root", "/table-buckets/mybucket", false},
		{"namespace dir", "/table-buckets/mybucket/myns", false},
		{"table dir", "/table-buckets/mybucket/myns/mytable", false},
		{"table dir trailing slash", "/table-buckets/mybucket/myns/mytable/", false},
		// Valid table bucket file uploads
		{"valid parquet upload", "/table-buckets/mybucket/myns/mytable/data/file.parquet", false},
		{"valid metadata upload", "/table-buckets/mybucket/myns/mytable/metadata/v1.metadata.json", false},
		{"valid partitioned data", "/table-buckets/mybucket/myns/mytable/data/year=2024/file.parquet", false},
		// Invalid table bucket file uploads
		{"invalid file type", "/table-buckets/mybucket/myns/mytable/data/file.csv", true},
		{"invalid top-level dir", "/table-buckets/mybucket/myns/mytable/invalid/file.parquet", true},
		{"root file in table", "/table-buckets/mybucket/myns/mytable/file.parquet", true},
		// Empty segment cases
		{"empty bucket", "/table-buckets//myns/mytable/data/file.parquet", true},
		{"empty namespace", "/table-buckets/mybucket//mytable/data/file.parquet", true},
		{"empty table", "/table-buckets/mybucket/myns//data/file.parquet", true},
		{"empty bucket dir", "/table-buckets//", true},
		{"empty namespace dir", "/table-buckets/mybucket//", true},
		{"table double slash bypass", "/table-buckets/mybucket/myns/mytable//data/file.parquet", true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			err := v.ValidateTableBucketUpload(tt.path)
			if (err != nil) != tt.wantErr {
				t.Errorf("ValidateTableBucketUpload(%q) error = %v, wantErr %v", tt.path, err, tt.wantErr)
			}
		})
	}
}

func TestIsTableBucketPath(t *testing.T) {
	tests := []struct {
		path string
		want bool
	}{
		{"/table-buckets/mybucket", true},
		{"/table-buckets/mybucket/ns/table/data/file.parquet", true},
		{"/buckets/mybucket", false},
		{"/home/user/file.txt", false},
		{"table-buckets/mybucket", false}, // missing leading slash
	}
	for _, tt := range tests {
		t.Run(tt.path, func(t *testing.T) {
			if got := IsTableBucketPath(tt.path); got != tt.want {
				t.Errorf("IsTableBucketPath(%q) = %v, want %v", tt.path, got, tt.want)
			}
		})
	}
}

func TestGetTableInfoFromPath(t *testing.T) {
	tests := []struct {
		path          string
		wantBucket    string
		wantNamespace string
		wantTable     string
	}{
		{"/table-buckets/mybucket/myns/mytable/data/file.parquet", "mybucket", "myns", "mytable"},
		{"/table-buckets/mybucket/myns/mytable", "mybucket", "myns", "mytable"},
		{"/table-buckets/mybucket/myns", "mybucket", "myns", ""},
		{"/table-buckets/mybucket", "mybucket", "", ""},
		{"/buckets/mybucket", "", "", ""},
	}
	for _, tt := range tests {
		t.Run(tt.path, func(t *testing.T) {
			bucket, namespace, table := GetTableInfoFromPath(tt.path)
			if bucket != tt.wantBucket || namespace != tt.wantNamespace || table != tt.wantTable {
				t.Errorf("GetTableInfoFromPath(%q) = (%q, %q, %q), want (%q, %q, %q)",
					tt.path, bucket, namespace, table, tt.wantBucket, tt.wantNamespace, tt.wantTable)
			}
		})
	}
}