330ba7d9dcfbd6a30e9d83783d3c9e73cc28c12b
3 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e6ee293c17 |
Add table operations test (#8241)
* Add Trino blog operations test
* Update test/s3tables/catalog_trino/trino_blog_operations_test.go
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* feat: add table bucket path helpers and filer operations
  - Add table object root and table location mapping directories
  - Implement ensureDirectory, upsertFile, deleteEntryIfExists helpers
  - Support table location bucket mapping for S3 access
* feat: manage table bucket object roots on creation/deletion
  - Create .objects directory for table buckets on creation
  - Clean up table object bucket paths on deletion
  - Enable S3 operations on table bucket object roots
* feat: add table location mapping for Iceberg REST
  - Track table location bucket mappings when tables are created/updated/deleted
  - Enable location-based routing for S3 operations on table data
* feat: route S3 operations to table bucket object roots
  - Route table-s3 bucket names to mapped table paths
  - Route table buckets to object root directories
  - Support table location bucket mapping lookup
* feat: emit table-s3 locations from Iceberg REST
  - Generate unique table-s3 bucket names with UUID suffix
  - Store table metadata under table bucket paths
  - Return table-s3 locations for Trino compatibility
* fix: handle missing directories in S3 list operations
  - Propagate ErrNotFound from ListEntries for non-existent directories
  - Treat missing directories as empty results for list operations
  - Fixes Trino non-empty location checks on table creation
* test: improve Trino CSV parsing for single-value results
  - Sanitize Trino output to skip jline warnings
  - Handle single-value CSV results without header rows
  - Strip quotes from numeric values in tests
* refactor: use bucket path helpers throughout S3 API
  - Replace direct bucket path operations with helper functions
  - Leverage centralized table bucket routing logic
  - Improve maintainability with consistent path resolution
* fix: add table bucket cache and improve filer error handling
  - Cache table bucket lookups to reduce filer overhead on repeated checks
  - Use filer_pb.CreateEntry and filer_pb.UpdateEntry helpers to check resp.Error
  - Fix delete order in handler_bucket_get_list_delete: delete table object before directory
  - Make location mapping errors best-effort: log and continue, don't fail API
  - Update table location mappings to delete stale prior bucket mappings on update
  - Add 1-second sleep before timestamp time travel query to ensure timestamps are in past
  - Fix CSV parsing: examine all lines, not skip first; handle single-value rows
* fix: properly handle stale metadata location mapping cleanup
  - Capture oldMetadataLocation before mutation in handleUpdateTable
  - Update updateTableLocationMapping to accept both old and new locations
  - Use passed-in oldMetadataLocation to detect location changes
  - Delete stale mapping only when location actually changes
  - Pass empty string for oldLocation in handleCreateTable (new tables have no prior mapping)
  - Improve logging to show old -> new location transitions
* refactor: cleanup imports and cache design
  - Remove unused 'sync' import from bucket_paths.go
  - Use filer_pb.UpdateEntry helper in setExtendedAttribute and deleteExtendedAttribute for consistent error handling
  - Add dedicated tableBucketCache map[string]bool to BucketRegistry instead of mixing concerns with metadataCache
  - Improve cache separation: table buckets cache is now separate from bucket metadata cache
* fix: improve cache invalidation and add transient error handling
  Cache invalidation (critical fix):
  - Add tableLocationCache to BucketRegistry for location mapping lookups
  - Clear tableBucketCache and tableLocationCache in RemoveBucketMetadata
  - Prevents stale cache entries when buckets are deleted/recreated
  Transient error handling:
  - Only cache table bucket lookups when conclusive (found or ErrNotFound)
  - Skip caching on transient errors (network, permission, etc)
  - Prevents marking real table buckets as non-table due to transient failures
  Performance optimization:
  - Cache tableLocationDir results to avoid repeated filer RPCs on hot paths
  - tableLocationDir now checks cache before making expensive filer lookups
  - Cache stores empty string for 'not found' to avoid redundant lookups
  Code clarity:
  - Add comment to deleteDirectory explaining DeleteEntry response lacks Error field
* go fmt
* fix: mirror transient error handling in tableLocationDir and optimize bucketDir
  Transient error handling:
  - tableLocationDir now only caches definitive results
  - Mirrors isTableBucket behavior to prevent treating transient errors as permanent misses
  - Improves reliability on flaky systems or during recovery
  Performance optimization:
  - bucketDir avoids redundant isTableBucket call via bucketRoot
  - Directly use s3a.option.BucketsPath for regular buckets
  - Saves one cache lookup for every non-table bucket operation
* fix: revert bucketDir optimization to preserve bucketRoot logic
  The optimization to directly use BucketsPath bypassed bucketRoot's logic and caused issues with S3 list operations on delimiter+prefix cases. Revert to using path.Join(s3a.bucketRoot(bucket), bucket) which properly handles all bucket types and ensures consistent path resolution across the codebase. The slight performance cost of an extra cache lookup is worth the correctness and consistency benefits.
* feat: move table buckets under /buckets
  Add a table-bucket marker attribute, reuse bucket metadata cache for table bucket detection, and update list/validation/UI/test paths to treat table buckets as /buckets entries.
* Fix S3 Tables code review issues
  - handler_bucket_create.go: Fix bucket existence check to properly validate entryResp.Entry before setting s3BucketExists flag (nil Entry should not indicate existing bucket)
  - bucket_paths.go: Add clarifying comment to bucketRoot() explaining unified buckets root path for all bucket types
  - file_browser_data.go: Optimize by extracting table bucket check early to avoid redundant WithFilerClient call
* Fix list prefix delimiter handling
* Handle list errors conservatively
* Fix Trino FOR TIMESTAMP query - use past timestamp
  Iceberg requires the timestamp to be strictly in the past. Use current_timestamp - interval '1' second instead of current_timestamp.
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> |
||
|
|
b244bb58aa |
s3tables: redesign Iceberg REST Catalog using iceberg-go and automate integration tests (#8197)
* full integration with iceberg-go
* Table Commit Operations (handleUpdateTable)
* s3tables: fix Iceberg v2 compliance and namespace properties
  This commit ensures SeaweedFS Iceberg REST Catalog is compliant with Iceberg Format Version 2 by:
  - Using iceberg-go's table.NewMetadataWithUUID for strict v2 compliance.
  - Explicitly initializing namespace properties to empty maps.
  - Removing omitempty from required Iceberg response fields.
  - Fixing CommitTableRequest unmarshaling using table.Requirements and table.Updates.
* s3tables: automate Iceberg integration tests
  - Added Makefile for local test execution and cluster management.
  - Added docker-compose for PyIceberg compatibility kit.
  - Added Go integration test harness for PyIceberg.
  - Updated GitHub CI to run Iceberg catalog tests automatically.
* s3tables: update PyIceberg test suite for compatibility
  - Updated test_rest_catalog.py to use latest PyIceberg transaction APIs.
  - Updated Dockerfile to include pyarrow and pandas dependencies.
  - Improved namespace and table handling in integration tests.
* s3tables: address review feedback on Iceberg Catalog
  - Implemented robust metadata version parsing and incrementing.
  - Ensured table metadata changes are persisted during commit (handleUpdateTable).
  - Standardized namespace property initialization for consistency.
  - Fixed unused variable and incorrect struct field build errors.
* s3tables: finalize Iceberg REST Catalog and optimize tests
  - Implemented robust metadata versioning and persistence.
  - Standardized namespace property initialization.
  - Optimized integration tests using pre-built Docker image.
  - Added strict property persistence validation to test suite.
  - Fixed build errors from previous partial updates.
* Address PR review: fix Table UUID stability, implement S3Tables UpdateTable, and support full metadata persistence individually
* fix: Iceberg catalog stable UUIDs, metadata persistence, and file writing
  - Ensure table UUIDs are stable (do not regenerate on load).
  - Persist full table metadata (Iceberg JSON) in s3tables extended attributes.
  - Add `MetadataVersion` to explicitly track version numbers, replacing regex parsing.
  - Implement `saveMetadataFile` to persist metadata JSON files to the Filer on commit.
  - Update `CreateTable` and `UpdateTable` handlers to use the new logic.
* test: bind weed mini to 0.0.0.0 in integration tests to fix Docker connectivity
* Iceberg: fix metadata handling in REST catalog
  - Add nil guard in createTable
  - Fix updateTable to correctly load existing metadata from storage
  - Ensure full metadata persistence on updates
  - Populate loadTable result with parsed metadata
* S3Tables: add auth checks and fix response fields in UpdateTable
  - Add CheckPermissionWithContext to UpdateTable handler
  - Include TableARN and MetadataLocation in UpdateTable response
  - Use ErrCodeConflict (409) for version token mismatches
* Tests: improve Iceberg catalog test infrastructure and cleanup
  - Makefile: use PID file for precise process killing
  - test_rest_catalog.py: remove unused variables and fix f-strings
* Iceberg: fix variable shadowing in UpdateTable
  - Rename inner loop variable `req` to `requirement` to avoid shadowing outer request variable
* S3Tables: simplify MetadataVersion initialization
  - Use `max(req.MetadataVersion, 1)` instead of anonymous function
* Tests: remove unicode characters from S3 tables integration test logs
  - Remove unicode checkmarks from test output for cleaner logs
* Iceberg: improve metadata persistence robustness
  - Fix MetadataLocation in LoadTableResult to fallback to generated location
  - Improve saveMetadataFile to ensure directory hierarchy existence and robust error handling |
||
|
|
621834d96a |
s3tables: add Iceberg file layout validation for table buckets (#8176)
* s3tables: add Iceberg file layout validation for table buckets
  This PR adds file layout validation for table buckets to enforce Apache Iceberg table structure. Files uploaded to table buckets must conform to the expected Iceberg layout:
  - metadata/ directory: contains metadata files (*.json, *.avro)
    - v*.metadata.json (table metadata)
    - snap-*.avro (snapshot manifests)
    - *-m*.avro (manifest files)
    - version-hint.text
  - data/ directory: contains data files (*.parquet, *.orc, *.avro)
    - Supports partition paths (e.g., year=2024/month=01/)
    - Supports bucket subdirectories
  The validator exports functions for use by the S3 API:
  - IsTableBucketPath: checks if a path is under /table-buckets/
  - GetTableInfoFromPath: extracts bucket/namespace/table from path
  - ValidateTableBucketUpload: validates file layout for table bucket uploads
  - ValidateTableBucketUploadWithClient: validates with filer client access
  Invalid uploads receive InvalidIcebergLayout error response.
* Address review comments: regex performance, error handling, stricter patterns
* Fix validateMetadataFile and validateDataFile to handle subdirectories and directory creation
* Fix error handling, metadata validation, reduce code duplication
* Fix empty remainingPath handling for directory paths
* Refactor: unify validateMetadataFile and validateDataFile
* Refactor: extract UUID pattern constant
* fix: allow Iceberg partition and directory paths without trailing slashes
  Modified validateFile to correctly handle directory paths that do not end with a trailing slash. This ensures that paths like 'data/year=2024' are validated as directories if they match partition or subdirectory patterns, rather than being incorrectly rejected as invalid files. Added comprehensive test cases for various directory and partition path combinations.
* refactor: use standard path package and idiomatic returns
  Simplified directory and filename extraction in validateFile by using the standard path package (aliased as pathpkg). This improves readability and avoids manual string manipulation. Also updated GetTableInfoFromPath to use naked returns for named return values, aligning with Go conventions for short functions.
* feat: enforce strict Iceberg top-level directories and metadata restrictions
  Implemented strict validation for Iceberg layout:
  - Bare top-level keys like 'metadata' and 'data' are now rejected; they must have a trailing slash or a subpath.
  - Subdirectories under 'metadata/' are now prohibited to enforce the flat structure required by Iceberg.
  - Updated the test suite with negative test cases and ensured proper formatting.
* feat: allow table root directory markers in ValidateTableBucketUpload
  Modified ValidateTableBucketUpload to short-circuit and return nil when the relative path within a table is empty. This occurs when a trailing slash is used on the table directory (e.g., /table-buckets/mybucket/myns/mytable/). Added a test case 'table dir with slash' to verify this behavior.
* test: add regression cases for metadata subdirs and table markers
  Enforced a strictly flat structure for the metadata directory by removing the "directory without trailing slash" fallback in validateFile for metadata. Added regression test cases:
  - metadata/nested (must fail)
  - /table-buckets/.../mytable/ (must pass)
  Verified all tests pass.
* feat: reject double slashes in Iceberg table paths
  Modified validateDirectoryPath to return an error when encountering empty path segments, effectively rejecting double slashes like 'data//file.parquet'. Updated validateFile to use manual path splitting instead of the 'path' package for intermediate directories to ensure redundant slashes are not auto-cleaned before validation. Added regression tests for various double slash scenarios.
* refactor: separate isMetadata logic in validateDirectoryPath
  Following reviewer feedback, refactored validateDirectoryPath to explicitly separate the handling of metadata and data paths. This improves readability and clarifies the function's intent while maintaining the strict validation rules and double-slash rejection previously implemented.
* feat: validate bucket, namespace, and table path segments
  Updated ValidateTableBucketUpload to ensure that bucket, namespace, and table segments in the path are non-empty. This prevents invalid paths like '/table-buckets//myns/mytable/...' from being accepted during upload. Added regression tests for various empty segment scenarios.
* Update weed/s3api/s3tables/iceberg_layout.go
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* feat: block double-slash bypass in table relative paths
  Added a guard in ValidateTableBucketUpload to reject tableRelativePath if it starts with a '/' or contains '//'. This ensures that paths like '/table-buckets/b/ns/t//data/file.parquet' are properly rejected and cannot bypass the layout validation. Added regression tests to verify.
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> |