Files

Chris Lu 2f6aa98221 Refactor: Replace removeDuplicateSlashes with NormalizeObjectKey (#7873)

* Replace removeDuplicateSlashes with NormalizeObjectKey

Use s3_constants.NormalizeObjectKey instead of removeDuplicateSlashes in most places
for consistency. NormalizeObjectKey handles both duplicate slash removal and ensures
the path starts with '/', providing more complete normalization.

* Fix double slash issues after NormalizeObjectKey

After using NormalizeObjectKey, object keys have a leading '/'. This commit ensures:
- getVersionedObjectDir strips leading slash before concatenation
- getEntry calls receive names without leading slash
- String concatenation with '/' doesn't create '//' paths

This prevents path construction errors like:
/buckets/bucket//object (wrong)
/buckets/bucket/object (correct)

* ensure object key leading "/"

* fix compilation

* fix: Strip leading slash from object keys in S3 API responses

After introducing NormalizeObjectKey, all internal object keys have a
leading slash. However, S3 API responses must return keys without
leading slashes to match AWS S3 behavior.

Fixed in three functions:
- addVersion: Strip slash for version list entries
- processRegularFile: Strip slash for regular file entries
- processExplicitDirectory: Strip slash for directory entries

This ensures ListObjectVersions and similar APIs return keys like
'bar' instead of '/bar', matching S3 API specifications.

* fix: Normalize keyMarker for consistent pagination comparison

The S3 API provides keyMarker without a leading slash (e.g., 'object-001'),
but after introducing NormalizeObjectKey, all internal object keys have
leading slashes (e.g., '/object-001').

When comparing keyMarker < normalizedObjectKey in shouldSkipObjectForMarker,
the ASCII value of '/' (47) is less than 'o' (111), causing all objects
to be incorrectly skipped during pagination. This resulted in page 2 and
beyond returning 0 results.

Fix: Normalize the keyMarker when creating versionCollector so comparisons
work correctly with normalized object keys.

Fixes pagination tests:
- TestVersioningPaginationOver1000Versions
- TestVersioningPaginationMultipleObjectsManyVersions
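
The ordering problem described above can be reproduced with plain string comparison. This is a minimal Python sketch, not the Go code: `should_skip` is a hypothetical stand-in for shouldSkipObjectForMarker, and it assumes the rule "skip every key at or before the marker".

```python
def should_skip(key: str, marker: str) -> bool:
    """Hypothetical pagination rule: skip keys at or before the marker."""
    return bool(marker) and key <= marker


marker = "object-001"  # form sent by the S3 client (no leading slash)
internal_keys = ["/object-001", "/object-002", "/object-003"]  # leading-slash form

# Mismatched forms: '/' (47) sorts before 'o' (111), so every key looks already seen.
skipped_mismatched = [k for k in internal_keys if should_skip(k, marker)]

# Same form on both sides: only the marker itself is skipped.
normalized_marker = "/" + marker
skipped_normalized = [k for k in internal_keys if should_skip(k, normalized_marker)]

print(skipped_mismatched)
print(skipped_normalized)
```

With mismatched forms all three keys are skipped, which matches the "page 2 returns 0 results" symptom. Once marker and keys share one form, only the marker key is skipped.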

* refactor: Change NormalizeObjectKey to return keys without leading slash

BREAKING STRATEGY CHANGE:
Previously, NormalizeObjectKey added a leading slash to all object keys,
which required stripping it when returning keys to S3 API clients and
caused complexity in marker normalization for pagination.

NEW STRATEGY:
- NormalizeObjectKey now returns keys WITHOUT leading slash (e.g., 'foo/bar' not '/foo/bar')
- This matches the S3 API format directly
- All path concatenations now explicitly add '/' between bucket and object
- No need to strip slashes in responses or normalize markers

Changes:
1. Modified NormalizeObjectKey to strip leading slash instead of adding it
2. Fixed all path concatenations to use:
- BucketsPath + '/' + bucket + '/' + object
instead of:
- BucketsPath + '/' + bucket + object
3. Reverted response key stripping in:
- addVersion()
- processRegularFile()
- processExplicitDirectory()
4. Reverted keyMarker normalization in findVersionsRecursively()
5. Updated matchesPrefixFilter() to work with keys without leading slash
6. Fixed paths in handlers:
- s3api_object_handlers.go (GetObject, HeadObject, cacheRemoteObjectForStreaming)
- s3api_object_handlers_postpolicy.go
- s3api_object_handlers_tagging.go
- s3api_object_handlers_acl.go
- s3api_version_id.go (getVersionedObjectDir, getVersionIdFormat)
- s3api_object_versioning.go (getObjectVersionList, updateLatestVersionAfterDeletion)

All versioning tests pass including pagination stress tests.
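
The concatenation change in item 2 can be sketched as follows. This is an illustrative Python sketch rather than the Go code; `BUCKETS_PATH = "/buckets"` is an assumption taken from the example paths earlier in this commit message, and both function names are hypothetical.

```python
BUCKETS_PATH = "/buckets"  # assumed value, based on the example paths in this message


def object_path(bucket: str, key: str) -> str:
    """New pattern: the key has no leading slash, so the separator is added explicitly."""
    return BUCKETS_PATH + "/" + bucket + "/" + key


def object_path_old(bucket: str, key: str) -> str:
    """Old pattern: only correct when the key already carries a leading slash."""
    return BUCKETS_PATH + "/" + bucket + key


print(object_path("bucket", "object"))       # new pattern, key without leading slash
print(object_path_old("bucket", "object"))   # old pattern glues bucket and key together
print(object_path("bucket", "/object"))      # new pattern with an un-normalized key
```

Either mismatch produces a wrong path: the old pattern with a slash-less key drops the separator, and the new pattern with a leading-slash key yields a double slash. That is why every call site had to switch together with NormalizeObjectKey.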

* adjust format

* Update post policy tests to match new NormalizeObjectKey behavior

- Update TestPostPolicyKeyNormalization to expect keys without leading slashes
- Update TestNormalizeObjectKey to expect keys without leading slashes
- Update TestPostPolicyFilenameSubstitution to expect keys without leading slashes
- Update path construction in tests to use new pattern: BucketsPath + '/' + bucket + '/' + object

* Fix ListObjectVersions prefix filtering

Remove leading slash addition to prefix parameter to allow correct filtering
of .versions directories when listing object versions with a specific prefix.

The prefix parameter should match entry paths relative to bucket root.
Adding a leading slash was breaking the prefix filter for paginated requests.

Fixes pagination issue where second page returned 0 versions instead of
continuing with remaining versions.

* no leading slash

* Fix urlEscapeObject to add leading slash for filer paths

NormalizeObjectKey now returns keys without leading slashes to match S3 API format.
However, urlEscapeObject is used for filer paths which require leading slashes.
Add leading slash back after normalization to ensure filer paths are correct.

Fixes TestS3ApiServer_toFilerPath test failures.

* adjust tests

* normalize

* Fix: Normalize prefixes and markers in LIST operations using NormalizeObjectKey

Ensure consistent key normalization across all S3 operations (GET, PUT, LIST).
Previously, LIST operations were not applying the same normalization rules
(handling backslashes, duplicate slashes, leading slashes) as GET/PUT operations.

Changes:
- Updated normalizePrefixMarker() to call NormalizeObjectKey for both prefix and marker
- This ensures prefixes with leading slashes, backslashes, or duplicate slashes are
handled consistently with how object keys are normalized
- Fixes Parquet test failures where pads.write_dataset creates implicit directory
structures that couldn't be discovered by subsequent LIST operations
- Added TestPrefixNormalizationInList and TestListPrefixConsistency tests

All existing LIST tests continue to pass with the normalization improvements.

* Add debugging logging to LIST operations to track prefix normalization

* Fix: Remove leading slash addition from GetPrefix to work with NormalizeObjectKey

The NormalizeObjectKey function removes leading slashes to match S3 API format
(e.g., 'foo/bar' not '/foo/bar'). However, GetPrefix was adding a leading slash
back, which caused LIST operations to fail with incorrect path handling.

Now GetPrefix only normalizes duplicate slashes without adding a leading slash,
which allows NormalizeObjectKey changes to work correctly for S3 LIST operations.

All Parquet integration tests now pass (20/20).

* Fix: Handle object paths without leading slash in checkDirectoryObject

NormalizeObjectKey() removes the leading slash to match S3 API format.
However, checkDirectoryObject() was assuming the object path has a leading
slash when processing directory markers (paths ending with '/').

Now we ensure the object has a leading slash before processing it for
filer operations.

Fixes implicit directory marker test (explicit_dir/) while keeping
Parquet integration tests passing (20/20).

All tests pass:
- Implicit directory tests: 6/6
- Parquet integration tests: 20/20

* Fix: Handle explicit directory markers with trailing slashes

Explicit directory markers created with put_object(Key='dir/', ...) are stored
in the filer with the trailing slash as part of the name. The checkDirectoryObject()
function now checks for both:
1. Explicit directories: lookup with trailing slash preserved (e.g., 'explicit_dir/')
2. Implicit directories: lookup without trailing slash (e.g., 'implicit_dir')

This ensures both types of directory markers are properly recognized.

All tests pass:
- Implicit directory tests: 6/6 (including explicit directory marker test)
- Parquet integration tests: 20/20

* Fix: Preserve trailing slash in NormalizeObjectKey

NormalizeObjectKey now preserves trailing slashes when normalizing object keys.
This is important for explicit directory markers like 'explicit_dir/' which rely
on the trailing slash to be recognized as directory objects.

The normalization process:
1. Notes if trailing slash was present
2. Removes duplicate slashes and converts backslashes
3. Removes leading slash for S3 API format
4. Restores trailing slash if it was in the original

This ensures explicit directory markers created with put_object(Key='dir/', ...)
are properly normalized and can be looked up by their exact name.

All tests pass:
- Implicit directory tests: 6/6
- Parquet integration tests: 20/20

* clean object

* Fix: Don't restore trailing slash if result is empty

When normalizing paths that are only slashes (e.g., '///', '/'), the function
should return an empty string, not a single slash. The fix ensures we only
restore the trailing slash if the result is non-empty.

This fixes the 'just_slashes' test case:
- Input: '///'
- Expected: ''
- Previous: '/'
- Fixed: ''

All tests now pass:
- Unit tests: TestNormalizeObjectKey (13/13)
- Implicit directory tests: 6/6
- Parquet integration tests: 20/20
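
Taken together, the normalization rules from the last few commits can be sketched in Python. This is not the Go implementation of NormalizeObjectKey; it is a minimal sketch that assumes "converts backslashes" means replacing each backslash with a forward slash, and the function name is only a Python-style rendering of the Go one.

```python
def normalize_object_key(key: str) -> str:
    """Sketch of the described rules: no leading slash, no duplicate slashes,
    backslashes converted, trailing slash preserved unless the result is empty."""
    had_trailing_slash = key.endswith("/")   # step 1: remember the trailing slash
    key = key.replace("\\", "/")             # step 2a: assumed backslash conversion
    parts = [p for p in key.split("/") if p] # step 2b/3: drop empty segments
    result = "/".join(parts)
    if had_trailing_slash and result:        # step 4: restore only if non-empty
        result += "/"
    return result


for raw in ["/foo//bar", "explicit_dir/", "a\\b", "///", "/"]:
    print(repr(raw), "->", repr(normalize_object_key(raw)))
```

Splitting on '/' and discarding empty segments handles leading, duplicate, and trailing slashes in one pass, so the trailing slash has to be recorded first and restored afterwards.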

* prefixEndsOnDelimiter

* Update s3api_object_handlers_list.go

* handle create directory

2025-12-24 19:07:08 -08:00

Name | Last commit | Date
.gitignore | S3: Directly read write volume servers (#7481) | 2025-11-18 23:18:35 -08:00
CROSS_FILESYSTEM_COMPATIBILITY.md | test read write by s3fs and PyArrow native file system for s3 (#7520) | 2025-11-21 11:18:32 -08:00
debug_write_dataset.py | Refactor: Replace removeDuplicateSlashes with NormalizeObjectKey (#7873) | 2025-12-24 19:07:08 -08:00
example_pyarrow_native.py | S3: Add tests for PyArrow with native S3 filesystem (#7508) | 2025-11-19 13:49:22 -08:00
FINAL_ROOT_CAUSE_ANALYSIS.md | S3: Directly read write volume servers (#7481) | 2025-11-18 23:18:35 -08:00
Makefile | fix: add missing backslash for volume extraArgs in helm chart (#7676) | 2025-12-08 23:21:02 -08:00
MINIO_DIRECTORY_HANDLING.md | S3: Directly read write volume servers (#7481) | 2025-11-18 23:18:35 -08:00
parquet_test_utils.py | S3: Add tests for PyArrow with native S3 filesystem (#7508) | 2025-11-19 13:49:22 -08:00
README.md | test read write by s3fs and PyArrow native file system for s3 (#7520) | 2025-11-21 11:18:32 -08:00
requirements.txt | S3: Directly read write volume servers (#7481) | 2025-11-18 23:18:35 -08:00
s3_parquet_test.py | S3: Directly read write volume servers (#7481) | 2025-11-18 23:18:35 -08:00
TEST_COVERAGE.md | S3: Directly read write volume servers (#7481) | 2025-11-18 23:18:35 -08:00
test_cross_filesystem_compatibility.py | test read write by s3fs and PyArrow native file system for s3 (#7520) | 2025-11-21 11:18:32 -08:00
test_implicit_directory_fix.py | Refactor: Replace removeDuplicateSlashes with NormalizeObjectKey (#7873) | 2025-12-24 19:07:08 -08:00
test_pyarrow_native_s3.py | S3: Add tests for PyArrow with native S3 filesystem (#7508) | 2025-11-19 13:49:22 -08:00
test_sse_s3_compatibility.py | S3: Add tests for PyArrow with native S3 filesystem (#7508) | 2025-11-19 13:49:22 -08:00

README.md

PyArrow Parquet S3 Compatibility Tests

This directory contains tests for PyArrow Parquet compatibility with the SeaweedFS S3 API, including the implicit directory detection fix.

Overview

Status: ✅ All PyArrow methods work correctly with SeaweedFS

SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using write_dataset(), it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to use LIST-based discovery.

Quick Start

Running the Example Script

# Start SeaweedFS server
make start-seaweedfs-ci

# Run the example script
python3 example_pyarrow_native.py

# Or with uv (if available)
uv run example_pyarrow_native.py

# Stop the server when done
make stop-seaweedfs-safe

Running Tests

# Setup Python environment
make setup-python

# Run all tests with server (small and large files)
make test-with-server

# Run quick tests with small files only (faster for development)
make test-quick

# Run implicit directory fix tests
make test-implicit-dir-with-server

# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server

# Run cross-filesystem compatibility tests (s3fs ↔ PyArrow native)
make test-cross-fs-with-server

# Run SSE-S3 encryption tests
make test-sse-s3-compat

# Clean up
make clean

Using PyArrow with SeaweedFS

Option 1: Using s3fs (recommended for compatibility)

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import s3fs

# Configure s3fs
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False
)

# Write dataset (creates directory structure)
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
# format is required when writing an in-memory Table
pads.write_dataset(table, 'bucket/dataset', format='parquet', filesystem=fs)

# Read dataset (all methods work!)
dataset = pads.dataset('bucket/dataset', filesystem=fs)  # ✅
table = pq.read_table('bucket/dataset', filesystem=fs)   # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs)  # ✅

Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
    access_key='your_access_key',
    secret_key='your_secret_key',
    endpoint_override='localhost:8333',
    scheme='http',
    allow_bucket_creation=True,
    allow_bucket_deletion=True
)

# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
# format is required when writing an in-memory Table
pads.write_dataset(table, 'bucket/dataset', format='parquet', filesystem=s3)

# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3)  # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3)  # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3)  # ✅

Test Files

Main Test Suite

s3_parquet_test.py - Comprehensive PyArrow test suite
- Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
- Uses s3fs library for S3 operations
- All tests pass with the implicit directory fix ✅

PyArrow Native S3 Tests

test_pyarrow_native_s3.py - PyArrow's native S3 filesystem tests
- Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
- Pure PyArrow solution without s3fs dependency
- Tests 3 read methods × 2 dataset sizes = 6 scenarios
- All tests pass ✅
test_sse_s3_compatibility.py - SSE-S3 encryption compatibility tests
- Tests PyArrow native S3 with SSE-S3 server-side encryption
- Tests 5 different file sizes (10 to 500,000 rows)
- Verifies multipart upload encryption works correctly
- All tests pass ✅

Cross-Filesystem Compatibility Tests

test_cross_filesystem_compatibility.py - Verifies cross-compatibility between s3fs and PyArrow native S3
- Tests write with s3fs → read with PyArrow native S3
- Tests write with PyArrow native S3 → read with s3fs
- Tests 2 directions × 3 read methods × 2 dataset sizes = 12 scenarios
- Validates that files written by either filesystem can be read by the other
- All tests pass ✅
- See CROSS_FILESYSTEM_COMPATIBILITY.md for detailed test results and analysis

Implicit Directory Tests

test_implicit_directory_fix.py - Specific tests for the implicit directory fix
- Tests HEAD request behavior
- Tests s3fs directory detection
- Tests PyArrow dataset reading
- All 6 tests pass ✅

Examples

example_pyarrow_native.py - Simple standalone example
- Demonstrates PyArrow's native S3 filesystem usage
- Can be run with uv run or regular Python
- Minimal dependencies (pyarrow, boto3)

Configuration

Makefile - Build and test automation
requirements.txt - Python dependencies (pyarrow, s3fs, boto3)
.gitignore - Ignore patterns for test artifacts

Documentation

Technical Documentation

TEST_COVERAGE.md - Comprehensive test coverage documentation
- Unit tests (Go): 17 test cases
- Integration tests (Python): 6 test cases
- End-to-end tests (Python): 20 test cases
FINAL_ROOT_CAUSE_ANALYSIS.md - Deep technical analysis
- Root cause of the s3fs compatibility issue
- How the implicit directory fix works
- Performance considerations
CROSS_FILESYSTEM_COMPATIBILITY.md - Cross-filesystem compatibility test results ✅ NEW
- Validates s3fs ↔ PyArrow native S3 interoperability
- Confirms files written by either can be read by the other
- Test methodology and detailed results
MINIO_DIRECTORY_HANDLING.md - Comparison with MinIO
- How MinIO handles directory markers
- Differences in implementation approaches

The Implicit Directory Fix

Problem

When PyArrow writes datasets with write_dataset(), it may create 0-byte directory markers. s3fs's info() method calls HEAD on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".

Solution

SeaweedFS now returns 404 for HEAD requests on implicit directories (0-byte objects or directories with children, when requested without a trailing slash). This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.

Implementation

The fix is implemented in weed/s3api/s3api_object_handlers.go:

HeadObjectHandler - Returns 404 for implicit directories
hasChildren - Helper function to check if a path has children

See the source code for detailed inline documentation.

Test Coverage

Unit tests (Go): weed/s3api/s3api_implicit_directory_test.go
- Run: cd weed/s3api && go test -v -run TestImplicitDirectory
Integration tests (Python): test_implicit_directory_fix.py
- Run: cd test/s3/parquet && make test-implicit-dir-with-server
End-to-end tests (Python): s3_parquet_test.py
- Run: cd test/s3/parquet && make test-with-server

Makefile Targets

# Setup
make setup-python          # Create Python virtual environment and install dependencies
make build-weed           # Build SeaweedFS binary

# Testing
make test                 # Run full tests (assumes server is already running)
make test-with-server     # Run full PyArrow test suite with server (small + large files)
make test-quick           # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server  # Run implicit directory tests with server
make test-native-s3       # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server  # Run PyArrow native S3 tests with server management
make test-cross-fs        # Run cross-filesystem compatibility tests (assumes server is running)
make test-cross-fs-with-server  # Run cross-filesystem compatibility tests with server management
make test-sse-s3-compat   # Run comprehensive SSE-S3 encryption compatibility tests

# Server Management
make start-seaweedfs-ci   # Start SeaweedFS in background (CI mode)
make stop-seaweedfs-safe  # Stop SeaweedFS gracefully
make clean                # Clean up all test artifacts

# Development
make help                 # Show all available targets

Continuous Integration

The tests are automatically run in GitHub Actions on every push/PR that affects S3 or filer code:

Workflow: .github/workflows/s3-parquet-tests.yml

Test Matrix:

Python versions: 3.9, 3.11, 3.12
PyArrow integration tests (s3fs): 20 test combinations
PyArrow native S3 tests: 6 test scenarios ✅
Cross-filesystem compatibility tests: 12 test scenarios ✅ NEW
SSE-S3 encryption tests: 5 file sizes ✅
Implicit directory fix tests: 6 test scenarios
Go unit tests: 17 test cases

Test Steps (run for each Python version):

Build SeaweedFS
Run PyArrow Parquet integration tests (make test-with-server)
Run implicit directory fix tests (make test-implicit-dir-with-server)
Run PyArrow native S3 filesystem tests (make test-native-s3-with-server)
Run cross-filesystem compatibility tests (make test-cross-fs-with-server) ✅ NEW
Run SSE-S3 encryption compatibility tests (make test-sse-s3-compat)
Run Go unit tests for implicit directory handling

Triggers:

Push/PR to master (when weed/s3api/** or weed/filer/** changes)
Manual trigger via GitHub UI (workflow_dispatch)

Requirements

Python 3.8+
PyArrow 22.0.0+
s3fs 2024.12.0+
boto3 1.40.0+
SeaweedFS (latest)

AWS S3 Compatibility

The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:

AWS S3 typically doesn't create directory markers for implicit directories
HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
SeaweedFS now matches this behavior for implicit directories with children

Edge Cases Handled

✅ Implicit directories with children → 404 (forces LIST-based discovery)
✅ Empty files (0-byte, no children) → 200 (legitimate empty file)
✅ Empty directories (no children) → 200 (legitimate empty directory)
✅ Explicit directory requests (trailing slash) → 200 (normal directory behavior)
✅ Versioned buckets → Skip implicit directory check (versioned semantics)
✅ Regular files → 200 (normal file behavior)
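
The edge cases above can be read as a small decision table. The following Python sketch restates them for an entry that already exists; it is not the Go HeadObjectHandler, and `HeadRequest`, its fields, and `head_status` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeadRequest:
    key: str                       # object key as requested by the client
    is_directory: bool             # the stored entry is a directory
    size: int                      # stored size in bytes
    has_children: bool             # at least one entry exists under key + '/'
    versioned_bucket: bool = False


def head_status(req: HeadRequest) -> int:
    """Return the HTTP status the edge-case list describes for an existing entry."""
    if req.versioned_bucket:
        return 200                 # implicit directory check is skipped
    if req.key.endswith("/"):
        return 200                 # explicit directory request
    candidate = req.is_directory or req.size == 0
    if candidate and req.has_children:
        return 404                 # implicit directory: force LIST-based discovery
    return 200                     # regular file, empty file, or empty directory


print(head_status(HeadRequest("dataset", True, 0, True)))        # implicit directory
print(head_status(HeadRequest("empty.txt", False, 0, False)))    # legitimate empty file
```

The children lookup is only needed for 0-byte objects and directories requested without a trailing slash, which is why regular files never pay for it.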

Performance

The implicit directory check adds minimal overhead:

Only triggered for 0-byte objects or directories without trailing slash
Cost: One LIST operation with Limit=1 (~1-5ms)
No impact on regular file operations
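
A client-side analogue of the "one LIST with Limit=1" check can be sketched with boto3's `list_objects_v2`. This is an assumption-laden illustration, not the server-side hasChildren helper (which works against the filer); the function name and parameters are hypothetical. One known limitation: if an explicit marker object named `key + '/'` exists, the listing returns it and this sketch counts it as a child.

```python
def has_children(s3_client, bucket: str, key: str) -> bool:
    """Return True if at least one object exists under key + '/'.

    s3_client is expected to be a boto3 S3 client (or anything exposing
    a compatible list_objects_v2 method).
    """
    prefix = key.rstrip("/") + "/"
    # MaxKeys=1 keeps the cost at a single, minimal LIST request.
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    # boto3 omits "Contents" when nothing matches the prefix.
    return bool(resp.get("Contents"))
```

Typical usage would pass a client created with `boto3.client('s3', endpoint_url='http://localhost:8333', ...)`, matching the endpoint used in the examples earlier in this README.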

Contributing

When adding new tests:

Add test cases to the appropriate test file
Update TEST_COVERAGE.md
Run the full test suite to ensure no regressions
Update this README if adding new functionality

References

Last Updated: November 19, 2025
Status: All tests passing ✅
