* test read write by s3fs and PyArrow native file system for s3 * address comments * add github action
313 lines
11 KiB
Markdown
313 lines
11 KiB
Markdown
# PyArrow Parquet S3 Compatibility Tests
|
||
|
||
This directory contains tests for PyArrow Parquet compatibility with SeaweedFS S3 API, including the implicit directory detection fix.
|
||
|
||
## Overview
|
||
|
||
**Status**: ✅ **All PyArrow methods work correctly with SeaweedFS**
|
||
|
||
SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using `write_dataset()`, it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to use LIST-based discovery.
|
||
|
||
## Quick Start
|
||
|
||
### Running the Example Script
|
||
|
||
```bash
|
||
# Start SeaweedFS server
|
||
make start-seaweedfs-ci
|
||
|
||
# Run the example script
|
||
python3 example_pyarrow_native.py
|
||
|
||
# Or with uv (if available)
|
||
uv run example_pyarrow_native.py
|
||
|
||
# Stop the server when done
|
||
make stop-seaweedfs-safe
|
||
```
|
||
|
||
### Running Tests
|
||
|
||
```bash
|
||
# Setup Python environment
|
||
make setup-python
|
||
|
||
# Run all tests with server (small and large files)
|
||
make test-with-server
|
||
|
||
# Run quick tests with small files only (faster for development)
|
||
make test-quick
|
||
|
||
# Run implicit directory fix tests
|
||
make test-implicit-dir-with-server
|
||
|
||
# Run PyArrow native S3 filesystem tests
|
||
make test-native-s3-with-server
|
||
|
||
# Run cross-filesystem compatibility tests (s3fs ↔ PyArrow native)
|
||
make test-cross-fs-with-server
|
||
|
||
# Run SSE-S3 encryption tests
|
||
make test-sse-s3-compat
|
||
|
||
# Clean up
|
||
make clean
|
||
```
|
||
|
||
### Using PyArrow with SeaweedFS
|
||
|
||
#### Option 1: Using s3fs (recommended for compatibility)
|
||
|
||
```python
|
||
import pyarrow as pa
|
||
import pyarrow.parquet as pq
|
||
import pyarrow.dataset as pads
|
||
import s3fs
|
||
|
||
# Configure s3fs
|
||
fs = s3fs.S3FileSystem(
|
||
key='your_access_key',
|
||
secret='your_secret_key',
|
||
endpoint_url='http://localhost:8333',
|
||
use_ssl=False
|
||
)
|
||
|
||
# Write dataset (creates directory structure)
|
||
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
|
||
pads.write_dataset(table, 'bucket/dataset', filesystem=fs)
|
||
|
||
# Read dataset (all methods work!)
|
||
dataset = pads.dataset('bucket/dataset', filesystem=fs) # ✅
|
||
table = pq.read_table('bucket/dataset', filesystem=fs) # ✅
|
||
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
|
||
```
|
||
|
||
#### Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)
|
||
|
||
```python
|
||
import pyarrow as pa
|
||
import pyarrow.parquet as pq
|
||
import pyarrow.dataset as pads
|
||
import pyarrow.fs as pafs
|
||
|
||
# Configure PyArrow's native S3 filesystem
|
||
s3 = pafs.S3FileSystem(
|
||
access_key='your_access_key',
|
||
secret_key='your_secret_key',
|
||
endpoint_override='localhost:8333',
|
||
scheme='http',
|
||
allow_bucket_creation=True,
|
||
allow_bucket_deletion=True
|
||
)
|
||
|
||
# Write dataset
|
||
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
|
||
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)
|
||
|
||
# Read dataset (all methods work!)
|
||
table = pq.read_table('bucket/dataset', filesystem=s3) # ✅
|
||
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3) # ✅
|
||
dataset = pads.dataset('bucket/dataset', filesystem=s3) # ✅
|
||
```
|
||
|
||
## Test Files
|
||
|
||
### Main Test Suite
|
||
- **`s3_parquet_test.py`** - Comprehensive PyArrow test suite
|
||
- Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
|
||
- Uses s3fs library for S3 operations
|
||
- All tests pass with the implicit directory fix ✅
|
||
|
||
### PyArrow Native S3 Tests
|
||
- **`test_pyarrow_native_s3.py`** - PyArrow's native S3 filesystem tests
|
||
- Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
|
||
- Pure PyArrow solution without s3fs dependency
|
||
- Tests 3 read methods × 2 dataset sizes = 6 scenarios
|
||
- All tests pass ✅
|
||
|
||
- **`test_sse_s3_compatibility.py`** - SSE-S3 encryption compatibility tests
|
||
- Tests PyArrow native S3 with SSE-S3 server-side encryption
|
||
- Tests 5 different file sizes (10 to 500,000 rows)
|
||
- Verifies multipart upload encryption works correctly
|
||
- All tests pass ✅
|
||
|
||
### Cross-Filesystem Compatibility Tests
|
||
- **`test_cross_filesystem_compatibility.py`** - Verifies cross-compatibility between s3fs and PyArrow native S3
|
||
- Tests write with s3fs → read with PyArrow native S3
|
||
- Tests write with PyArrow native S3 → read with s3fs
|
||
- Tests 2 directions × 3 read methods × 2 dataset sizes = 12 scenarios
|
||
- Validates that files written by either filesystem can be read by the other
|
||
- **All tests pass** ✅
|
||
- See **`CROSS_FILESYSTEM_COMPATIBILITY.md`** for detailed test results and analysis
|
||
|
||
### Implicit Directory Tests
|
||
- **`test_implicit_directory_fix.py`** - Specific tests for the implicit directory fix
|
||
- Tests HEAD request behavior
|
||
- Tests s3fs directory detection
|
||
- Tests PyArrow dataset reading
|
||
- All 6 tests pass ✅
|
||
|
||
### Examples
|
||
- **`example_pyarrow_native.py`** - Simple standalone example
|
||
- Demonstrates PyArrow's native S3 filesystem usage
|
||
- Can be run with `uv run` or regular Python
|
||
- Minimal dependencies (pyarrow, boto3)
|
||
|
||
### Configuration
|
||
- **`Makefile`** - Build and test automation
|
||
- **`requirements.txt`** - Python dependencies (pyarrow, s3fs, boto3)
|
||
- **`.gitignore`** - Ignore patterns for test artifacts
|
||
|
||
## Documentation
|
||
|
||
### Technical Documentation
|
||
- **`TEST_COVERAGE.md`** - Comprehensive test coverage documentation
|
||
- Unit tests (Go): 17 test cases
|
||
- Integration tests (Python): 6 test cases
|
||
- End-to-end tests (Python): 20 test cases
|
||
|
||
- **`FINAL_ROOT_CAUSE_ANALYSIS.md`** - Deep technical analysis
|
||
- Root cause of the s3fs compatibility issue
|
||
- How the implicit directory fix works
|
||
- Performance considerations
|
||
|
||
- **`CROSS_FILESYSTEM_COMPATIBILITY.md`** - Cross-filesystem compatibility test results ✅ **NEW**
|
||
- Validates s3fs ↔ PyArrow native S3 interoperability
|
||
- Confirms files written by either can be read by the other
|
||
- Test methodology and detailed results
|
||
|
||
- **`MINIO_DIRECTORY_HANDLING.md`** - Comparison with MinIO
|
||
- How MinIO handles directory markers
|
||
- Differences in implementation approaches
|
||
|
||
## The Implicit Directory Fix
|
||
|
||
### Problem
|
||
When PyArrow writes datasets with `write_dataset()`, it may create 0-byte directory markers. s3fs's `info()` method calls HEAD on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".
|
||
|
||
### Solution
|
||
SeaweedFS now returns 404 for HEAD requests on implicit directories (0-byte objects or directories with children, when requested without a trailing slash). This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.
|
||
|
||
### Implementation
|
||
The fix is implemented in `weed/s3api/s3api_object_handlers.go`:
|
||
- `HeadObjectHandler` - Returns 404 for implicit directories
|
||
- `hasChildren` - Helper function to check if a path has children
|
||
|
||
See the source code for detailed inline documentation.
|
||
|
||
### Test Coverage
|
||
- **Unit tests** (Go): `weed/s3api/s3api_implicit_directory_test.go`
|
||
- Run: `cd weed/s3api && go test -v -run TestImplicitDirectory`
|
||
|
||
- **Integration tests** (Python): `test_implicit_directory_fix.py`
|
||
- Run: `cd test/s3/parquet && make test-implicit-dir-with-server`
|
||
|
||
- **End-to-end tests** (Python): `s3_parquet_test.py`
|
||
- Run: `cd test/s3/parquet && make test-with-server`
|
||
|
||
## Makefile Targets
|
||
|
||
```bash
|
||
# Setup
|
||
make setup-python # Create Python virtual environment and install dependencies
|
||
make build-weed # Build SeaweedFS binary
|
||
|
||
# Testing
|
||
make test # Run full tests (assumes server is already running)
|
||
make test-with-server # Run full PyArrow test suite with server (small + large files)
|
||
make test-quick # Run quick tests with small files only (assumes server is running)
|
||
make test-implicit-dir-with-server # Run implicit directory tests with server
|
||
make test-native-s3 # Run PyArrow native S3 tests (assumes server is running)
|
||
make test-native-s3-with-server # Run PyArrow native S3 tests with server management
|
||
make test-cross-fs # Run cross-filesystem compatibility tests (assumes server is running)
|
||
make test-cross-fs-with-server # Run cross-filesystem compatibility tests with server management
|
||
make test-sse-s3-compat # Run comprehensive SSE-S3 encryption compatibility tests
|
||
|
||
# Server Management
|
||
make start-seaweedfs-ci # Start SeaweedFS in background (CI mode)
|
||
make stop-seaweedfs-safe # Stop SeaweedFS gracefully
|
||
make clean # Clean up all test artifacts
|
||
|
||
# Development
|
||
make help # Show all available targets
|
||
```
|
||
|
||
## Continuous Integration
|
||
|
||
The tests are automatically run in GitHub Actions on every push/PR that affects S3 or filer code:
|
||
|
||
**Workflow**: `.github/workflows/s3-parquet-tests.yml`
|
||
|
||
**Test Matrix**:
|
||
- Python versions: 3.9, 3.11, 3.12
|
||
- PyArrow integration tests (s3fs): 20 test combinations
|
||
- PyArrow native S3 tests: 6 test scenarios ✅
|
||
- Cross-filesystem compatibility tests: 12 test scenarios ✅ **NEW**
|
||
- SSE-S3 encryption tests: 5 file sizes ✅
|
||
- Implicit directory fix tests: 6 test scenarios
|
||
- Go unit tests: 17 test cases
|
||
|
||
**Test Steps** (run for each Python version):
|
||
1. Build SeaweedFS
|
||
2. Run PyArrow Parquet integration tests (`make test-with-server`)
|
||
3. Run implicit directory fix tests (`make test-implicit-dir-with-server`)
|
||
4. Run PyArrow native S3 filesystem tests (`make test-native-s3-with-server`)
|
||
5. Run cross-filesystem compatibility tests (`make test-cross-fs-with-server`) ✅ **NEW**
|
||
6. Run SSE-S3 encryption compatibility tests (`make test-sse-s3-compat`)
|
||
7. Run Go unit tests for implicit directory handling
|
||
|
||
**Triggers**:
|
||
- Push/PR to master (when `weed/s3api/**` or `weed/filer/**` changes)
|
||
- Manual trigger via GitHub UI (workflow_dispatch)
|
||
|
||
## Requirements
|
||
|
||
- Python 3.8+
|
||
- PyArrow 22.0.0+
|
||
- s3fs 2024.12.0+
|
||
- boto3 1.40.0+
|
||
- SeaweedFS (latest)
|
||
|
||
## AWS S3 Compatibility
|
||
|
||
The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:
|
||
- AWS S3 typically doesn't create directory markers for implicit directories
|
||
- HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
|
||
- SeaweedFS now matches this behavior for implicit directories with children
|
||
|
||
## Edge Cases Handled
|
||
|
||
✅ **Implicit directories with children** → 404 (forces LIST-based discovery)
|
||
✅ **Empty files (0-byte, no children)** → 200 (legitimate empty file)
|
||
✅ **Empty directories (no children)** → 200 (legitimate empty directory)
|
||
✅ **Explicit directory requests (trailing slash)** → 200 (normal directory behavior)
|
||
✅ **Versioned buckets** → Skip implicit directory check (versioned semantics)
|
||
✅ **Regular files** → 200 (normal file behavior)
|
||
|
||
## Performance
|
||
|
||
The implicit directory check adds minimal overhead:
|
||
- Only triggered for 0-byte objects or directories without trailing slash
|
||
- Cost: One LIST operation with Limit=1 (~1-5ms)
|
||
- No impact on regular file operations
|
||
|
||
## Contributing
|
||
|
||
When adding new tests:
|
||
1. Add test cases to the appropriate test file
|
||
2. Update TEST_COVERAGE.md
|
||
3. Run the full test suite to ensure no regressions
|
||
4. Update this README if adding new functionality
|
||
|
||
## References
|
||
|
||
- [PyArrow Documentation](https://arrow.apache.org/docs/python/parquet.html)
|
||
- [s3fs Documentation](https://s3fs.readthedocs.io/)
|
||
- [SeaweedFS S3 API](https://github.com/seaweedfs/seaweedfs/wiki/Amazon-S3-API)
|
||
- [AWS S3 API Reference](https://docs.aws.amazon.com/AmazonS3/latest/API/)
|
||
|
||
---
|
||
|
||
**Last Updated**: November 19, 2025
|
||
**Status**: All tests passing ✅
|