S3: Add tests for PyArrow with native S3 filesystem (#7508)

* PyArrow native S3 filesystem

* add sse-s3 tests

* update

* minor

* ENABLE_SSE_S3

* Update test_pyarrow_native_s3.py

* clean up

* refactoring

* Update test_pyarrow_native_s3.py
commit 8be9e258fc (parent ca84a8a713)
Author: Chris Lu
Date: 2025-11-19 13:49:22 -08:00
Committed by: GitHub
7 changed files with 1008 additions and 5 deletions

@@ -10,6 +10,22 @@ SeaweedFS implements implicit directory detection to improve compatibility with
## Quick Start
### Running the Example Script
```bash
# Start SeaweedFS server
make start-seaweedfs-ci
# Run the example script
python3 example_pyarrow_native.py
# Or with uv (if available)
uv run example_pyarrow_native.py
# Stop the server when done
make stop-seaweedfs-safe
```
### Running Tests
```bash
@@ -25,12 +41,20 @@ make test-quick
# Run implicit directory fix tests
make test-implicit-dir-with-server
# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server
# Run SSE-S3 encryption tests
make test-sse-s3-compat
# Clean up
make clean
```
### Using PyArrow with SeaweedFS
#### Option 1: Using s3fs (recommended for compatibility)
```python
import pyarrow as pa
import pyarrow.parquet as pq
@@ -55,13 +79,55 @@ table = pq.read_table('bucket/dataset', filesystem=fs) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
```
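A minimal sketch of a typical s3fs configuration for this setup, assuming a local SeaweedFS S3 gateway on `localhost:8333` and placeholder credentials:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

# Placeholder credentials; endpoint assumes a local SeaweedFS S3 gateway.
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    client_kwargs={'endpoint_url': 'http://localhost:8333'},
)

# Write a small dataset, then read it back through the same filesystem.
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pq.write_to_dataset(table, 'bucket/dataset', filesystem=fs)
print(pq.read_table('bucket/dataset', filesystem=fs))
```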
#### Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs
# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
access_key='your_access_key',
secret_key='your_secret_key',
endpoint_override='localhost:8333',
scheme='http',
allow_bucket_creation=True,
allow_bucket_deletion=True
)
# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)
# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3) # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3) # ✅
```
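A note on the `allow_bucket_creation` flag: PyArrow routes bucket creation through `create_dir`, so the bucket can also be created explicitly before writing. A sketch reusing the `s3` handle above (the bucket and path names are this example's, not required ones):
```python
# Create the bucket explicitly; with allow_bucket_creation=True a
# create_dir on a top-level path maps to an S3 CreateBucket call.
s3.create_dir('bucket')

# Inspect what write_dataset produced under the dataset prefix.
selector = pafs.FileSelector('bucket/dataset', recursive=True)
for info in s3.get_file_info(selector):
    print(info.path, info.type, info.size)
```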
## Test Files
### Main Test Suite
- **`s3_parquet_test.py`** - Comprehensive PyArrow test suite
- Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations (enumerated in the sketch below)
- Uses s3fs library for S3 operations
- All tests pass with the implicit directory fix ✅
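How the 20 combinations arise, as a hypothetical enumeration (the labels below are placeholders for illustration; the suite defines its own write and read methods):
```python
from itertools import product

# Placeholder labels; the real suite defines its own method names.
write_methods = ['write_to_dataset', 'write_dataset']          # 2
read_methods = ['read_table', 'ParquetDataset', 'dataset',
                'read_pandas', 'fragment_scan']                 # 5
sizes = ['small', 'large']                                      # 2

combos = list(product(write_methods, read_methods, sizes))
assert len(combos) == 20  # 2 x 5 x 2
```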
### PyArrow Native S3 Tests
- **`test_pyarrow_native_s3.py`** - PyArrow's native S3 filesystem tests
- Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
- Pure PyArrow solution without s3fs dependency
- Tests 3 read methods × 2 dataset sizes = 6 scenarios
- All tests pass ✅
- **`test_sse_s3_compatibility.py`** - SSE-S3 encryption compatibility tests
- Tests PyArrow native S3 with SSE-S3 server-side encryption
- Tests 5 different file sizes (10 to 500,000 rows)
- Verifies multipart upload encryption works correctly (spot-check sketched below)
- All tests pass ✅
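One way to spot-check SSE-S3 from outside the suite is a HEAD request on an uploaded object. A minimal boto3 sketch, with placeholder credentials and an illustrative key name; whether the gateway echoes the header exactly this way is for the test file to confirm:
```python
import boto3

# Placeholder credentials; endpoint assumes a local SeaweedFS S3 gateway.
s3c = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

# HEAD the object; SSE-S3 objects report ServerSideEncryption == 'AES256'.
resp = s3c.head_object(Bucket='bucket', Key='dataset/part-0.parquet')
print(resp.get('ServerSideEncryption'))  # expect 'AES256' when SSE-S3 is on
```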
### Implicit Directory Tests
- **`test_implicit_directory_fix.py`** - Specific tests for the implicit directory fix
- Tests HEAD request behavior (sketched after this list)
@@ -69,6 +135,12 @@ dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
- Tests PyArrow dataset reading
- All 6 tests pass ✅
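The HEAD behavior in question, sketched with boto3: when only `dataset/part-0.parquet` exists, a HEAD on the bare `dataset` key should let clients such as s3fs and PyArrow recognize an implicit directory instead of failing outright. Names and the exact server response below are illustrative; the test file is authoritative.
```python
import boto3
from botocore.exceptions import ClientError

s3c = boto3.client('s3', endpoint_url='http://localhost:8333',
                   aws_access_key_id='your_access_key',
                   aws_secret_access_key='your_secret_key')

# HEAD the bare prefix of an implicit directory and report the outcome.
try:
    resp = s3c.head_object(Bucket='bucket', Key='dataset')
    print('HEAD dataset ->', resp['ResponseMetadata']['HTTPStatusCode'],
          resp.get('ContentType'))
except ClientError as e:
    print('HEAD dataset ->', e.response['Error']['Code'])
```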
### Examples
- **`example_pyarrow_native.py`** - Simple standalone example
- Demonstrates PyArrow's native S3 filesystem usage
- Can be run with `uv run` or regular Python
- Minimal dependencies (pyarrow, boto3)
### Configuration
- **`Makefile`** - Build and test automation
- **`requirements.txt`** - Python dependencies (pyarrow, s3fs, boto3)
@@ -128,6 +200,9 @@ make test # Run full tests (assumes server is already running)
make test-with-server # Run full PyArrow test suite with server (small + large files)
make test-quick # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server # Run implicit directory tests with server
make test-native-s3 # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server # Run PyArrow native S3 tests with server management
make test-sse-s3-compat # Run comprehensive SSE-S3 encryption compatibility tests
# Server Management
make start-seaweedfs-ci # Start SeaweedFS in background (CI mode)
@@ -146,10 +221,20 @@ The tests are automatically run in GitHub Actions on every push/PR that affects
**Test Matrix**:
- Python versions: 3.9, 3.11, 3.12
- PyArrow integration tests (s3fs): 20 test combinations
- PyArrow native S3 tests: 6 test scenarios ✅ **NEW**
- SSE-S3 encryption tests: 5 file sizes ✅ **NEW**
- Implicit directory fix tests: 6 test scenarios
- Go unit tests: 17 test cases
**Test Steps** (run for each Python version):
1. Build SeaweedFS
2. Run PyArrow Parquet integration tests (`make test-with-server`)
3. Run implicit directory fix tests (`make test-implicit-dir-with-server`)
4. Run PyArrow native S3 filesystem tests (`make test-native-s3-with-server`) ✅ **NEW**
5. Run SSE-S3 encryption compatibility tests (`make test-sse-s3-compat`) ✅ **NEW**
6. Run Go unit tests for implicit directory handling
**Triggers**:
- Push/PR to master (when `weed/s3api/**` or `weed/filer/**` changes)
- Manual trigger via GitHub UI (workflow_dispatch)