S3: Add tests for PyArrow with native S3 filesystem (#7508)

* PyArrow native S3 filesystem

* add sse-s3 tests

* update

* minor

* ENABLE_SSE_S3

* Update test_pyarrow_native_s3.py

* clean up

* refactoring

* Update test_pyarrow_native_s3.py
commit 8be9e258fc (parent ca84a8a713)
Author: Chris Lu
Date: 2025-11-19 13:49:22 -08:00
Committed by: GitHub
7 changed files with 1008 additions and 5 deletions

@@ -10,6 +10,22 @@ SeaweedFS implements implicit directory detection to improve compatibility with
## Quick Start
### Running the Example Script
```bash
# Start SeaweedFS server
make start-seaweedfs-ci
# Run the example script
python3 example_pyarrow_native.py
# Or with uv (if available)
uv run example_pyarrow_native.py
# Stop the server when done
make stop-seaweedfs-safe
```
### Running Tests
```bash
@@ -25,12 +41,20 @@ make test-quick
# Run implicit directory fix tests
make test-implicit-dir-with-server
# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server
# Run SSE-S3 encryption tests
make test-sse-s3-compat
# Clean up
make clean
```
### Using PyArrow with SeaweedFS
#### Option 1: Using s3fs (recommended for compatibility)
```python
import pyarrow as pa
import pyarrow.parquet as pq
@@ -55,13 +79,55 @@ table = pq.read_table('bucket/dataset', filesystem=fs) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
```
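A minimal sketch of a typical s3fs configuration for this setup, assuming a local SeaweedFS S3 gateway on `localhost:8333` and placeholder credentials:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

# Placeholder credentials; endpoint assumes a local SeaweedFS S3 gateway.
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    client_kwargs={'endpoint_url': 'http://localhost:8333'},
)

# Write a small dataset, then read it back through the same filesystem.
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pq.write_to_dataset(table, 'bucket/dataset', filesystem=fs)
print(pq.read_table('bucket/dataset', filesystem=fs))
```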
#### Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs
# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
access_key='your_access_key',
secret_key='your_secret_key',
endpoint_override='localhost:8333',
scheme='http',
allow_bucket_creation=True,
allow_bucket_deletion=True
)
# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)
# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3) # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3) # ✅
```
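A note on the `allow_bucket_creation` flag: PyArrow routes bucket creation through `create_dir`, so the bucket can also be created explicitly before writing. A sketch reusing the `s3` handle above (the bucket and path names are this example's, not required ones):
```python
# Create the bucket explicitly; with allow_bucket_creation=True a
# create_dir on a top-level path maps to an S3 CreateBucket call.
s3.create_dir('bucket')

# Inspect what write_dataset produced under the dataset prefix.
selector = pafs.FileSelector('bucket/dataset', recursive=True)
for info in s3.get_file_info(selector):
    print(info.path, info.type, info.size)
```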
## Test Files
### Main Test Suite
- **`s3_parquet_test.py`** - Comprehensive PyArrow test suite
- Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations (enumerated in the sketch below)
- Uses s3fs library for S3 operations
- All tests pass with the implicit directory fix ✅
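How the 20 combinations arise, as a hypothetical enumeration (the labels below are placeholders for illustration; the suite defines its own write and read methods):
```python
from itertools import product

# Placeholder labels; the real suite defines its own method names.
write_methods = ['write_to_dataset', 'write_dataset']          # 2
read_methods = ['read_table', 'ParquetDataset', 'dataset',
                'read_pandas', 'fragment_scan']                 # 5
sizes = ['small', 'large']                                      # 2

combos = list(product(write_methods, read_methods, sizes))
assert len(combos) == 20  # 2 x 5 x 2
```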
### PyArrow Native S3 Tests
- **`test_pyarrow_native_s3.py`** - PyArrow's native S3 filesystem tests
- Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
- Pure PyArrow solution without s3fs dependency
- Tests 3 read methods × 2 dataset sizes = 6 scenarios
- All tests pass ✅
- **`test_sse_s3_compatibility.py`** - SSE-S3 encryption compatibility tests
- Tests PyArrow native S3 with SSE-S3 server-side encryption
- Tests 5 different file sizes (10 to 500,000 rows)
- Verifies multipart upload encryption works correctly (spot-check sketched below)
- All tests pass ✅
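One way to spot-check SSE-S3 from outside the suite is a HEAD request on an uploaded object. A minimal boto3 sketch, with placeholder credentials and an illustrative key name; whether the gateway echoes the header exactly this way is for the test file to confirm:
```python
import boto3

# Placeholder credentials; endpoint assumes a local SeaweedFS S3 gateway.
s3c = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

# HEAD the object; SSE-S3 objects report ServerSideEncryption == 'AES256'.
resp = s3c.head_object(Bucket='bucket', Key='dataset/part-0.parquet')
print(resp.get('ServerSideEncryption'))  # expect 'AES256' when SSE-S3 is on
```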
### Implicit Directory Tests
- **`test_implicit_directory_fix.py`** - Specific tests for the implicit directory fix
- Tests HEAD request behavior (sketched after this list)
@@ -69,6 +135,12 @@ dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
- Tests PyArrow dataset reading
- All 6 tests pass ✅
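The HEAD behavior in question, sketched with boto3: when only `dataset/part-0.parquet` exists, a HEAD on the bare `dataset` key should let clients such as s3fs and PyArrow recognize an implicit directory instead of failing outright. Names and the exact server response below are illustrative; the test file is authoritative.
```python
import boto3
from botocore.exceptions import ClientError

s3c = boto3.client('s3', endpoint_url='http://localhost:8333',
                   aws_access_key_id='your_access_key',
                   aws_secret_access_key='your_secret_key')

# HEAD the bare prefix of an implicit directory and report the outcome.
try:
    resp = s3c.head_object(Bucket='bucket', Key='dataset')
    print('HEAD dataset ->', resp['ResponseMetadata']['HTTPStatusCode'],
          resp.get('ContentType'))
except ClientError as e:
    print('HEAD dataset ->', e.response['Error']['Code'])
```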
### Examples
- **`example_pyarrow_native.py`** - Simple standalone example
- Demonstrates PyArrow's native S3 filesystem usage
- Can be run with `uv run` or regular Python
- Minimal dependencies (pyarrow, boto3)
### Configuration
- **`Makefile`** - Build and test automation
- **`requirements.txt`** - Python dependencies (pyarrow, s3fs, boto3)
@@ -128,6 +200,9 @@ make test # Run full tests (assumes server is already running)
make test-with-server # Run full PyArrow test suite with server (small + large files)
make test-quick # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server # Run implicit directory tests with server
make test-native-s3 # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server # Run PyArrow native S3 tests with server management
make test-sse-s3-compat # Run comprehensive SSE-S3 encryption compatibility tests
# Server Management
make start-seaweedfs-ci # Start SeaweedFS in background (CI mode)
@@ -146,10 +221,20 @@ The tests are automatically run in GitHub Actions on every push/PR that affects
**Test Matrix**:
- Python versions: 3.9, 3.11, 3.12
- PyArrow integration tests (s3fs): 20 test combinations
- PyArrow native S3 tests: 6 test scenarios ✅ **NEW**
- SSE-S3 encryption tests: 5 file sizes ✅ **NEW**
- Implicit directory fix tests: 6 test scenarios
- Go unit tests: 17 test cases
**Test Steps** (run for each Python version):
1. Build SeaweedFS
2. Run PyArrow Parquet integration tests (`make test-with-server`)
3. Run implicit directory fix tests (`make test-implicit-dir-with-server`)
4. Run PyArrow native S3 filesystem tests (`make test-native-s3-with-server`) ✅ **NEW**
5. Run SSE-S3 encryption compatibility tests (`make test-sse-s3-compat`) ✅ **NEW**
6. Run Go unit tests for implicit directory handling
**Triggers**:
- Push/PR to master (when `weed/s3api/**` or `weed/filer/**` changes)
- Manual trigger via GitHub UI (workflow_dispatch)