Refactor S3 integration tests to use weed mini (#7877)

* Refactor S3 integration tests to use weed mini

* Fix weed mini flags for sse and parquet tests

* Fix IAM test startup: remove -iam.config flag from weed mini

* Enhance logging in IAM Makefile to debug startup failure

* Simplify weed mini flags and checks in S3 tests (IAM, Parquet, SSE, Copying)

* Simplify weed mini flags and checks in all S3 tests

* Fix IAM tests: use -s3.iam.config for weed mini

* Replace timeout command with portable loop in IAM Makefile

* Standardize portable loop-based readiness checks in all S3 Makefiles

* Define SERVER_DIR in retention Makefile

* Fix versioning and retention Makefiles: remove unsupported weed mini flags

* fix filer_group test

* fix cors

* emojis

* fix sse

* fix retention

* fixes

* fix

* fixes

* fix parquet

* fixes

* fix

* clean up

* avoid duplicated debug server

* Update .gitignore

* simplify

* clean up

* add credentials

* bind

* delay

* Update Makefile

* Update Makefile

* check ready

* delay

* update remote credentials

* Update Makefile

* clean up

* kill

* Update Makefile

* update credentials
This commit is contained in:
Chris Lu
2025-12-25 11:00:54 -08:00
committed by GitHub
parent 2f6aa98221
commit 7064ad420d
21 changed files with 241 additions and 854 deletions


@@ -1,172 +0,0 @@
# Cross-Filesystem Compatibility Test Results
## Overview
This document summarizes the cross-filesystem compatibility testing between **s3fs** and **PyArrow native S3 filesystem** implementations when working with SeaweedFS.
## Test Purpose
Verify that Parquet files written using one filesystem implementation (s3fs or PyArrow native S3) can be correctly read using the other implementation, confirming true file format compatibility.
## Test Methodology
### Test Matrix
The test performs the following combinations:
1. **Write with s3fs → Read with PyArrow native S3**
2. **Write with PyArrow native S3 → Read with s3fs**
For each direction, the test:
- Creates a sample PyArrow table with multiple data types (int64, string, float64, bool)
- Writes the Parquet file using one filesystem implementation
- Reads the Parquet file using the other filesystem implementation
- Verifies data integrity by comparing:
  - Row counts
  - Schema equality
  - Data contents (after sorting by ID to handle row order differences)
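The three integrity checks listed above can be sketched in plain Python. This is only an illustration, not the test's actual code: lists of dicts stand in for PyArrow tables, and the helper name `tables_match` and the column names other than `id` are assumptions.

```python
def tables_match(written, read_back, key="id"):
    """Compare two row collections using the three checks described above."""
    # 1. Row counts must agree.
    if len(written) != len(read_back):
        return False

    # 2. Schemas must agree. A column-name -> Python-type mapping stands in
    #    for a real Parquet schema in this sketch.
    def schema(rows):
        return {name: type(value) for name, value in rows[0].items()} if rows else {}

    if schema(written) != schema(read_back):
        return False

    # 3. Sort both sides by the key column so row order does not matter.
    def by_key(row):
        return row[key]

    return sorted(written, key=by_key) == sorted(read_back, key=by_key)


# Hypothetical sample rows covering int, string, float, and bool columns.
rows = [
    {"id": 1, "name": "a", "value": 1.5, "flag": True},
    {"id": 2, "name": "b", "value": 2.5, "flag": False},
]
same_data_different_order = list(reversed(rows))
```

Sorting before comparing matters because a dataset reader is free to return rows in a different order than they were written.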
### File Sizes Tested
- **Small files**: 5 rows (quick validation)
- **Large files**: 200,000 rows (multi-row-group validation)
## Test Results
### ✅ Small Files (5 rows)
| Write Method | Read Method | Result | Read Function Used |
|--------------|-------------|--------|--------------------|
| s3fs | PyArrow native S3 | ✅ PASS | pq.read_table |
| PyArrow native S3 | s3fs | ✅ PASS | pq.read_table |
**Status**: **ALL TESTS PASSED**
### Large Files (200,000 rows)
Large file testing requires adequate volume capacity in SeaweedFS. When run with default volume settings (50MB max size), tests may encounter capacity issues with the number of large test files created simultaneously.
**Recommendation**: For large file testing, increase `VOLUME_MAX_SIZE_MB` in the Makefile or run tests with `TEST_QUICK=1` for development/validation purposes.
## Key Findings
### ✅ Full Compatibility Confirmed
**Files written with s3fs and PyArrow native S3 filesystem are fully compatible and can be read by either implementation.**
This confirms that:
1. **Identical Parquet Format**: Both s3fs and PyArrow native S3 use the same underlying PyArrow library to generate Parquet files, resulting in identical file formats at the binary level.
2. **S3 API Compatibility**: SeaweedFS's S3 implementation handles both filesystem backends correctly, with proper:
   - Object creation (PutObject)
   - Object reading (GetObject)
   - Directory handling (implicit directories)
   - Multipart uploads (for larger files)
3. **Metadata Consistency**: File metadata, schemas, and data integrity are preserved across both write and read operations regardless of which filesystem implementation is used.
## Implementation Details
### Common Write Path
Both implementations write through PyArrow's `pyarrow.dataset.write_dataset()` function (imported here as `pads`); only the `filesystem` argument differs:
```python
import pyarrow.dataset as pads
import pyarrow.fs as pafs
import s3fs

# s3fs approach
fs = s3fs.S3FileSystem(...)
pads.write_dataset(table, path, format="parquet", filesystem=fs)

# PyArrow native approach
s3 = pafs.S3FileSystem(...)
pads.write_dataset(table, path, format="parquet", filesystem=s3)
```
### Multiple Read Methods Tested
The test attempts reads using multiple PyArrow methods:
- `pq.read_table()` - Direct table reading
- `pq.ParquetDataset()` - Dataset-based reading
- `pads.dataset()` - PyArrow dataset API
All methods successfully read files written by either filesystem implementation.
## Practical Implications
### For Users
1. **Flexibility**: Users can choose either s3fs or PyArrow native S3 based on their preferences:
   - **s3fs**: More mature, widely used, familiar API
   - **PyArrow native**: Pure PyArrow solution, fewer dependencies
2. **Interoperability**: Teams using different tools can seamlessly share Parquet datasets stored in SeaweedFS
3. **Migration**: Easy to migrate between filesystem implementations without data conversion
### For SeaweedFS
1. **S3 Compatibility**: Confirms SeaweedFS's S3 implementation is compatible with major Python data science tools
2. **Implicit Directory Handling**: The implicit directory fix works correctly for both filesystem implementations
3. **Standard Compliance**: SeaweedFS handles S3 operations in a way that's compatible with AWS S3 behavior
## Running the Tests
### Quick Test (Recommended for Development)
```bash
cd test/s3/parquet
TEST_QUICK=1 make test-cross-fs-with-server
```
### Full Test (All File Sizes)
```bash
cd test/s3/parquet
make test-cross-fs-with-server
```
### Manual Test (Starting the Server Yourself)
```bash
cd test/s3/parquet
make setup-python
make start-seaweedfs-ci
# In another terminal
TEST_QUICK=1 make test-cross-fs
# Cleanup
make stop-seaweedfs-safe
```
## Environment Variables
The test supports customization through environment variables:
- `S3_ENDPOINT_URL`: S3 endpoint (default: `http://localhost:8333`)
- `S3_ACCESS_KEY`: Access key (default: `some_access_key1`)
- `S3_SECRET_KEY`: Secret key (default: `some_secret_key1`)
- `BUCKET_NAME`: Bucket name (default: `test-parquet-bucket`)
- `TEST_QUICK`: Run only small tests (default: `0`, set to `1` for quick mode)
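A minimal sketch of how a test script might resolve these settings, using the variable names and defaults listed above. The `load_config` helper and the conversion of `TEST_QUICK` to a boolean are assumptions for illustration, not the test's actual code.

```python
import os

# Variable names and defaults as documented in the list above.
DEFAULTS = {
    "S3_ENDPOINT_URL": "http://localhost:8333",
    "S3_ACCESS_KEY": "some_access_key1",
    "S3_SECRET_KEY": "some_secret_key1",
    "BUCKET_NAME": "test-parquet-bucket",
    "TEST_QUICK": "0",
}


def load_config(environ=None):
    """Return the effective settings, falling back to the documented defaults."""
    environ = os.environ if environ is None else environ
    config = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    # Quick mode is enabled only by the exact value "1" (assumed convention).
    config["TEST_QUICK"] = config["TEST_QUICK"] == "1"
    return config
```

Passing a mapping explicitly (for example `load_config({"TEST_QUICK": "1"})`) makes the lookup easy to exercise without touching the real environment.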
## Conclusion
The cross-filesystem compatibility tests demonstrate that **Parquet files written via s3fs and PyArrow native S3 filesystem are completely interchangeable**. This validates that:
1. The Parquet file format is implementation-agnostic
2. SeaweedFS's S3 API correctly handles both filesystem backends
3. Users have full flexibility in choosing their preferred filesystem implementation
This compatibility is a testament to:
- PyArrow's consistent file format generation
- SeaweedFS's robust S3 API implementation
- Proper handling of S3 semantics (especially implicit directories)
---
**Test Implementation**: `test_cross_filesystem_compatibility.py`
**Last Updated**: November 21, 2024
**Status**: ✅ All critical tests passing


@@ -1,58 +0,0 @@
# Final Root Cause Analysis
## Overview
This document provides a deep technical analysis of the s3fs compatibility issue with PyArrow Parquet datasets on SeaweedFS, and the solution implemented to resolve it.
## Root Cause
When PyArrow writes datasets using `write_dataset()`, it creates implicit directory structures by writing files without explicit directory markers. However, some S3 workflows may create 0-byte directory markers.
### The Problem
1. **PyArrow writes dataset files** without creating explicit directory objects
2. **s3fs calls HEAD** on the directory path to check if it exists
3. **If HEAD returns 200** with `Content-Length: 0`, s3fs interprets it as a file (not a directory)
4. **PyArrow fails** when trying to read, reporting "Parquet file size is 0 bytes"
### AWS S3 Behavior
AWS S3 returns **404 Not Found** for implicit directories (directories that exist only because they have children, with no explicit marker object). This allows s3fs to fall back to LIST operations to detect the directory.
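The client-side behavior described above can be modeled in a few lines. This is a simplified sketch of the decision, not s3fs's real implementation; the function name `classify_path` and its parameters are hypothetical.

```python
def classify_path(head_status, has_children):
    """Model how a client decides what a path is: 'file', 'directory', or 'missing'.

    head_status  -- HTTP status returned by HEAD on the path
    has_children -- whether a LIST under the path returns any keys
    """
    if head_status == 200:
        # A successful HEAD is taken at face value: the path is an object,
        # even if it is 0 bytes long and really stands for a directory.
        return "file"
    if head_status == 404 and has_children:
        # HEAD found nothing, so fall back to LIST; children imply a directory.
        return "directory"
    # Anything else is treated as missing in this simplified model.
    return "missing"
```

With this model, `classify_path(200, True)` yields `"file"`, which corresponds to the failure described above, while `classify_path(404, True)` yields `"directory"`, which is the AWS S3 behavior.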
## The Solution
### Implementation
Modified the S3 API HEAD handler in `weed/s3api/s3api_object_handlers.go` to:
1. **Check if object ends with `/`**: Explicit directory markers return 200 as before
2. **Check if object has children**: If a 0-byte object has children in the filer, treat it as an implicit directory
3. **Return 404 for implicit directories**: This matches AWS S3 behavior and triggers s3fs's LIST fallback
### Code Changes
The fix is implemented in the `HeadObjectHandler` function with logic to:
- Detect implicit directories by checking for child entries
- Return 404 (NoSuchKey) for implicit directories
- Preserve existing behavior for explicit directory markers and regular files
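The decision order described above can be sketched as a small Python model. The real handler is written in Go; the function below is only an illustration, and its name and parameters are assumptions rather than the actual `HeadObjectHandler` code.

```python
def head_status(key, exists, size, has_children):
    """Return the HTTP status a HEAD request would get under the described rules."""
    if not exists:
        return 404  # nothing stored under this key
    if key.endswith("/"):
        return 200  # explicit directory marker: unchanged behavior
    if size == 0 and has_children:
        return 404  # implicit directory: report NoSuchKey so clients fall back to LIST
    return 200      # regular file, including a genuinely empty one
```

Note that a 0-byte object without children still returns 200, so empty files remain readable.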
## Performance Considerations
### Optimization: Child Check Cache
- Child existence checks are performed via filer LIST operations
- Results could be cached for frequently accessed paths
- Trade-off between consistency and performance
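One way the caching idea above could look is a small time-to-live cache in front of the child check. This is a hypothetical sketch, not something the source says was implemented; the class name, the 5-second default, and the `lookup` callable (standing in for the filer LIST call) are all assumptions.

```python
import time


class ChildCheckCache:
    """Cache 'does this path have children?' answers for a short time."""

    def __init__(self, lookup, ttl_seconds=5.0, clock=time.monotonic):
        self._lookup = lookup    # callable(path) -> bool, e.g. a filer LIST wrapper
        self._ttl = ttl_seconds  # how long an answer stays valid
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # path -> (timestamp, result)

    def has_children(self, path):
        now = self._clock()
        cached = self._entries.get(path)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # fresh enough: skip the lookup
        result = self._lookup(path)
        self._entries[path] = (now, result)
        return result
```

A shorter TTL favors consistency (new children are noticed sooner), while a longer one saves more lookups, which is the trade-off noted above.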
### Impact
- Minimal performance impact for normal file operations
- Slight overhead for HEAD requests on implicit directories (one additional LIST call)
- Overall improvement in PyArrow compatibility outweighs minor performance cost
## TODO
- [ ] Add detailed benchmarking results comparing before/after fix
- [ ] Document edge cases discovered during implementation
- [ ] Add architectural diagrams showing the request flow
- [ ] Document alternative solutions considered and why they were rejected
- [ ] Add performance profiling data for child existence checks


@@ -1,70 +0,0 @@
# MinIO Directory Handling Comparison
## Overview
This document compares how MinIO handles directory markers versus SeaweedFS's implementation, and explains the different approaches to S3 directory semantics.
## MinIO's Approach
MinIO handles implicit directories similarly to AWS S3:
1. **No explicit directory objects**: Directories are implicit, defined only by object key prefixes
2. **HEAD on directory returns 404**: Consistent with AWS S3 behavior
3. **LIST operations reveal directories**: Directories are discovered through delimiter-based LIST operations
4. **Automatic prefix handling**: MinIO automatically recognizes prefixes as directories
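The idea of directories being discovered through delimiter-based LIST operations can be illustrated with a short sketch. It mimics how common prefixes are derived from a flat set of keys; the function name and the sample keys are hypothetical, and this is not MinIO's code.

```python
def list_directories(keys, prefix="", delimiter="/"):
    """Return the implicit 'directories' directly under prefix."""
    found = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The key continues past a delimiter, so everything up to and
            # including that delimiter behaves like a directory.
            found.add(prefix + head + delimiter)
    return sorted(found)


# Hypothetical bucket contents: no directory objects, only file keys.
sample_keys = ["data/part-0.parquet", "data/sub/part-1.parquet", "readme.txt"]
```

Here `list_directories(sample_keys)` reports `data/` even though no object named `data/` exists, which is what "implicit" means in this context.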
### MinIO Implementation Details
- Uses in-memory metadata for fast prefix lookups
- Optimized for LIST operations with common delimiter (`/`)
- No persistent directory objects in storage layer
- Directories "exist" as long as they contain objects
## SeaweedFS Approach
SeaweedFS uses a filer-based approach with real directory entries:
### Before the Fix
1. **Explicit directory objects**: Could create 0-byte objects as directory markers
2. **HEAD returns 200**: Even for implicit directories
3. **Caused s3fs issues**: s3fs interpreted 0-byte HEAD responses as empty files
### After the Fix
1. **Hybrid approach**: Supports both explicit markers (with `/` suffix) and implicit directories
2. **HEAD returns 404 for implicit directories**: Matches AWS S3 and MinIO behavior
3. **Filer integration**: Uses filer's directory metadata to detect implicit directories
4. **s3fs compatibility**: Triggers proper LIST fallback behavior
## Key Differences
| Aspect | MinIO | SeaweedFS (After Fix) |
|--------|-------|----------------------|
| Directory Storage | No persistent objects | Filer directory entries |
| Implicit Directory HEAD | 404 Not Found | 404 Not Found |
| Explicit Marker HEAD | Not applicable | 200 OK (with `/` suffix) |
| Child Detection | Prefix scan | Filer LIST operation |
| Performance | In-memory lookups | Filer gRPC calls |
## Implementation Considerations
### Advantages of SeaweedFS Approach
- Integrates with existing filer metadata
- Supports both implicit and explicit directories
- Preserves directory metadata and attributes
- Compatible with POSIX filer semantics
### Trade-offs
- Additional filer communication overhead for HEAD requests
- Complexity of supporting both directory paradigms
- Performance depends on filer efficiency
## TODO
- [ ] Add performance benchmark comparison: MinIO vs SeaweedFS
- [ ] Document edge cases where behaviors differ
- [ ] Add example request/response traces for both systems
- [ ] Document migration path for users moving from MinIO to SeaweedFS
- [ ] Add compatibility matrix for different S3 clients


@@ -4,14 +4,9 @@
# Default values
SEAWEEDFS_BINARY ?= weed
S3_PORT ?= 8333
FILER_PORT ?= 8888
VOLUME_PORT ?= 8080
MASTER_PORT ?= 9333
TEST_TIMEOUT ?= 15m
ACCESS_KEY ?= some_access_key1
SECRET_KEY ?= some_secret_key1
VOLUME_MAX_SIZE_MB ?= 50
VOLUME_MAX_COUNT ?= 100
BUCKET_NAME ?= test-parquet-bucket
ENABLE_SSE_S3 ?= false
@@ -68,11 +63,7 @@ help:
@echo "Configuration:"
@echo " SEAWEEDFS_BINARY=$(SEAWEEDFS_BINARY)"
@echo " S3_PORT=$(S3_PORT)"
@echo " FILER_PORT=$(FILER_PORT)"
@echo " VOLUME_PORT=$(VOLUME_PORT)"
@echo " MASTER_PORT=$(MASTER_PORT)"
@echo " BUCKET_NAME=$(BUCKET_NAME)"
@echo " VOLUME_MAX_SIZE_MB=$(VOLUME_MAX_SIZE_MB)"
@echo " ENABLE_SSE_S3=$(ENABLE_SSE_S3)"
@echo " PYTHON=$(PYTHON)"
@@ -106,39 +97,25 @@ setup-python: check-python
start-seaweedfs-ci: check-binary
@echo "$(YELLOW)Starting SeaweedFS server for Parquet testing...$(NC)"
# Clean up any existing processes first (CI-safe)
@echo "Cleaning up any existing processes..."
# Clean up any existing processes first (CI-safe) - aggressive cleanup
@echo "Aggressively cleaning up any existing processes on S3 port $(S3_PORT) and master port 9333..."
@if command -v lsof >/dev/null 2>&1; then \
lsof -ti :$(MASTER_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$(VOLUME_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$(FILER_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$(S3_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$$(( $(MASTER_PORT) + 10000 )) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$$(( $(VOLUME_PORT) + 10000 )) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$$(( $(FILER_PORT) + 10000 )) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
for attempt in 1 2 3; do \
lsof -ti :$(S3_PORT) 2>/dev/null | head -5 | while read pid; do kill -9 $$pid 2>/dev/null || true; done; \
lsof -ti :9333 2>/dev/null | head -5 | while read pid; do kill -9 $$pid 2>/dev/null || true; done; \
sleep 1; \
done; \
fi
@sleep 2
@sleep 3
# Create necessary directories
@mkdir -p /tmp/seaweedfs-test-parquet-master
@mkdir -p /tmp/seaweedfs-test-parquet-volume
@mkdir -p /tmp/seaweedfs-test-parquet-filer
@mkdir -p /tmp/seaweedfs-test-parquet
# Clean up any old server logs
@rm -f /tmp/seaweedfs-parquet-*.log || true
# Start master server with volume size limit and explicit gRPC port
@echo "Starting master server..."
@nohup $(SEAWEEDFS_BINARY) master -port=$(MASTER_PORT) -port.grpc=$$(( $(MASTER_PORT) + 10000 )) -mdir=/tmp/seaweedfs-test-parquet-master -volumeSizeLimitMB=$(VOLUME_MAX_SIZE_MB) -ip=127.0.0.1 -peers=none > /tmp/seaweedfs-parquet-master.log 2>&1 &
@sleep 3
# Start volume server with master HTTP port and increased capacity
@echo "Starting volume server..."
@nohup $(SEAWEEDFS_BINARY) volume -port=$(VOLUME_PORT) -master=127.0.0.1:$(MASTER_PORT) -dir=/tmp/seaweedfs-test-parquet-volume -max=$(VOLUME_MAX_COUNT) -ip=127.0.0.1 -preStopSeconds=1 > /tmp/seaweedfs-parquet-volume.log 2>&1 &
@sleep 5
# Start filer server with embedded S3
@echo "Starting filer server with embedded S3..."
# Start weed mini with embedded S3
@echo "Starting weed mini with embedded S3..."
@if [ "$(ENABLE_SSE_S3)" = "true" ]; then \
echo " SSE-S3 encryption: ENABLED"; \
printf '{"identities":[{"name":"%s","credentials":[{"accessKey":"%s","secretKey":"%s"}],"actions":["Admin","Read","Write"]}],"buckets":[{"name":"$(BUCKET_NAME)","encryption":{"sseS3":{"enabled":true}}}]}' "$(ACCESS_KEY)" "$(ACCESS_KEY)" "$(SECRET_KEY)" > /tmp/seaweedfs-parquet-s3.json; \
@@ -146,96 +123,43 @@ start-seaweedfs-ci: check-binary
echo " SSE-S3 encryption: DISABLED"; \
printf '{"identities":[{"name":"%s","credentials":[{"accessKey":"%s","secretKey":"%s"}],"actions":["Admin","Read","Write"]}]}' "$(ACCESS_KEY)" "$(ACCESS_KEY)" "$(SECRET_KEY)" > /tmp/seaweedfs-parquet-s3.json; \
fi
@AWS_ACCESS_KEY_ID=$(ACCESS_KEY) AWS_SECRET_ACCESS_KEY=$(SECRET_KEY) nohup $(SEAWEEDFS_BINARY) filer -port=$(FILER_PORT) -port.grpc=$$(( $(FILER_PORT) + 10000 )) -master=127.0.0.1:$(MASTER_PORT) -dataCenter=defaultDataCenter -ip=127.0.0.1 -s3 -s3.port=$(S3_PORT) -s3.config=/tmp/seaweedfs-parquet-s3.json > /tmp/seaweedfs-parquet-filer.log 2>&1 &
@sleep 5
# Wait for S3 service to be ready - use port-based checking for reliability
@echo "$(YELLOW)Waiting for S3 service to be ready...$(NC)"
@for i in $$(seq 1 20); do \
if netstat -an 2>/dev/null | grep -q ":$(S3_PORT).*LISTEN" || \
ss -an 2>/dev/null | grep -q ":$(S3_PORT).*LISTEN" || \
lsof -i :$(S3_PORT) >/dev/null 2>&1; then \
echo "$(GREEN)S3 service is listening on port $(S3_PORT)$(NC)"; \
sleep 1; \
break; \
@$(SEAWEEDFS_BINARY) mini \
-dir=/tmp/seaweedfs-test-parquet \
-ip.bind=0.0.0.0 \
-s3.port=$(S3_PORT) \
-s3.config=/tmp/seaweedfs-parquet-s3.json \
> /tmp/seaweedfs-parquet-mini.log 2>&1 & echo $$! > /tmp/weed-mini.pid
@echo "Waiting for S3 service to be fully ready (max 90 seconds)..."
@bash -c 'for i in $$(seq 1 90); do \
if curl -s -H "Authorization: AWS4-HMAC-SHA256 Credential=$(ACCESS_KEY)" http://localhost:$(S3_PORT)/ > /dev/null 2>&1; then \
echo "✅ S3 service is ready"; \
sleep 2; \
exit 0; \
fi; \
if [ $$i -eq 20 ]; then \
echo "$(RED)S3 service failed to start within 20 seconds$(NC)"; \
echo "=== Detailed Logs ==="; \
echo "Master log:"; tail -30 /tmp/seaweedfs-parquet-master.log || true; \
echo "Volume log:"; tail -30 /tmp/seaweedfs-parquet-volume.log || true; \
echo "Filer log:"; tail -30 /tmp/seaweedfs-parquet-filer.log || true; \
echo "=== Port Status ==="; \
netstat -an 2>/dev/null | grep ":$(S3_PORT)" || \
ss -an 2>/dev/null | grep ":$(S3_PORT)" || \
echo "No port listening on $(S3_PORT)"; \
exit 1; \
fi; \
echo "Waiting for S3 service... ($$i/20)"; \
sleep 1; \
done
# Additional wait for filer gRPC to be ready
@echo "$(YELLOW)Waiting for filer gRPC to be ready...$(NC)"
@sleep 2
# Wait for volume server to register with master and ensure volume assignment works
@echo "$(YELLOW)Waiting for volume assignment to be ready...$(NC)"
@for i in $$(seq 1 30); do \
ASSIGN_RESULT=$$(curl -s "http://localhost:$(MASTER_PORT)/dir/assign?count=1" 2>/dev/null); \
if echo "$$ASSIGN_RESULT" | grep -q '"fid"'; then \
echo "$(GREEN)Volume assignment is ready$(NC)"; \
break; \
fi; \
if [ $$i -eq 30 ]; then \
echo "$(RED)Volume assignment not ready after 30 seconds$(NC)"; \
echo "=== Last assign attempt ==="; \
echo "$$ASSIGN_RESULT"; \
echo "=== Master Status ==="; \
curl -s "http://localhost:$(MASTER_PORT)/dir/status" 2>/dev/null || echo "Failed to get master status"; \
echo "=== Master Logs ==="; \
tail -50 /tmp/seaweedfs-parquet-master.log 2>/dev/null || echo "No master log"; \
echo "=== Volume Logs ==="; \
tail -50 /tmp/seaweedfs-parquet-volume.log 2>/dev/null || echo "No volume log"; \
exit 1; \
fi; \
echo "Waiting for volume assignment... ($$i/30)"; \
sleep 1; \
done
@echo "$(GREEN)SeaweedFS server started successfully for Parquet testing$(NC)"
@echo "Master: http://localhost:$(MASTER_PORT)"
@echo "Volume: http://localhost:$(VOLUME_PORT)"
@echo "Filer: http://localhost:$(FILER_PORT)"
@echo "S3: http://localhost:$(S3_PORT)"
@echo "Volume Max Size: $(VOLUME_MAX_SIZE_MB)MB"
done; \
echo "ERROR S3 service failed to start within 90 seconds"; \
echo "=== Server log output ==="; \
cat /tmp/seaweedfs-parquet-mini.log 2>/dev/null || echo "No startup log available"; \
exit 1'
start-seaweedfs: check-binary
@echo "$(YELLOW)Starting SeaweedFS server for Parquet testing...$(NC)"
@# Use port-based cleanup for consistency and safety
@echo "Cleaning up any existing processes..."
@lsof -ti :$(MASTER_PORT) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$(VOLUME_PORT) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$(FILER_PORT) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$(S3_PORT) 2>/dev/null | xargs -r kill -TERM || true
@# Clean up gRPC ports (HTTP port + 10000)
@lsof -ti :$$(( $(MASTER_PORT) + 10000 )) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$$(( $(VOLUME_PORT) + 10000 )) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$$(( $(FILER_PORT) + 10000 )) 2>/dev/null | xargs -r kill -TERM || true
@sleep 2
@$(MAKE) start-seaweedfs-ci
stop-seaweedfs:
@echo "$(YELLOW)Stopping SeaweedFS server...$(NC)"
@# Use port-based cleanup for consistency and safety
@lsof -ti :$(MASTER_PORT) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$(VOLUME_PORT) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$(FILER_PORT) 2>/dev/null | xargs -r kill -TERM || true
@if [ -f /tmp/weed-mini.pid ]; then \
echo "Stopping weed mini..."; \
kill $$(cat /tmp/weed-mini.pid) || true; \
rm -f /tmp/weed-mini.pid; \
fi
@lsof -ti :$(S3_PORT) 2>/dev/null | xargs -r kill -TERM || true
@# Clean up gRPC ports (HTTP port + 10000)
@lsof -ti :$$(( $(MASTER_PORT) + 10000 )) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$$(( $(VOLUME_PORT) + 10000 )) 2>/dev/null | xargs -r kill -TERM || true
@lsof -ti :$$(( $(FILER_PORT) + 10000 )) 2>/dev/null | xargs -r kill -TERM || true
@sleep 2
@echo "$(GREEN)SeaweedFS server stopped$(NC)"
@@ -245,22 +169,10 @@ stop-seaweedfs-safe:
@# Use port-based cleanup which is safer in CI
@if command -v lsof >/dev/null 2>&1; then \
echo "Using lsof for port-based cleanup..."; \
lsof -ti :$(MASTER_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$(VOLUME_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$(FILER_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$(S3_PORT) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$$(( $(MASTER_PORT) + 10000 )) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$$(( $(VOLUME_PORT) + 10000 )) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
lsof -ti :$$(( $(FILER_PORT) + 10000 )) 2>/dev/null | head -5 | while read pid; do kill -TERM $$pid 2>/dev/null || true; done; \
else \
echo "lsof not available, using netstat approach..."; \
netstat -tlnp 2>/dev/null | grep :$(MASTER_PORT) | awk '{print $$7}' | cut -d/ -f1 | head -5 | while read pid; do [ "$$pid" != "-" ] && kill -TERM $$pid 2>/dev/null || true; done; \
netstat -tlnp 2>/dev/null | grep :$(VOLUME_PORT) | awk '{print $$7}' | cut -d/ -f1 | head -5 | while read pid; do [ "$$pid" != "-" ] && kill -TERM $$pid 2>/dev/null || true; done; \
netstat -tlnp 2>/dev/null | grep :$(FILER_PORT) | awk '{print $$7}' | cut -d/ -f1 | head -5 | while read pid; do [ "$$pid" != "-" ] && kill -TERM $$pid 2>/dev/null || true; done; \
netstat -tlnp 2>/dev/null | grep :$(S3_PORT) | awk '{print $$7}' | cut -d/ -f1 | head -5 | while read pid; do [ "$$pid" != "-" ] && kill -TERM $$pid 2>/dev/null || true; done; \
netstat -tlnp 2>/dev/null | grep :$$(( $(MASTER_PORT) + 10000 )) | awk '{print $$7}' | cut -d/ -f1 | head -5 | while read pid; do [ "$$pid" != "-" ] && kill -TERM $$pid 2>/dev/null || true; done; \
netstat -tlnp 2>/dev/null | grep :$$(( $(VOLUME_PORT) + 10000 )) | awk '{print $$7}' | cut -d/ -f1 | head -5 | while read pid; do [ "$$pid" != "-" ] && kill -TERM $$pid 2>/dev/null || true; done; \
netstat -tlnp 2>/dev/null | grep :$$(( $(FILER_PORT) + 10000 )) | awk '{print $$7}' | cut -d/ -f1 | head -5 | while read pid; do [ "$$pid" != "-" ] && kill -TERM $$pid 2>/dev/null || true; done; \
fi
@sleep 2
@echo "$(GREEN)SeaweedFS server safely stopped$(NC)"
@@ -351,18 +263,14 @@ test-implicit-dir-with-server: build-weed setup-python
# Debug targets
debug-logs:
@echo "$(YELLOW)=== Master Log ===$(NC)"
@tail -n 50 /tmp/seaweedfs-parquet-master.log || echo "No master log found"
@echo "$(YELLOW)=== Volume Log ===$(NC)"
@tail -n 50 /tmp/seaweedfs-parquet-volume.log || echo "No volume log found"
@echo "$(YELLOW)=== Filer Log ===$(NC)"
@tail -n 50 /tmp/seaweedfs-parquet-filer.log || echo "No filer log found"
@echo "$(YELLOW)=== Mini Log ===$(NC)"
@tail -n 50 /tmp/seaweedfs-parquet-mini.log || echo "No mini log found"
debug-status:
@echo "$(YELLOW)=== Process Status ===$(NC)"
@ps aux | grep -E "(weed|seaweedfs)" | grep -v grep || echo "No SeaweedFS processes found"
@echo "$(YELLOW)=== Port Status ===$(NC)"
@netstat -an | grep -E "($(MASTER_PORT)|$(VOLUME_PORT)|$(FILER_PORT)|$(S3_PORT))" || echo "No ports in use"
@netstat -an | grep -E "($(S3_PORT))" || echo "No ports in use"
# Manual test targets for development
manual-start: start-seaweedfs


@@ -1,46 +0,0 @@
# Test Coverage Documentation
## Overview
This document provides comprehensive test coverage documentation for the SeaweedFS S3 Parquet integration tests.
## Test Categories
### Unit Tests (Go)
- 17 test cases covering S3 API handlers
- Tests for implicit directory handling
- HEAD request behavior validation
- Located in: `weed/s3api/s3api_implicit_directory_test.go`
### Integration Tests (Python)
- 6 test cases for implicit directory fix
- Tests HEAD request behavior on directory markers
- s3fs directory detection validation
- PyArrow dataset read compatibility
- Located in: `test_implicit_directory_fix.py`
### End-to-End Tests (Python)
- 20 test cases combining write and read methods
- Small file tests (5 rows): 10 test combinations
- Large file tests (200,000 rows): 10 test combinations
- Tests multiple write methods: `pads.write_dataset`, `pq.write_table+s3fs`
- Tests multiple read methods: `pads.dataset`, `pq.ParquetDataset`, `pq.read_table`, `s3fs+direct`, `s3fs+buffered`
- Located in: `s3_parquet_test.py`
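The case counts above follow from a simple cross product: 2 write methods × 5 read methods = 10 combinations per file size, and 2 file sizes give 20. The sketch below reproduces that arithmetic; the method labels and row counts come from the list above, while the helper name and the way `s3_parquet_test.py` actually builds its matrix are assumptions.

```python
from itertools import product

WRITE_METHODS = ["pads.write_dataset", "pq.write_table+s3fs"]
READ_METHODS = [
    "pads.dataset",
    "pq.ParquetDataset",
    "pq.read_table",
    "s3fs+direct",
    "s3fs+buffered",
]
FILE_SIZES = {"small": 5, "large": 200_000}  # label -> row count


def build_test_matrix():
    """Enumerate every (size, write method, read method) combination."""
    return list(product(FILE_SIZES, WRITE_METHODS, READ_METHODS))


matrix = build_test_matrix()
small_cases = [case for case in matrix if case[0] == "small"]
```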
## Coverage Summary
| Test Type | Count | Status |
|-----------|-------|--------|
| Unit Tests (Go) | 17 | ✅ Pass |
| Integration Tests (Python) | 6 | ✅ Pass |
| End-to-End Tests (Python) | 20 | ✅ Pass |
| **Total** | **43** | **✅ All Pass** |
## TODO
- [ ] Add detailed test execution time metrics
- [ ] Document test data generation strategies
- [ ] Add code coverage percentages for Go tests
- [ ] Document edge cases and corner cases tested
- [ ] Add performance benchmarking results


@@ -60,6 +60,7 @@ def setup_s3():
endpoint_url=S3_ENDPOINT_URL,
aws_access_key_id=S3_ACCESS_KEY,
aws_secret_access_key=S3_SECRET_KEY,
region_name='us-east-1',
use_ssl=False
)