diff --git a/fusion_docs/sidebar.json b/fusion_docs/sidebar.json index 037b2813d..ecef61b81 100644 --- a/fusion_docs/sidebar.json +++ b/fusion_docs/sidebar.json @@ -39,7 +39,16 @@ }, "licensing", "reference", - "troubleshooting", + { + "type": "category", + "label": "Troubleshooting", + "collapsed": true, + "items": [ + "troubleshooting/general", + "troubleshooting/fusion-snapshots", + "troubleshooting/error-codes-exit-messages" + ] + }, "faq", { "type": "link", diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md new file mode 100644 index 000000000..8c86db2e7 --- /dev/null +++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md @@ -0,0 +1,420 @@ +--- +title: Error codes and exit messages +description: "Reference for Fusion error codes, exit codes, and error messages" +date created: "2025-01-12" +last updated: "2025-01-20" +tags: [errors, error-codes, exit-codes, fuse, logging, fusion] +--- + +This page describes Fusion's error reporting system, including exit codes, FUSE status codes (errno values), cloud provider error categories, and internal error types. + +## Error paths + +Fusion is a FUSE filesystem that bridges applications and cloud object stores. As such, errors may originate from multiple layers, but will propagate through the filesystem components following three major paths: + +1. **Cloud > Storage Backend > FUSE Layer > Kernel > Application** + + - Errors from the cloud provider (e.g. network timeouts, auth failures, rate limits) are captured by the Storage backend, which normalizes them into provider-agnostic categories (see #cloud-provider-error-categories). + - Storage backends return normalized cloud errors (with provider-agnostic categories) or internal errors (`ErrNotFound`, `ErrReadOnly`, etc.) + - The FUSE layer maps both cloud errors and internal errors to FUSE status codes (e.g., `ENOENT`, `EACCES`, `EREMOTEIO`, `EIO`) + - The kernel translates FUSE status to errno values for the application + - Fusion logs cloud errors with structured details (provider, error code, HTTP status, request ID) + +1. **Failures during startup/shutdown → Exit Code** + + - Startup: Configuration errors, missing credentials, or mount failures terminate Fusion immediately + - Shutdown: Async uploads or consolidation of pending operations + - Failures surface as exit code `174` (Fusion I/O error) or `1` (fatal error) + +1. **Background Operations → Logs** + + - Async uploads during normal operation, cache eviction, and snapshot operations log errors but may not surface them to applications + - Errors are reported in Fusion (see [Understanding Fusion logs](#understanding-fusion-logs)) + +## Triaging errors + +When troubleshooting Fusion errors: + +1. Check the exit code: + - Check the process exit code (`$?`) to understand if Fusion terminated normally (`0`), encountered an I/O error (`174`), or had a command issue (`127`). +1. Look at FUSE status in the logs: + - If a filesystem operation failed, use the logs to identify the FUSE status code (e.g., `ENOENT`, `EREMOTEIO`, `EIO`) returned to the application. +1. Check for cloud error fields: + - If you see `EREMOTEIO` or cloud-related failures, identify the specific cloud error fields in the logs: + - `provider` + - `provider_code` + - `provider_http_status` + - `provider_request_id` + + :::note + The field `error_code` indicates Fusion's internal categorization of the cloud error normalized across providers (e.g., `ResourceNotFound`, `Forbidden`, `RateLimited`). + ::: + +1. Identify the mapped internal error: + - The FUSE status code maps back to either a cloud error category or a specific internal error (e.g., `EACCES` indicates an authentication problem, `EREMOTEIO` indicates a cloud backend issue). Check the Fusion logs for more details on the error that triggered the FUSE status code (see [Understanding Fusion logs](#understanding-fusion-logs)). + +:::tip +Enable `debug` logging for the full log: + +```bash +export FUSION_LOG_LEVEL=debug +``` + +::: + +## Exit codes + +Fusion binaries return specific exit codes to indicate the outcome of execution. + +:::tip +For exit codes `175` and `176`, see [Fusion Snapshots](./fusion-snapshots.md). +::: + +### Fusion binary + +| Exit code | Constant | Description | +|-----------|----------|-------------| +| `0` | - | Success, normal completion. | +| `1` | - | Fatal error during startup (via `log.Fatal()`). | +| `127` | - | Command not found (`.command.sh` missing). Triggers automatic retry up to `FUSION_MAX_MOUNT_RETRIES` times. | +| `174` | `ErrorExitCode` | Fusion I/O error, application-level input/output error. | + +:::note +`log.Fatal()` calls during startup produce exit code `1`. See [Fatal error messages](#fatal-error-messages) for the specific messages that trigger this exit. +::: + +The `sysexits.h` standard uses exit code 74 for "input/output error" and reserves 150-199 for application use. In Fusion's context, 174 means "application input/output error". + +| Scenario | Log cue | Suggested next step | +|----------|----------------|---------------------| +| Failed to start FUSE process in background | `on FUSE process` | Check FUSE/kernel support. Verify `/dev/fuse` exists. | +| Failed to send SIGTERM to FUSE process | `on FUSE sigterm send` | Check kernel logs (`dmesg`) for crashed processes. | +| Failed to wait for FUSE process termination | `on FUSE stop wait` | Check for zombie processes. Review kernel signal handling. | +| Error during filesystem shutdown | `on file system shutdown` | Check Fusion logs for pending upload errors. See [Understanding Fusion logs](#understanding-fusion-logs). | +| Error during filesystem unmount | `on file system unmount` | Run `fusermount -u /fusion` or `umount -l /fusion` manually. | +| Failed read/write path validation | `check-rw` or `check-ro` | Verify cloud credentials and bucket permissions. | + +### GPU tracer binary + +| Exit code | Meaning | When | +|-----------|---------|------| +| `0` | Success | Normal completion (GPU detected or not) | +| `1` | Error | Failed to start GPU monitoring | +| `2` | Invalid input | Missing PID, invalid PID format, or PID `<= 0` | + +## FUSE status codes + +Fusion maps internal errors to standard FUSE status codes returned to the operating system. These are the [errno](https://man7.org/linux/man-pages/man3/errno.3.html) values applications receive when filesystem operations fail. + +:::note +For a complete list of errno values and their meanings, see the [Linux errno man page](https://man7.org/linux/man-pages/man3/errno.3.html) or run `errno -l` on a Linux system. +::: + +### Returned status codes + +Fusion's filesystem implementation actively returns these status codes: + +| FUSE status | Errno | Description | Common causes in Fusion | +|------------------|-------|---------------------------|-------------------------| +| `fuse.OK` | 0 | Success | Operation completed successfully | +| `fuse.ENOENT` | 2 | No such file or directory | File/entry not found in cache or remote store; cloud provider ResourceNotFound/ContainerNotFound errors | +| `fuse.EINTR` | 4 | Interrupted system call | Context cancelled | +| `fuse.EIO` | 5 | I/O error | General I/O errors, internal failures, remote store errors, unknown non-cloud errors | +| `fuse.EACCES` | 13 | Permission denied | Write attempt to read-only path; cloud provider Unauthenticated/InvalidCredentials/Forbidden/AccountError errors | +| `fuse.EBUSY` | 16 | Device or resource busy | Cloud provider RateLimited/Busy/ResourceArchived errors | +| `fuse.EEXIST` | 17 | File exists | Cloud provider Conflict errors (e.g., resource already exists) | +| `fuse.EINVAL` | 22 | Invalid argument | Invalid parameters (e.g., readlink on non-symlink) | +| `fuse.EROFS` | 30 | Read-only file system | Attempt to modify read-only object | +| `fuse.ERANGE` | 34 | Result too large | Buffer too small for xattr value | +| `fuse.ENOSYS` | 38 | Function not implemented | Operation not wired in Fusion's FUSE layer | +| `fuse.ENOATTR` | 93 | No such attribute | Extended attribute not found | +| `fuse.ENOTSUP` | 95 | Operation not supported | Operation explicitly rejected (for example, hard links) | +| `fuse.ETIMEDOUT` | 110 | Connection timed out | Context deadline exceeded | +| `fuse.EREMOTEIO` | 121 | Remote I/O error | Cloud provider errors (QuotaExceeded, unknown cloud errors) | + +### Troubleshooting FUSE status codes + +When you encounter a FUSE status code, use the following table to identify likely causes and next steps: + +| Status | Likely causes and troubleshooting steps | +|--------|----------------------------------------| +| `ENOENT` | Path typo or object deleted from remote store. Check if the path exists using your cloud provider's CLI (`aws s3 ls`, `gsutil ls`, `az storage blob list`). | +| `EACCES` | Mount configured as read-only, object ACL blocking writes, or authentication/permission issues. Check cloud IAM permissions and credentials. | +| `EEXIST` | Resource already exists in cloud storage. Check if the operation was retried or if there's a naming conflict. | +| `EIO` | General I/O error or unknown internal failure. Check Fusion logs for the underlying cause. See [Understanding Fusion logs](#understanding-fusion-logs). | +| `EREMOTEIO` | Cloud provider error. Check Fusion logs for detailed cloud error information (provider, error code, HTTP status, request ID). May indicate quota exceeded, rate limiting, or other cloud-specific issues. See [Understanding Fusion logs](#understanding-fusion-logs). | +| `EBUSY` | Cloud provider rate limiting requests or temporarily busy. Retry with backoff. Check cloud provider dashboard for service status. | +| `ETIMEDOUT` | Operation timed out due to network connectivity issues or slow cloud response. Check network connection and cloud service status. | +| `EINTR` | Caller cancelled the operation. Usually safe to retry. | +| `ENOTSUP` | Unsupported operation. Adjust workload to avoid hard links. Use symlinks or copies instead. | +| `ENOSYS` | Operation not implemented in Fusion. Check if the operation is supported or use an alternative approach. | + +### ENOSYS vs ENOTSUP + +Both indicate an operation cannot be performed, but they have different meanings: + +- **`ENOSYS` (Function not implemented)**: The operation is not implemented in Fusion's FUSE layer. This is the default response for operations Fusion doesn't handle. + +- **`ENOTSUP` (Operation not supported)**: The operation exists in Fusion but is explicitly rejected for specific cases. For example: + - **Hard links (`Link`)**: Fusion explicitly returns `ENOTSUP` because hard links cannot be meaningfully represented on object storage backends. + - **Whiteout character device creation**: During overlay-style renames, if creating the whiteout marker fails, `ENOTSUP` signals this specific failure. + +:::tip +If you encounter `ENOTSUP` on hard links, use symbolic links (`ln -s`) or file copies instead. +::: + +### EREMOTEIO vs EIO + +Fusion distinguishes between local I/O failures and cloud provider errors: + +- **`EREMOTEIO` (Remote I/O error)**: Used specifically for cloud provider errors. This errno value indicates that: + - The error originated from a remote cloud storage system (S3, Azure Blob Storage, or Google Cloud Storage). + - The failure is due to cloud provider issues (quota exceeded, rate limiting, service unavailable). + - Debugging should focus on cloud provider logs and status, not local system issues. + - The request ID from logs can be provided to cloud support for investigation. + +- **`EIO` (I/O error)**: Used as a generic catch-all for: + - Unknown internal errors that are not cloud-related. + - Local filesystem or system failures. + - Errors that cannot be classified into more specific categories. + +:::note +Using `EREMOTEIO` for cloud errors provides more accurate error context, making it easier to distinguish between local system issues and cloud service problems during troubleshooting and monitoring. +::: + +:::tip +When you see `EREMOTEIO`, check the Fusion logs for cloud error fields: `provider`, `error_code`, `provider_code`, `provider_http_status`, and `provider_request_id`. See [Understanding Fusion logs](#understanding-fusion-logs). +::: + +### Error mapping + +Fusion maps cloud provider errors and internal errors to FUSE status codes. + +#### Cloud provider error mapping + +Cloud provider errors are normalized and mapped to appropriate FUSE status codes: + +| Cloud error category | FUSE status | Examples | +|---------------------|-------------|----------| +| `Unauthenticated` | `fuse.EACCES` | No credentials provided | +| `InvalidCredentials` | `fuse.EACCES` | Wrong, malformed, or expired credentials | +| `Forbidden` | `fuse.EACCES` | Valid credentials, insufficient permissions | +| `AccountError` | `fuse.EACCES` | Account disabled, suspended, or has billing issues | +| `ResourceNotFound` | `fuse.ENOENT` | S3 "NoSuchKey", Azure "BlobNotFound", GCS 404 | +| `ContainerNotFound` | `fuse.ENOENT` | S3 "NoSuchBucket", Azure "ContainerNotFound", GCS 404 with bucket error | +| `RateLimited` | `fuse.EBUSY` | Request rate limits exceeded | +| `Busy` | `fuse.EBUSY` | Service temporarily unavailable or overloaded | +| `ResourceArchived` | `fuse.EBUSY` | Resource in archived/transitional state (for example, Glacier) | +| `Conflict` | `fuse.EEXIST` | Resource already exists or precondition failed | +| `InvalidArgument` | `fuse.EINVAL` | Malformed request or invalid parameters | +| `QuotaExceeded` | `fuse.EREMOTEIO` | Storage quota or capacity limit reached | +| `Unknown` (cloud errors) | `fuse.EREMOTEIO` | Unclassified cloud provider errors | + +#### Internal error mapping + +| Internal error | FUSE status | +|---------------|-------------| +| Not found | `fuse.ENOENT` | +| Read-only | `fuse.EROFS` | +| Unsupported | `fuse.ENOSYS` | +| Context cancelled | `fuse.EINTR` | +| Context deadline exceeded | `fuse.ETIMEDOUT` | +| Other errors | `fuse.EIO` | + +## Cloud provider error categories + +Fusion normalizes errors from different cloud storage providers (S3, Azure Blob Storage, Google Cloud Storage) into consistent categories. When you see an `error_code` field in Fusion logs, it represents one of these categories: + +| Category | Description | Common provider codes | +|----------|-------------|----------------------| +| `ResourceNotFound` | Requested resource (object/file) does not exist | S3: "NoSuchKey", Azure: "BlobNotFound", GCS: HTTP 404 | +| `ContainerNotFound` | Storage container (bucket) does not exist | S3: "NoSuchBucket", Azure: "ContainerNotFound", GCS: HTTP 404 with bucket error | +| `Unauthenticated` | No credentials provided | S3: "MissingSecurityHeader", GCS: HTTP 401 with no credentials | +| `InvalidCredentials` | Credentials provided but wrong, malformed, or expired | S3: "InvalidAccessKeyId", "ExpiredToken", Azure: "InvalidAuthenticationInfo", GCS: HTTP 401 with invalid credentials | +| `Forbidden` | Valid credentials but insufficient permissions | S3: "AccessDenied", Azure: "AuthorizationPermissionMismatch", GCS: HTTP 403 | +| `AccountError` | Account-level problems (disabled, suspended, billing issues) | S3: "AccountProblem", Azure: "AccountIsDisabled", GCS: HTTP 403 with specific messages | +| `ResourceArchived` | Resource exists but is in archived/transitional state | S3: "InvalidObjectState" (Glacier), Azure: "BlobArchived" | +| `RateLimited` | Request rate limits exceeded | S3: "SlowDown", Azure: "TooManyRequests", GCS: HTTP 429 | +| `Busy` | Service temporarily unavailable or overloaded | S3: "ServiceUnavailable", "InternalError", Azure: "ServerBusy", GCS: HTTP 503 | +| `Conflict` | Resource state conflict or precondition failure | S3: "BucketAlreadyExists", Azure: "BlobAlreadyExists", "ConditionNotMet", GCS: HTTP 409/412 | +| `InvalidArgument` | Malformed request or invalid parameters | S3: "InvalidArgument", "InvalidRange", Azure: "InvalidQueryParameterValue", GCS: HTTP 400 | +| `QuotaExceeded` | Storage quota or capacity limit reached | S3: "TooManyBuckets", Azure: "AccountLimitExceeded", GCS: HTTP 429 with quota message | +| `Unknown` | Unclassified or unexpected error | Various | + +## Fatal error messages + +These messages indicate Fusion terminated immediately with exit code 1. They occur during startup or critical failures: + +| Message | Cause | +|---------|-------| +| `configuring fusion` | Failed to configure Fusion (invalid config, missing environment variables) | +| `building remote store options` | Failed to build remote store options | +| `creating metadata store` | Failed to create metadata store | +| `creating data store` | Failed to create data store connection | +| `validating work path` | Work path validation failed (empty prefix or connection error) | +| `creating filesystem` | Failed to create FUSE filesystem | +| `mounting filesystem` | Failed to mount FUSE filesystem | +| `could not get current job attempt` | Failed to get job attempt from compute environment | + +## Understanding Fusion logs + +Fusion emits logs in two formats: + +- **Console logs** (stderr): Timestamped, human-readable format with `[seqera-fusion]` prefix. These logs are collected by Seqera Platform and shown in the UI. They provide immediate visibility during runtime. +- **File logs** (`${workdir}/.fusion.log`): Structured logs in JSON format for detailed analysis and troubleshooting. + +:::note +The `[seqera-fusion]` prefix for console logs was introduced in Fusion v2.6, v2.5.9, and v2.4.20. +::: + +### Log fields reference + +Fusion uses structured logging with consistent field names. Understanding these fields is essential for troubleshooting. + +#### Cloud error fields + +These fields appear automatically when a cloud provider error is detected: + +| Field | Description | Example values | +|-------|-------------|----------------| +| `provider` | Cloud provider name | `s3`, `azure`, `gcs` | +| `error_code` | Normalized error category (provider-agnostic) | `Forbidden`, `ResourceNotFound`, `InvalidCredentials` | +| `provider_code` | Provider-specific error code | S3: `NoSuchKey`, Azure: `BlobNotFound`, GCS: `invalid` | +| `provider_http_status` | HTTP status code from cloud provider | `403`, `404`, `429`, `500` | +| `provider_request_id` | Request ID for cloud provider support tickets | `ABCD1234EXAMPLE`, `b8e8a1f5-...` | +| `provider_error` | Original error message from cloud provider | `The specified key does not exist.` | + +:::tip +When opening support tickets with cloud providers, always include the `provider_request_id` from logs. This enables their support team to trace the exact request in their systems. +::: + +#### Common operational fields + +These fields appear in most log entries to provide operation context: + +| Field | Description | Example values | +|-------|-------------|----------------| +| `path` | Filesystem path where operation occurred | `/fusion/s3/bucket/file.txt`, `/.Trash` | +| `operation` | FUSE operation or internal operation name | `Read`, `Write`, `Lookup`, `listDirectory` | +| `error` | Main error message (non-cloud portion) | `not found`, `permission denied` | +| `message` | Human-readable log message describing what happened | `find entry error`, `configuration` | +| `level` | Log severity level | `debug`, `info`, `warn`, `error`, `fatal` | + +### Log examples + +#### Fatal error (causes termination) + +This example indicates Fusion could not authenticate with the cloud provider and terminated immediately: + +**Console logs:** +``` +11:23AM FTL [seqera-fusion] creating data store error="NoCredentialProviders: no valid providers in chain" +``` + +**File logs (`.fusion.log`):** +```json +{ + "level": "fatal", + "error": "NoCredentialProviders: no valid providers in chain", + "time": 1765738531263809778, + "message": "creating data store" +} +``` + +#### Recoverable error (operation failed, Fusion continues) + +This example indicates a file lookup failed and returned `ENOENT` to the application. Fusion continues running: + +**Console logs:** +``` +11:24AM ERR [seqera-fusion] find entry error error="element not found" path=/fusion/s3/bucket/missing-file.txt +``` + +**File logs (`.fusion.log`):** +```json +{ + "level": "error", + "error": "element not found", + "path": "/.Trash", + "time": 1765738531284473208, + "message": "listDirectory" +} +``` + +#### Cloud provider error + +This example shows a cloud provider error with structured fields: + +**File logs (`.fusion.log`):** +```json +{ + "level": "error", + "error": "not found", + "provider": "s3", + "error_code": "ResourceNotFound", + "provider_code": "NoSuchKey", + "provider_http_status": 404, + "provider_request_id": "ABCD1234EXAMPLE", + "provider_error": "The specified key does not exist.", + "message": "The requested resource was not found in cloud storage. Verify the file path is correct and the resource exists." +} +``` + +### Searching logs + +**Console logs** (grep-based searching): +```bash +# Find all cloud provider errors +grep 'provider=' .fusion.log + +# Find specific error categories +grep 'error_code=' .fusion.log | grep 'Forbidden' + +# Find operations on specific paths +grep 'path=/fusion/s3/bucket/file.txt' .fusion.log +``` + +**JSON logs** (jq-based searching): +```bash +# Find all cloud errors +jq 'select(.provider != null)' .fusion.log + +# Find S3 errors only +jq 'select(.provider == "s3")' .fusion.log + +# Find all Forbidden errors across providers +jq 'select(.error_code == "Forbidden")' .fusion.log + +# Find errors with request IDs (for cloud support tickets) +jq 'select(.provider_request_id != null) | {provider, provider_request_id, provider_code, message}' .fusion.log +``` + +## Nextflow integration + +When running Nextflow with Fusion: + +- Exit code `0`: Task completed successfully +- Exit code `127`: Retry logic activates (`.command.sh` not found) +- Exit code `174`: Fusion I/O error—check logs for details + +### Check exit codes + +```bash +fusion --foreground +exit_code=$? + +case $exit_code in + 0) + echo "Success" + ;; + 127) + echo "Command not found - may retry" + ;; + 174) + echo "Fusion I/O error - check Fusion logs" + ;; + *) + echo "Unknown exit code: $exit_code" + ;; +esac +``` diff --git a/fusion_docs/troubleshooting/fusion-snapshots.md b/fusion_docs/troubleshooting/fusion-snapshots.md new file mode 100644 index 000000000..94fc88f2f --- /dev/null +++ b/fusion_docs/troubleshooting/fusion-snapshots.md @@ -0,0 +1,302 @@ +--- +title: Fusion Snapshots +description: "Troubleshooting for Fusion Snapshots" +date created: "2025-11-29" +last updated: "2025-01-12" +tags: [troubleshooting, fusion, fusion-snapshots, configuration] +--- + +When working with Fusion Snapshots, you might encounter the following issues. + +## Exit code `175`: Checkpoint dump failed + +Task fails with exit code `175`, indicating the checkpoint dump operation did not complete successfully. + +This issue can occur due to: + +1. Checkpoint timeout - The process could not be saved within the reclamation window (typically due to high memory usage). The reclamation windows are: + - AWS Batch: 120 seconds (guaranteed) + - Google Batch: Up to 30 seconds (not guaranteed) + - Other factors: Large number of open file descriptors, complex process trees +1. Insufficient network bandwidth - Cannot upload checkpoint data fast enough. +1. Disk space issues - Not enough local storage for checkpoint files. + +To resolve this issue: + +1. Reduce memory usage: + + - Lower memory requested by tasks + - Process smaller data chunks + - Set `process.resourceLimits` to enforce limits: + + ```groovy + // AWS Batch example + process.resourceLimits = [cpus: 32, memory: '60.GB'] + + // Google Batch example (more conservative for 30s window) + process.resourceLimits = [cpus: 16, memory: '20.GB'] + ``` + +1. Increase network bandwidth: + + - Use instance types with higher guaranteed network bandwidth. + - Ensure memory:bandwidth ratio is appropriate (5:1 or better for AWS). + +1. Enable incremental snapshots (automatic on `x86_64`): + + - Verify you're using `x86_64` architecture: `uname -m` + - Avoid ARM64 instances if checkpoints are failing. + +1. Configure retry strategy: + + ```groovy + process { + maxRetries = 2 + errorStrategy = { + if (task.exitStatus == 175) { + return 'retry' + } else { + return 'terminate' + } + } + } + ``` + +See [AWS Batch instance selection](./guide/snapshots/aws.md#selecting-an-ec2-instance) or [Google Batch best practices](./guide/snapshots/gcp.md) for recommended configurations. + +:::tip +For a comprehensive explanation of exit code `175`, see [Exit Codes](./error-reference.md#exit-codes). +::: + +## Exit code `176`: Checkpoint restore failed + +Task fails with exit code `176` when attempting to restore from a checkpoint. + +This issue can occur due to: + +1. Corrupted checkpoint - Previous checkpoint did not complete properly. +1. Missing checkpoint files - Checkpoint data missing or inaccessible in object storage. +1. State conflict - Attempting to restore while dump still in progress. +1. Environment mismatch - Different environment between checkpoint and restore. + +To resolve this issue: + +1. Check if previous checkpoint completed: + - Review logs for "Dumping finished successfully". + - If the "Dumping finished successfully" message is missing, it means the previous checkpoint timed out with a `175` exit error. + +1. Verify checkpoint data exists: + - Check that the `.fusion/dump/` work directory contains checkpoint files. + - Ensure that the S3/GCS bucket is accessible. + - If the bucket is missing, open a support ticket. See [Getting help](#getting-help) for more information. + +1. Configure retry for dump failures first: + - Handle exit code `175` with retry. See [Retry handling](./guide/snapshots/configuration.md#retry-handling) for more information. + +:::tip +For a comprehensive explanation of exit code `176`, see [Exit Codes](./error-reference.md#exit-codes). +::: + +## Long checkpoint times + +Checkpoints take longer than expected, approaching timeout limits. + +This issue can occur due to: + +1. High memory usage - Memory is typically the primary factor affecting checkpoint time. +1. ARM64 architecture - Only full dumps available (no incremental snapshots). +1. Insufficient network bandwidth - Instance bandwidth too low for memory size. +1. Open file descriptors - Large number of open files or complex process trees. + +To resolve this issue: + +1. For AWS Batch (120-second window): + - Use instances with 5:1 or better memory:bandwidth ratio. + - Use `x86_64` instances for incremental snapshot support (`c6id`, `m6id`, `r6id` families). + - Check architecture: `uname -m` + +1. For Google Batch (30-second window): + - Use `x86_64` instances (mandatory for larger workloads). + - Use more conservative memory limits. + - Consider smaller instance types with better ratios. + +1. Review instance specifications: + - Verify guaranteed network bandwidth (not "up to" values). + - Prefer NVMe storage instances on AWS (instances with `d` suffix). + +See [Selecting an EC2 instance](./guide/snapshots/aws.md#selecting-an-ec2-instance) for detailed recommendations. + +## Frequent checkpoint failures + +Checkpoints consistently fail across multiple tasks. + +This issue can occur due to: + +1. Task too large for reclamation window - Memory usage exceeds what can be checkpointed in time (more common on Google Batch with 30-second window). +1. Network congestion or throttling - Bandwidth lower than instance specifications. +1. ARM64 architecture limitations - Full dumps only, requiring much more time and bandwidth. + +To resolve this issue: + +1. Split large tasks: + - Break into smaller, checkpointable units. + - Process data in chunks. + +1. Switch to `x86_64` instances: + - Essential for Google Batch. + - Recommended for AWS Batch tasks > 40 GiB. + +1. Adjust memory limits: + + ```groovy + // For AWS Batch + process.resourceLimits = [cpus: 32, memory: '60.GB'] + + // For Google Batch (more conservative) + process.resourceLimits = [cpus: 16, memory: '20.GB'] + ``` + +## SSL/TLS connection errors after restore + +Applications fail after restore with connection errors, especially HTTPS connections. + +This issue occurs when applications use HTTPS connections, as CRIU cannot preserve encrypted TCP connections (SSL/TLS). + +To resolve this issue, configure TCP close mode to drop connections during checkpoint: + +```groovy +process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close' +``` + +Applications will need to re-establish connections after restore. See [TCP connection handling](./guide/snapshots/configuration.md#tcp-connection-handling) for more information. + +## Debugging workflow + +To diagnose checkpoint problems: + +1. Check the exit code to identify the failure type: + + - Exit code `175`: Checkpoint dump failed - The snapshot could not be saved. + - Exit code `176`: Checkpoint restore failed - The snapshot could not be restored. + - Other exit codes: Likely an application error, not snapshot-related. + +1. Review task logs: + + - Check `.command.log` in the task work directory for Fusion Snapshots messages (prefixed with timestamps). + + :::tip + Enable `debug` logging for more details. + + ```groovy + process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug' + ``` + ::: + +1. Inspect your checkpoint data: + + 1. Open the `.fusion/dump/` folder: + + ```console + .fusion/dump/ + ├── 1/ # First dump + │ ├── pre_*.log # Pre-dump log (if incremental) + │ └── + ├── 2/ # Second dump + │ ├── pre_*.log + │ └── + ├── 3/ # Third dump (full) + │ ├── dump_*.log # Full dump log + │ ├── restore_*.log # Restore log (if restored) + │ └── + └── dump_metadata # Metadata tracking all dumps + ``` + + 1. For incremental dumps (PRE type), check for success markers at the end of the `pre_*.log` file: + + ```console + (66.525687) page-pipe: Killing page pipe + (66.563939) irmap: Running irmap pre-dump + (66.610871) Writing stats + (66.658902) Pre-dumping finished successfully + ``` + + 1. For full dumps (FULL type), check for success markers at the end of the `dump_*.log` file: + + ```console + (25.867099) Unseizing 90 into 2 + (27.160829) Writing stats + (27.197458) Dumping finished successfully + ``` + + 1. If the log ends abruptly without success message, check the last timestamp: + + ```console + (121.37535) Dumping path for 329 fd via self 353 [/path/to/file.tmp] + (121.65146) 90 fdinfo 330: pos: 0x4380000 flags: 100000/0 + # Log truncated - instance was reclaimed before dump completed + ``` + + - AWS Batch: Timestamps near 120 seconds indicate instance terminated during dump. + - Google Batch: Timestamps near 30 seconds indicate instance terminated during dump. + + Cause: Task memory too large or bandwidth too low for reclamation window. + + 1. For restore operations, check for a success marker at the end of the `restore_*.log` file: + + ```console + (145.81974) Running pre-resume scripts + (145.81994) Restore finished successfully. Tasks resumed. + (145.82001) Writing stats + ``` + +1. Verify your configuration: + + Confirm your environment is properly configured: + + - Instance type has sufficient network bandwidth. + - Memory usage is within safe limits for your cloud provider. + - Architecture is `x86_64` (not ARM64) if experiencing issues. + - Fusion Snapshots are enabled in your compute environment. + +1. Test with different instance types. If uncertain: + + - Run the same task with different instance types that have better disk iops and bandwidth guarantees and verify if Fusions Snapshots work there. + - Decrease memory usage to a manageable amount. + +:::tip +For detailed information about error codes and logging, see [Error reference](./error-reference.md). +::: + +## Getting help + +When contacting Seqera support about Fusion Snapshots issues, provide the following information to help diagnose the problem: + +1. Task information: + + - Nextflow version + - Cloud provider (AWS Batch or Google Cloud Batch) + - Instance type used + - Memory and CPU requested + - Linux kernel version + +1. Error details: + + - Exit code (especially `175` or `176` for snapshot failures) + - Task logs from the work directory (`.command.log`) + - Fusion Snapshots logs (if available) + - Timestamp of failure + +1. Configuration: + + - Compute environment settings in Platform + - Nextflow config related to Fusion Snapshots (`fusion.snapshots.*` settings) + - Architecture (`x86_64` or ARM64) + +1. Dump data (if available): + + Diagnostic data from snapshot operations can help identify the root cause: + + - Preferred: Complete `.fusion/dump/` directory from the task work directory. + - Minimum: The `dump_metadata` file and all `*.log` files from numbered dump folders. + + If the directory is too large to share, prioritize the metadata and log files over the full checkpoint data. diff --git a/fusion_docs/troubleshooting/general.md b/fusion_docs/troubleshooting/general.md new file mode 100644 index 000000000..064be5573 --- /dev/null +++ b/fusion_docs/troubleshooting/general.md @@ -0,0 +1,21 @@ +--- +title: General +description: "Troubleshooting for general Fusion issues" +date created: "2025-11-29" +last updated: "2025-01-12" +tags: [troubleshooting, fusion, fusion-snapshots, configuration] +--- + +When working with Fusion, you might encounter the following issues. + +## Too many open files + +Tasks fail with an error about too many open files. + +This issue occurs when the default file descriptor limit is too low for the container workload. + +To resolve this issue, increase the `ulimit` for the container. Append the following to your Nextflow configuration: + +```groovy +process.containerOptions = '--ulimit nofile=1048576:1048576' +```