From 0a0280d89d7900dc18c33e0c05286fba0382fbd9 Mon Sep 17 00:00:00 2001 From: Christopher Hakkaart Date: Mon, 12 Jan 2026 16:47:48 +1300 Subject: [PATCH 1/9] Testing --- fusion_docs/sidebar.json | 11 +- .../error-codes-exit-messages.md | 151 +++++++++ .../troubleshooting/fusion-snapshots.md | 302 ++++++++++++++++++ fusion_docs/troubleshooting/general.md | 21 ++ 4 files changed, 484 insertions(+), 1 deletion(-) create mode 100644 fusion_docs/troubleshooting/error-codes-exit-messages.md create mode 100644 fusion_docs/troubleshooting/fusion-snapshots.md create mode 100644 fusion_docs/troubleshooting/general.md diff --git a/fusion_docs/sidebar.json b/fusion_docs/sidebar.json index 037b2813d..ecef61b81 100644 --- a/fusion_docs/sidebar.json +++ b/fusion_docs/sidebar.json @@ -39,7 +39,16 @@ }, "licensing", "reference", - "troubleshooting", + { + "type": "category", + "label": "Troubleshooting", + "collapsed": true, + "items": [ + "troubleshooting/general", + "troubleshooting/fusion-snapshots", + "troubleshooting/error-codes-exit-messages" + ] + }, "faq", { "type": "link", diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md new file mode 100644 index 000000000..1a18e7e32 --- /dev/null +++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md @@ -0,0 +1,151 @@ +--- +title: Error codes and exit messages +description: "Reference for Fusion error codes, exit codes, and error messages" +date created: "2025-01-12" +last updated: "2025-01-12" +tags: [errors, error-codes, exit-codes, fuse, logging, fusion] +--- + +This page describes Fusion's error reporting system, including exit codes, FUSE status codes (errno values), cloud provider error categories, and internal error types. + +## Architectural limitations + +Fusion plays two critical roles in the execution of a Nextflow task. 
Each impacts error reporting differently: + +- **As a filesystem**: Fusion operates as a FUSE filesystem to provide transparent access to cloud storage. When processes execute I/O operations (e.g., `open()`, `read()`, `write()`, `unlink()`), they interact with Fusion through the standard POSIX filesystem interface. + +- **As a container entrypoint**: Fusion acts as the container's entrypoint, wrapping the execution of the Nextflow task. When a container starts, Fusion initializes first, mounts the filesystem, and then launches the actual task command. Upon task completion, both Fusion and the task must communicate their status through a single exit code, which is constrained to 256 possible values (`0`-`255`). This creates a ambiguity as Fusion reports its own failures using specific codes (e.g. `174`), but the task process can return any value in this range. Consequently, when a container exits with a specific code, determining whether the failure originated from Fusion or the task is impossible without examining the logs. + +These architectural roles impose constraints on how Fusion can communicate operation errors as well as its own failures: + +- **Limitations in POSIX error codes**: As a filesystem, Fusion communicates I/O operation failures only through standard POSIX error codes (`ENOENT`, `EACCES`, `EIO`, etc.). The filesystem interface provides no mechanism to return rich error context—it cannot explain why an operation failed beyond a generic error code. As a result, Fusion cannot tell the user process through the filesystem interface and returns only a generic I/O error (`EIO`). + +- **User process controls output**: When Fusion returns a POSIX error code, the process (not Fusion) determines what to display to the user. Fusion cannot control this output. + +- **Ambiguity of exit codes**: As the container entrypoint, the container's final exit code can originate from either Fusion itself (e.g., exit code `174`) or from the task's actual command. 
When a container exits with a failure code, there's no immediate way to determine the source of the failure without examining logs. + +- **Mixed log outputs**: Fusion also emits errors to stdout, where they typically mix with the task output. They are indistinguishable from a task's own errors when your review the task's logs. + +## Error paths + +Fusion is a FUSE filesystem that bridges applications and cloud object stores. Errors originate from multiple layers and propagate through the filesystem components in three main paths: + +1. **Cloud → Storage Backend → FUSE Layer → Kernel → Application** + + - Storage backends catch and normalize cloud errors (network timeouts, auth failures, rate limits) using the `clouderr` package + - Storage backends return normalized cloud errors (with provider-agnostic categories) or internal errors (`ErrNotFound`, `ErrReadOnly`, etc.) + - The FUSE layer maps both cloud errors and internal errors to FUSE status codes (`ENOENT`, `EACCES`, `EREMOTEIO`, `EIO`) + - The kernel translates FUSE status to errno values for the application + - Fusion logs cloud errors with structured details (provider, error code, HTTP status, request ID) + +1. **Failures during start up/shut down → Exit Code** + + - Startup: Configuration errors, missing credentials, or mount failures terminate Fusion immediately + - Shutdown: Async uploads or consolidation of pending operations + - Failures surface as exit code `174` (Fusion I/O error) or `1` (fatal error) + +1. **Background Operations → Logs** + + - Async uploads during normal operation, cache eviction, and snapshot operations log errors but may not surface them to applications + - Errors are reported in the Fusion logs (see [Understanding Fusion logs](#understanding-fusion-logs)) + +## Triaging errors + +When troubleshooting Fusion errors: + +1.
Check the exit code: + - Check the process exit code (`$?`) to understand if Fusion terminated normally (`0`), encountered an I/O error (`174`), or had a command issue (`127`). +1. Look at FUSE status in the logs: + - If a filesystem operation failed, use the logs to identify the FUSE status code (e.g., `ENOENT`, `EREMOTEIO`, `EIO`) returned to the application. +1. Check for cloud error fields: + - If you see `EREMOTEIO` or cloud-related failures, identify the specific cloud error fields in the logs: + - `provider` + - `provider_code` + - `provider_http_status` + - `provider_request_id` + + :::note + The field `error_code` indicates Fusion's internal categorization of the cloud error normalized across providers (e.g., `ResourceNotFound`, `Forbidden`, `RateLimited`). + ::: + +1. Identify the mapped internal error: + - The FUSE status code maps back to either a cloud error category or a specific internal error (e.g., `EACCES` indicates an authentication problem, `EREMOTEIO` indicates a cloud backend issue). Check the Fusion logs for more details on the error that triggered the FUSE status code (see [Understanding Fusion logs](#understanding-fusion-logs)). + +:::tip +Enable `debug` logging to for the full log: + +```bash +export FUSION_LOG_LEVEL=debug +``` + +::: + +## Exit codes + +Fusion binaries return specific exit codes to indicate the outcome of execution. + +:::tip +For exit codes `175` and `176`, see [Fusion Snapshots](./fusion-snapshots.md). +::: + +### Fusion binary + +| Exit code | Constant | Description | +|-----------|----------|-------------| +| `0` | - | Success, normal completion. | +| `1` | - | Fatal error during startup (via `log.Fatal()`). | +| `127` | - | Command not found (`.command.sh` missing). Triggers automatic retry up to `FUSION_MAX_MOUNT_RETRIES` times. | +| `174` | `ErrorExitCode` | Fusion I/O error, application-level input/output error. | + +:::note +`log.Fatal()` calls during startup produce exit code `1`. 
See [Fatal error messages](#fatal-error-messages) for the specific messages that trigger this exit. +::: + +The `sysexits.h` convention defines exit code 74 (`EX_IOERR`) for "input/output error", and the 150-199 range is conventionally reserved for application use (for example, by the Linux Standard Base init script conventions). In Fusion's context, 174 means "application input/output error". + +| Scenario | Log cue | Suggested next step | +|----------|----------------|---------------------| +| Failed to start FUSE process in background | `on FUSE process` | Check FUSE/kernel support. Verify `/dev/fuse` exists. | +| Failed to send SIGTERM to FUSE process | `on FUSE sigterm send` | Check kernel logs (`dmesg`) for crashed processes. | +| Failed to wait for FUSE process termination | `on FUSE stop wait` | Check for zombie processes. Review kernel signal handling. | +| Error during filesystem shutdown | `on file system shutdown` | Check Fusion logs for pending upload errors. See [Understanding Fusion logs](#understanding-fusion-logs). | +| Error during filesystem unmount | `on file system unmount` | Run `fusermount -u /fusion` or `umount -l /fusion` manually. | +| Failed read/write path validation | `check-rw` or `check-ro` | Verify cloud credentials and bucket permissions. | + +### GPU tracer binary + +| Exit code | Meaning | When | +|-----------|---------|------| +| `0` | Success | Normal completion (GPU detected or not) | +| `1` | Error | Failed to start GPU monitoring | +| `2` | Invalid input | Missing PID, invalid PID format, or PID `<= 0` | + +## FUSE status codes + +Fusion maps internal errors to standard FUSE status codes returned to the operating system. These are the [errno](https://man7.org/linux/man-pages/man3/errno.3.html) values applications receive when filesystem operations fail. + +:::note +For a complete list of errno values and their meanings, see the [Linux errno man page](https://man7.org/linux/man-pages/man3/errno.3.html) or run `errno -l` on a Linux system.
+::: + +### Returned status codes + +Fusion's filesystem implementation actively returns these status codes: + +| FUSE status | Errno | Description | Common causes in Fusion | +|------------------|-------|---------------------------|-------------------------| +| `fuse.OK` | 0 | Success | Operation completed successfully | +| `fuse.ENOENT` | 2 | No such file or directory | File/entry not found in cache or remote store; cloud provider ResourceNotFound/ContainerNotFound errors | +| `fuse.EINTR` | 4 | Interrupted system call | Context cancelled | +| `fuse.EIO` | 5 | I/O error | General I/O errors, internal failures, remote store errors, unknown non-cloud errors | +| `fuse.EACCES` | 13 | Permission denied | Write attempt to read-only path; cloud provider Unauthenticated/InvalidCredentials/Forbidden/AccountError errors | +| `fuse.EBUSY` | 16 | Device or resource busy | Cloud provider RateLimited/Busy/ResourceArchived errors | +| `fuse.EEXIST` | 17 | File exists | Cloud provider Conflict errors (e.g., resource already exists) | +| `fuse.EINVAL` | 22 | Invalid argument | Invalid parameters (e.g., readlink on non-symlink) | +| `fuse.EROFS` | 30 | Read-only file system | Attempt to modify read-only object | +| `fuse.ERANGE` | 34 | Result too large | Buffer too small for xattr value | +| `fuse.ENOSYS` | 38 | Function not implemented | Operation not wired in Fusion's FUSE layer | +| `fuse.ENOATTR` | 93 | No such attribute | Extended attribute not found | +| `fuse.ENOTSUP` | 95 | Operation not supported | Operation explicitly rejected (e.g., hard links) | +| `fuse.ETIMEDOUT` | 110 | Connection timed out | Context deadline exceeded | +| `fuse.EREMOTEIO` | 121 | Remote I/O error | Cloud provider errors (QuotaExceeded, unknown cloud errors) | diff --git a/fusion_docs/troubleshooting/fusion-snapshots.md b/fusion_docs/troubleshooting/fusion-snapshots.md new file mode 100644 index 000000000..94fc88f2f --- /dev/null +++ b/fusion_docs/troubleshooting/fusion-snapshots.md @@ -0,0 +1,302 
@@ +--- +title: Fusion Snapshots +description: "Troubleshooting for Fusion Snapshots" +date created: "2025-11-29" +last updated: "2025-01-12" +tags: [troubleshooting, fusion, fusion-snapshots, configuration] +--- + +When working with Fusion Snapshots, you might encounter the following issues. + +## Exit code `175`: Checkpoint dump failed + +Task fails with exit code `175`, indicating the checkpoint dump operation did not complete successfully. + +This issue can occur due to: + +1. Checkpoint timeout - The process could not be saved within the reclamation window (typically due to high memory usage). The reclamation windows are: + - AWS Batch: 120 seconds (guaranteed) + - Google Batch: Up to 30 seconds (not guaranteed) + - Other factors: Large number of open file descriptors, complex process trees +1. Insufficient network bandwidth - Cannot upload checkpoint data fast enough. +1. Disk space issues - Not enough local storage for checkpoint files. + +To resolve this issue: + +1. Reduce memory usage: + + - Lower memory requested by tasks + - Process smaller data chunks + - Set `process.resourceLimits` to enforce limits: + + ```groovy + // AWS Batch example + process.resourceLimits = [cpus: 32, memory: '60.GB'] + + // Google Batch example (more conservative for 30s window) + process.resourceLimits = [cpus: 16, memory: '20.GB'] + ``` + +1. Increase network bandwidth: + + - Use instance types with higher guaranteed network bandwidth. + - Ensure memory:bandwidth ratio is appropriate (5:1 or better for AWS). + +1. Enable incremental snapshots (automatic on `x86_64`): + + - Verify you're using `x86_64` architecture: `uname -m` + - Avoid ARM64 instances if checkpoints are failing. + +1. 
Configure retry strategy: + + ```groovy + process { + maxRetries = 2 + errorStrategy = { + if (task.exitStatus == 175) { + return 'retry' + } else { + return 'terminate' + } + } + } + ``` + +See [AWS Batch instance selection](./guide/snapshots/aws.md#selecting-an-ec2-instance) or [Google Batch best practices](./guide/snapshots/gcp.md) for recommended configurations. + +:::tip +For a comprehensive explanation of exit code `175`, see [Exit Codes](./error-reference.md#exit-codes). +::: + +## Exit code `176`: Checkpoint restore failed + +Task fails with exit code `176` when attempting to restore from a checkpoint. + +This issue can occur due to: + +1. Corrupted checkpoint - Previous checkpoint did not complete properly. +1. Missing checkpoint files - Checkpoint data missing or inaccessible in object storage. +1. State conflict - Attempting to restore while dump still in progress. +1. Environment mismatch - Different environment between checkpoint and restore. + +To resolve this issue: + +1. Check if previous checkpoint completed: + - Review logs for "Dumping finished successfully". + - If the "Dumping finished successfully" message is missing, it means the previous checkpoint timed out with a `175` exit error. + +1. Verify checkpoint data exists: + - Check that the `.fusion/dump/` work directory contains checkpoint files. + - Ensure that the S3/GCS bucket is accessible. + - If the bucket is missing, open a support ticket. See [Getting help](#getting-help) for more information. + +1. Configure retry for dump failures first: + - Handle exit code `175` with retry. See [Retry handling](./guide/snapshots/configuration.md#retry-handling) for more information. + +:::tip +For a comprehensive explanation of exit code `176`, see [Exit Codes](./error-reference.md#exit-codes). +::: + +## Long checkpoint times + +Checkpoints take longer than expected, approaching timeout limits. + +This issue can occur due to: + +1. 
High memory usage - Memory is typically the primary factor affecting checkpoint time. +1. ARM64 architecture - Only full dumps available (no incremental snapshots). +1. Insufficient network bandwidth - Instance bandwidth too low for memory size. +1. Open file descriptors - Large number of open files or complex process trees. + +To resolve this issue: + +1. For AWS Batch (120-second window): + - Use instances with 5:1 or better memory:bandwidth ratio. + - Use `x86_64` instances for incremental snapshot support (`c6id`, `m6id`, `r6id` families). + - Check architecture: `uname -m` + +1. For Google Batch (30-second window): + - Use `x86_64` instances (mandatory for larger workloads). + - Use more conservative memory limits. + - Consider smaller instance types with better ratios. + +1. Review instance specifications: + - Verify guaranteed network bandwidth (not "up to" values). + - Prefer NVMe storage instances on AWS (instances with `d` suffix). + +See [Selecting an EC2 instance](./guide/snapshots/aws.md#selecting-an-ec2-instance) for detailed recommendations. + +## Frequent checkpoint failures + +Checkpoints consistently fail across multiple tasks. + +This issue can occur due to: + +1. Task too large for reclamation window - Memory usage exceeds what can be checkpointed in time (more common on Google Batch with 30-second window). +1. Network congestion or throttling - Bandwidth lower than instance specifications. +1. ARM64 architecture limitations - Full dumps only, requiring much more time and bandwidth. + +To resolve this issue: + +1. Split large tasks: + - Break into smaller, checkpointable units. + - Process data in chunks. + +1. Switch to `x86_64` instances: + - Essential for Google Batch. + - Recommended for AWS Batch tasks > 40 GiB. + +1. 
Adjust memory limits: + + ```groovy + // For AWS Batch + process.resourceLimits = [cpus: 32, memory: '60.GB'] + + // For Google Batch (more conservative) + process.resourceLimits = [cpus: 16, memory: '20.GB'] + ``` + +## SSL/TLS connection errors after restore + +Applications fail after restore with connection errors, especially HTTPS connections. + +This issue occurs when applications use HTTPS connections, as CRIU cannot preserve encrypted TCP connections (SSL/TLS). + +To resolve this issue, configure TCP close mode to drop connections during checkpoint: + +```groovy +process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close' +``` + +Applications will need to re-establish connections after restore. See [TCP connection handling](./guide/snapshots/configuration.md#tcp-connection-handling) for more information. + +## Debugging workflow + +To diagnose checkpoint problems: + +1. Check the exit code to identify the failure type: + + - Exit code `175`: Checkpoint dump failed - The snapshot could not be saved. + - Exit code `176`: Checkpoint restore failed - The snapshot could not be restored. + - Other exit codes: Likely an application error, not snapshot-related. + +1. Review task logs: + + - Check `.command.log` in the task work directory for Fusion Snapshots messages (prefixed with timestamps). + + :::tip + Enable `debug` logging for more details. + + ```groovy + process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug' + ``` + ::: + +1. Inspect your checkpoint data: + + 1. Open the `.fusion/dump/` folder: + + ```console + .fusion/dump/ + ├── 1/ # First dump + │ ├── pre_*.log # Pre-dump log (if incremental) + │ └── + ├── 2/ # Second dump + │ ├── pre_*.log + │ └── + ├── 3/ # Third dump (full) + │ ├── dump_*.log # Full dump log + │ ├── restore_*.log # Restore log (if restored) + │ └── + └── dump_metadata # Metadata tracking all dumps + ``` + + 1. 
For incremental dumps (PRE type), check for success markers at the end of the `pre_*.log` file: + + ```console + (66.525687) page-pipe: Killing page pipe + (66.563939) irmap: Running irmap pre-dump + (66.610871) Writing stats + (66.658902) Pre-dumping finished successfully + ``` + + 1. For full dumps (FULL type), check for success markers at the end of the `dump_*.log` file: + + ```console + (25.867099) Unseizing 90 into 2 + (27.160829) Writing stats + (27.197458) Dumping finished successfully + ``` + + 1. If the log ends abruptly without a success message, check the last timestamp: + + ```console + (121.37535) Dumping path for 329 fd via self 353 [/path/to/file.tmp] + (121.65146) 90 fdinfo 330: pos: 0x4380000 flags: 100000/0 + # Log truncated - instance was reclaimed before dump completed + ``` + + - AWS Batch: Timestamps near 120 seconds indicate instance terminated during dump. + - Google Batch: Timestamps near 30 seconds indicate instance terminated during dump. + + Cause: Task memory too large or bandwidth too low for reclamation window. + + 1. For restore operations, check for a success marker at the end of the `restore_*.log` file: + + ```console + (145.81974) Running pre-resume scripts + (145.81994) Restore finished successfully. Tasks resumed. + (145.82001) Writing stats + ``` + +1. Verify your configuration: + + Confirm your environment is properly configured: + + - Instance type has sufficient network bandwidth. + - Memory usage is within safe limits for your cloud provider. + - Architecture is `x86_64` (not ARM64) if experiencing issues. + - Fusion Snapshots are enabled in your compute environment. + +1. Test with different instance types. If uncertain: + + - Run the same task with different instance types that have better disk IOPS and bandwidth guarantees and verify whether Fusion Snapshots works there. + - Decrease memory usage to a manageable amount.
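The log checks in the debugging workflow above could be scripted roughly as follows. This is a minimal sketch, not part of Fusion: the function name `summarize_log` is hypothetical, and it assumes only the `(seconds.fraction)` line prefix and the three success messages quoted in the excerpts above.

```python
import re

# Success markers quoted from the log excerpts above.
SUCCESS_MARKERS = {
    "pre-dump": "Pre-dumping finished successfully",
    "dump": "Dumping finished successfully",
    "restore": "Restore finished successfully",
}

# Matches the "(seconds.fraction)" prefix at the start of a log line.
TIMESTAMP = re.compile(r"^\((\d+\.\d+)\)")


def summarize_log(text):
    """Return (kind, last_timestamp).

    kind is the key of the success marker found in the log, or None if
    no marker is present. last_timestamp is the last parsed timestamp
    in seconds, or None if no line carries one.
    """
    kind = None
    last_ts = None
    for line in text.splitlines():
        line = line.strip()
        match = TIMESTAMP.match(line)
        if match:
            last_ts = float(match.group(1))
        for name, marker in SUCCESS_MARKERS.items():
            if marker in line:
                kind = name
    return kind, last_ts
```

A result of `(None, t)` with `t` close to the reclamation window (about 120 seconds on AWS Batch, about 30 seconds on Google Batch) matches the truncated-log case described above, where the instance was reclaimed before the dump completed.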
+ +:::tip +For detailed information about error codes and logging, see [Error reference](./error-reference.md). +::: + +## Getting help + +When contacting Seqera support about Fusion Snapshots issues, provide the following information to help diagnose the problem: + +1. Task information: + + - Nextflow version + - Cloud provider (AWS Batch or Google Cloud Batch) + - Instance type used + - Memory and CPU requested + - Linux kernel version + +1. Error details: + + - Exit code (especially `175` or `176` for snapshot failures) + - Task logs from the work directory (`.command.log`) + - Fusion Snapshots logs (if available) + - Timestamp of failure + +1. Configuration: + + - Compute environment settings in Platform + - Nextflow config related to Fusion Snapshots (`fusion.snapshots.*` settings) + - Architecture (`x86_64` or ARM64) + +1. Dump data (if available): + + Diagnostic data from snapshot operations can help identify the root cause: + + - Preferred: Complete `.fusion/dump/` directory from the task work directory. + - Minimum: The `dump_metadata` file and all `*.log` files from numbered dump folders. + + If the directory is too large to share, prioritize the metadata and log files over the full checkpoint data. diff --git a/fusion_docs/troubleshooting/general.md b/fusion_docs/troubleshooting/general.md new file mode 100644 index 000000000..064be5573 --- /dev/null +++ b/fusion_docs/troubleshooting/general.md @@ -0,0 +1,21 @@ +--- +title: General +description: "Troubleshooting for general Fusion issues" +date created: "2025-11-29" +last updated: "2025-01-12" +tags: [troubleshooting, fusion, fusion-snapshots, configuration] +--- + +When working with Fusion, you might encounter the following issues. + +## Too many open files + +Tasks fail with an error about too many open files. + +This issue occurs when the default file descriptor limit is too low for the container workload. + +To resolve this issue, increase the `ulimit` for the container. 
Append the following to your Nextflow configuration: + +```groovy +process.containerOptions = '--ulimit nofile=1048576:1048576' +``` From 39aff272e8271f994c52e496e65c48d564a1fb99 Mon Sep 17 00:00:00 2001 From: Llewellyn vd Berg <113503285+llewellyn-sl@users.noreply.github.com> Date: Wed, 14 Jan 2026 11:09:42 +0200 Subject: [PATCH 2/9] Add remaining user-facing error reference, lint for GD style guide compliance --- .../error-codes-exit-messages.md | 305 +++++++++++++++++- .../error-codes-exit-messages_old.md | 151 +++++++++ 2 files changed, 450 insertions(+), 6 deletions(-) create mode 100644 fusion_docs/troubleshooting/error-codes-exit-messages_old.md diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md index 1a18e7e32..215cc4c1e 100644 --- a/fusion_docs/troubleshooting/error-codes-exit-messages.md +++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md @@ -14,17 +14,17 @@ Fusion plays two critical roles in the execution of a Nextflow task. Each impact - **As a filesystem**: Fusion operates as a FUSE filesystem to provide transparent access to cloud storage. When processes execute I/O operations (e.g., `open()`, `read()`, `write()`, `unlink()`), they interact with Fusion through the standard POSIX filesystem interface. -- **As a container entrypoint**: Fusion acts as the container's entrypoint, wrapping the execution of the Nextflow task. When a container starts, Fusion initializes first, mounts the filesystem, and then launches the actual task command. Upon task completion, both Fusion and the task must communicate their status through a single exit code, which is constrained to 256 possible values (`0`-`255`). This creates a ambiguity as Fusion reports its own failures using specific codes (e.g. `174`), but the task process can return any value in this range. 
Consequently, when a container exits with a specific code, determining whether the failure originated from Fusion or the task is impossible without examining the logs. +- **As a container entrypoint**: Fusion acts as the container's entrypoint, wrapping the execution of the Nextflow task. When a container starts, Fusion initializes first, mounts the filesystem, and then launches the actual task command. Upon task completion, both Fusion and the task must communicate their status through a single exit code, which is constrained to 256 possible values (`0`-`255`). This creates an ambiguity as Fusion reports its own failures using specific codes (e.g. `174`), but the task process can return any value in this range. Consequently, when a container exits with a specific code, determining whether the failure originated from Fusion or the task is impossible without examining the logs. These architectural roles impose constraints on how Fusion can communicate operation errors as well as its own failures: -- **Limitations in POSIX error codes**: As a filesystem, Fusion communicates I/O operation failures only through standard POSIX error codes (`ENOENT`, `EACCES`, `EIO`, etc.). The filesystem interface provides no mechanism to return rich error context—it cannot explain why an operation failed beyond a generic error code. As a result, Fusion cannot tell the user process through the filesystem interface and returns only a generic I/O error (`EIO`). +- **Limitations in POSIX error codes**: As a filesystem, Fusion communicates I/O operation failures only through standard POSIX error codes (`ENOENT`, `EACCES`, `EIO`, etc.). The filesystem interface provides no mechanism to return rich error context—it cannot explain why an operation failed beyond a generic error code. As a result, Fusion cannot provide detailed error information to the user process through the filesystem interface and returns only a generic I/O error (`EIO`). 
- **User process controls output**: When Fusion returns a POSIX error code, the process (not Fusion) determines what to display to the user. Fusion cannot control this output. - **Ambiguity of exit codes**: As the container entrypoint, the container's final exit code can originate from either Fusion itself (e.g., exit code `174`) or from the task's actual command. When a container exits with a failure code, there's no immediate way to determine the source of the failure without examining logs. -- **Mixed log outputs**: Fusion also emits errors to stdout, where they typically mix with the task output. They are indistinguishable from a task's own errors when your review the task's logs. +- **Mixed log outputs**: Fusion also emits errors to stdout, where they typically mix with the task output. They are indistinguishable from a task's own errors when you review the task's logs. ## Error paths @@ -38,7 +38,7 @@ Fusion is a FUSE filesystem that bridges applications and cloud object stores. E - The kernel translates FUSE status to errno values for the application - Fusion logs cloud errors with structured details (provider, error code, HTTP status, request ID) -1. **Failures during start up/shut down → Exit Code** +1. **Failures during startup/shutdown → Exit Code** - Startup: Configuration errors, missing credentials, or mount failures terminate Fusion immediately - Shutdown: Async uploads or consolidation of pending operations @@ -72,7 +72,7 @@ When troubleshooting Fusion errors: - The FUSE status code maps back to either a cloud error category or a specific internal error (e.g., `EACCES` indicates an authentication problem, `EREMOTEIO` indicates a cloud backend issue). Check the Fusion logs for more details on the error that triggered the FUSE status code (see [Understanding Fusion logs](#understanding-fusion-logs)). 
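The first triage step (checking the exit code) can be sketched as a small lookup. This is an illustrative helper, not part of Fusion: the names `KNOWN_CODES` and `triage_hint` are hypothetical, and the table contains only the exit codes documented on these pages. Because the task can return any value from `0` to `255`, a match is only a hint and the logs remain the authority.

```python
# Exit codes documented for the Fusion binary and Fusion Snapshots.
KNOWN_CODES = {
    1: "fatal error during startup",
    127: "command not found (.command.sh missing)",
    174: "Fusion I/O error",
    175: "checkpoint dump failed (Fusion Snapshots)",
    176: "checkpoint restore failed (Fusion Snapshots)",
}


def triage_hint(exit_code):
    """Return a first-pass hint for a container exit code."""
    if exit_code == 0:
        return "0: success"
    meaning = KNOWN_CODES.get(exit_code)
    if meaning is None:
        # Not a documented Fusion code, so the task is the likely source.
        return f"{exit_code}: not a documented Fusion code; likely from the task"
    # The task may return the same value, so the logs must confirm the source.
    return f"{exit_code}: possibly {meaning}; confirm in the Fusion logs"
```

For example, `triage_hint(174)` points at a Fusion I/O error, but the source of the code still has to be confirmed in the logs.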
:::tip -Enable `debug` logging to for the full log: +Enable `debug` logging for the full log: ```bash export FUSION_LOG_LEVEL=debug @@ -146,6 +146,299 @@ Fusion's filesystem implementation actively returns these status codes: | `fuse.ERANGE` | 34 | Result too large | Buffer too small for xattr value | | `fuse.ENOSYS` | 38 | Function not implemented | Operation not wired in Fusion's FUSE layer | | `fuse.ENOATTR` | 93 | No such attribute | Extended attribute not found | -| `fuse.ENOTSUP` | 95 | Operation not supported | Operation explicitly rejected (e.g., hard links) | +| `fuse.ENOTSUP` | 95 | Operation not supported | Operation explicitly rejected (for example, hard links) | | `fuse.ETIMEDOUT` | 110 | Connection timed out | Context deadline exceeded | | `fuse.EREMOTEIO` | 121 | Remote I/O error | Cloud provider errors (QuotaExceeded, unknown cloud errors) | + +### Troubleshooting FUSE status codes + +When you encounter a FUSE status code, use the following table to identify likely causes and next steps: + +| Status | Likely causes and troubleshooting steps | +|--------|----------------------------------------| +| `ENOENT` | Path typo or object deleted from remote store. Check if the path exists using your cloud provider's CLI (`aws s3 ls`, `gsutil ls`, `az storage blob list`). | +| `EACCES` | Mount configured as read-only, object ACL blocking writes, or authentication/permission issues. Check cloud IAM permissions and credentials. | +| `EEXIST` | Resource already exists in cloud storage. Check if the operation was retried or if there's a naming conflict. | +| `EIO` | General I/O error or unknown internal failure. Check Fusion logs for the underlying cause. See [Understanding Fusion logs](#understanding-fusion-logs). | +| `EREMOTEIO` | Cloud provider error. Check Fusion logs for detailed cloud error information (provider, error code, HTTP status, request ID). May indicate quota exceeded, rate limiting, or other cloud-specific issues. 
See [Understanding Fusion logs](#understanding-fusion-logs). | +| `EBUSY` | Cloud provider rate limiting requests or temporarily busy. Retry with backoff. Check cloud provider dashboard for service status. | +| `ETIMEDOUT` | Operation timed out due to network connectivity issues or slow cloud response. Check network connection and cloud service status. | +| `EINTR` | Caller cancelled the operation. Usually safe to retry. | +| `ENOTSUP` | Unsupported operation. Adjust workload to avoid hard links. Use symlinks or copies instead. | +| `ENOSYS` | Operation not implemented in Fusion. Check if the operation is supported or use an alternative approach. | + +### ENOSYS vs ENOTSUP + +Both indicate an operation cannot be performed, but they have different meanings: + +- **`ENOSYS` (Function not implemented)**: The operation is not implemented in Fusion's FUSE layer. This is the default response for operations Fusion doesn't handle. + +- **`ENOTSUP` (Operation not supported)**: The operation exists in Fusion but is explicitly rejected for specific cases. For example: + - **Hard links (`Link`)**: Fusion explicitly returns `ENOTSUP` because hard links cannot be meaningfully represented on object storage backends. + - **Whiteout character device creation**: During overlay-style renames, if creating the whiteout marker fails, `ENOTSUP` signals this specific failure. + +:::tip +If you encounter `ENOTSUP` on hard links, use symbolic links (`ln -s`) or file copies instead. +::: + +### EREMOTEIO vs EIO + +Fusion distinguishes between local I/O failures and cloud provider errors: + +- **`EREMOTEIO` (Remote I/O error)**: Used specifically for cloud provider errors. This errno value indicates that: + - The error originated from a remote cloud storage system (S3, Azure Blob Storage, or Google Cloud Storage). + - The failure is due to cloud provider issues (quota exceeded, rate limiting, service unavailable). 
+ - Debugging should focus on cloud provider logs and status, not local system issues. + - The request ID from logs can be provided to cloud support for investigation. + +- **`EIO` (I/O error)**: Used as a generic catch-all for: + - Unknown internal errors that are not cloud-related. + - Local filesystem or system failures. + - Errors that cannot be classified into more specific categories. + +:::note +Using `EREMOTEIO` for cloud errors provides more accurate error context, making it easier to distinguish between local system issues and cloud service problems during troubleshooting and monitoring. +::: + +:::tip +When you see `EREMOTEIO`, check the Fusion logs for cloud error fields: `provider`, `error_code`, `provider_code`, `provider_http_status`, and `provider_request_id`. See [Understanding Fusion logs](#understanding-fusion-logs). +::: + +### Error mapping + +Fusion maps cloud provider errors and internal errors to FUSE status codes. + +#### Cloud provider error mapping + +Cloud provider errors are normalized and mapped to appropriate FUSE status codes: + +| Cloud error category | FUSE status | Examples | +|---------------------|-------------|----------| +| `Unauthenticated` | `fuse.EACCES` | No credentials provided | +| `InvalidCredentials` | `fuse.EACCES` | Wrong, malformed, or expired credentials | +| `Forbidden` | `fuse.EACCES` | Valid credentials, insufficient permissions | +| `AccountError` | `fuse.EACCES` | Account disabled, suspended, or has billing issues | +| `ResourceNotFound` | `fuse.ENOENT` | S3 "NoSuchKey", Azure "BlobNotFound", GCS 404 | +| `ContainerNotFound` | `fuse.ENOENT` | S3 "NoSuchBucket", Azure "ContainerNotFound", GCS 404 with bucket error | +| `RateLimited` | `fuse.EBUSY` | Request rate limits exceeded | +| `Busy` | `fuse.EBUSY` | Service temporarily unavailable or overloaded | +| `ResourceArchived` | `fuse.EBUSY` | Resource in archived/transitional state (for example, Glacier) | +| `Conflict` | `fuse.EEXIST` | Resource already exists 
or precondition failed | +| `InvalidArgument` | `fuse.EINVAL` | Malformed request or invalid parameters | +| `QuotaExceeded` | `fuse.EREMOTEIO` | Storage quota or capacity limit reached | +| `Unknown` (cloud errors) | `fuse.EREMOTEIO` | Unclassified cloud provider errors | + +#### Internal error mapping + +| Internal error | FUSE status | +|---------------|-------------| +| Not found | `fuse.ENOENT` | +| Read-only | `fuse.EROFS` | +| Unsupported | `fuse.ENOSYS` | +| Context cancelled | `fuse.EINTR` | +| Context deadline exceeded | `fuse.ETIMEDOUT` | +| Other errors | `fuse.EIO` | + +## Cloud provider error categories + +Fusion normalizes errors from different cloud storage providers (S3, Azure Blob Storage, Google Cloud Storage) into consistent categories. When you see an `error_code` field in Fusion logs, it represents one of these categories: + +| Category | Description | Common provider codes | +|----------|-------------|----------------------| +| `ResourceNotFound` | Requested resource (object/file) does not exist | S3: "NoSuchKey", Azure: "BlobNotFound", GCS: HTTP 404 | +| `ContainerNotFound` | Storage container (bucket) does not exist | S3: "NoSuchBucket", Azure: "ContainerNotFound", GCS: HTTP 404 with bucket error | +| `Unauthenticated` | No credentials provided | S3: "MissingSecurityHeader", GCS: HTTP 401 with no credentials | +| `InvalidCredentials` | Credentials provided but wrong, malformed, or expired | S3: "InvalidAccessKeyId", "ExpiredToken", Azure: "InvalidAuthenticationInfo", GCS: HTTP 401 with invalid credentials | +| `Forbidden` | Valid credentials but insufficient permissions | S3: "AccessDenied", Azure: "AuthorizationPermissionMismatch", GCS: HTTP 403 | +| `AccountError` | Account-level problems (disabled, suspended, billing issues) | S3: "AccountProblem", Azure: "AccountIsDisabled", GCS: HTTP 403 with specific messages | +| `ResourceArchived` | Resource exists but is in archived/transitional state | S3: "InvalidObjectState" (Glacier), Azure: 
"BlobArchived" | +| `RateLimited` | Request rate limits exceeded | S3: "SlowDown", Azure: "TooManyRequests", GCS: HTTP 429 | +| `Busy` | Service temporarily unavailable or overloaded | S3: "ServiceUnavailable", "InternalError", Azure: "ServerBusy", GCS: HTTP 503 | +| `Conflict` | Resource state conflict or precondition failure | S3: "BucketAlreadyExists", Azure: "BlobAlreadyExists", "ConditionNotMet", GCS: HTTP 409/412 | +| `InvalidArgument` | Malformed request or invalid parameters | S3: "InvalidArgument", "InvalidRange", Azure: "InvalidQueryParameterValue", GCS: HTTP 400 | +| `QuotaExceeded` | Storage quota or capacity limit reached | S3: "TooManyBuckets", Azure: "AccountLimitExceeded", GCS: HTTP 429 with quota message | +| `Unknown` | Unclassified or unexpected error | Various | + +## Fatal error messages + +These messages indicate Fusion terminated immediately with exit code 1. They occur during startup or critical failures: + +| Message | Cause | +|---------|-------| +| `configuring fusion` | Failed to configure Fusion (invalid config, missing environment variables) | +| `building remote store options` | Failed to build remote store options | +| `creating metadata store` | Failed to create metadata store | +| `creating data store` | Failed to create data store connection | +| `validating work path` | Work path validation failed (empty prefix or connection error) | +| `creating filesystem` | Failed to create FUSE filesystem | +| `mounting filesystem` | Failed to mount FUSE filesystem | +| `could not get current job attempt` | Failed to get job attempt from compute environment | + +## Understanding Fusion logs + +Fusion emits logs in two formats: + +- **Console logs** (stderr): Timestamped, human-readable format with `[seqera-fusion]` prefix. These logs are collected by Seqera Platform and shown in the UI. They provide immediate visibility during runtime. 
+- **File logs** (`${workdir}/.fusion.log`): Structured logs in JSON format for detailed analysis and troubleshooting. + +:::note +The `[seqera-fusion]` prefix for console logs was introduced in Fusion v2.6, v2.5.9, and v2.4.20. +::: + +### Log fields reference + +Fusion uses structured logging with consistent field names. Understanding these fields is essential for troubleshooting. + +#### Cloud error fields + +These fields appear automatically when a cloud provider error is detected: + +| Field | Description | Example values | +|-------|-------------|----------------| +| `provider` | Cloud provider name | `s3`, `azure`, `gcs` | +| `error_code` | Normalized error category (provider-agnostic) | `Forbidden`, `ResourceNotFound`, `InvalidCredentials` | +| `provider_code` | Provider-specific error code | S3: `NoSuchKey`, Azure: `BlobNotFound`, GCS: `invalid` | +| `provider_http_status` | HTTP status code from cloud provider | `403`, `404`, `429`, `500` | +| `provider_request_id` | Request ID for cloud provider support tickets | `ABCD1234EXAMPLE`, `b8e8a1f5-...` | +| `provider_error` | Original error message from cloud provider | `The specified key does not exist.` | + +:::tip +When opening support tickets with cloud providers, always include the `provider_request_id` from logs. This enables their support team to trace the exact request in their systems. 
+::: + +#### Common operational fields + +These fields appear in most log entries to provide operation context: + +| Field | Description | Example values | +|-------|-------------|----------------| +| `path` | Filesystem path where operation occurred | `/fusion/s3/bucket/file.txt`, `/.Trash` | +| `operation` | FUSE operation or internal operation name | `Read`, `Write`, `Lookup`, `listDirectory` | +| `error` | Main error message (non-cloud portion) | `not found`, `permission denied` | +| `message` | Human-readable log message describing what happened | `find entry error`, `configuration` | +| `level` | Log severity level | `debug`, `info`, `warn`, `error`, `fatal` | + +### Log examples + +#### Fatal error (causes termination) + +This example indicates Fusion could not authenticate with the cloud provider and terminated immediately: + +**Console logs:** +``` +11:23AM FTL [seqera-fusion] creating data store error="NoCredentialProviders: no valid providers in chain" +``` + +**File logs (`.fusion.log`):** +```json +{ + "level": "fatal", + "error": "NoCredentialProviders: no valid providers in chain", + "time": 1765738531263809778, + "message": "creating data store" +} +``` + +#### Recoverable error (operation failed, Fusion continues) + +This example indicates a file lookup failed and returned `ENOENT` to the application. 
Fusion continues running: + +**Console logs:** +``` +11:24AM ERR [seqera-fusion] find entry error error="element not found" path=/fusion/s3/bucket/missing-file.txt +``` + +**File logs (`.fusion.log`):** +```json +{ + "level": "error", + "error": "element not found", + "path": "/.Trash", + "time": 1765738531284473208, + "message": "listDirectory" +} +``` + +#### Cloud provider error + +This example shows a cloud provider error with structured fields: + +**File logs (`.fusion.log`):** +```json +{ + "level": "error", + "error": "not found", + "provider": "s3", + "error_code": "ResourceNotFound", + "provider_code": "NoSuchKey", + "provider_http_status": 404, + "provider_request_id": "ABCD1234EXAMPLE", + "provider_error": "The specified key does not exist.", + "message": "The requested resource was not found in cloud storage. Verify the file path is correct and the resource exists." +} +``` + +### Searching logs + +**Console logs** (grep-based searching): +```bash +# Find all cloud provider errors +grep 'provider=' .fusion.log + +# Find specific error categories +grep 'error_code=' .fusion.log | grep 'Forbidden' + +# Find operations on specific paths +grep 'path=/fusion/s3/bucket/file.txt' .fusion.log +``` + +**JSON logs** (jq-based searching): +```bash +# Find all cloud errors +jq 'select(.provider != null)' .fusion.log + +# Find S3 errors only +jq 'select(.provider == "s3")' .fusion.log + +# Find all Forbidden errors across providers +jq 'select(.error_code == "Forbidden")' .fusion.log + +# Find errors with request IDs (for cloud support tickets) +jq 'select(.provider_request_id != null) | {provider, provider_request_id, provider_code, message}' .fusion.log +``` + +## Nextflow integration + +When running Nextflow with Fusion: + +- Exit code `0`: Task completed successfully +- Exit code `127`: Retry logic activates (`.command.sh` not found) +- Exit code `174`: Fusion I/O error—check logs for details + +### Check exit codes + +```bash +fusion --foreground +exit_code=$? 
+ +case $exit_code in + 0) + echo "Success" + ;; + 127) + echo "Command not found - may retry" + ;; + 174) + echo "Fusion I/O error - check Fusion logs" + ;; + *) + echo "Unknown exit code: $exit_code" + ;; +esac +``` + +## Related documentation + +- [Fusion overview](https://docs.seqera.io/fusion/) +- [Fusion configuration](https://docs.seqera.io/fusion/configuration/) +- [Fusion Snapshots](./fusion-snapshots.md) diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages_old.md b/fusion_docs/troubleshooting/error-codes-exit-messages_old.md new file mode 100644 index 000000000..1a18e7e32 --- /dev/null +++ b/fusion_docs/troubleshooting/error-codes-exit-messages_old.md @@ -0,0 +1,151 @@ +--- +title: Error codes and exit messages +description: "Reference for Fusion error codes, exit codes, and error messages" +date created: "2025-01-12" +last updated: "2025-01-12" +tags: [errors, error-codes, exit-codes, fuse, logging, fusion] +--- + +This page describes Fusion's error reporting system, including exit codes, FUSE status codes (errno values), cloud provider error categories, and internal error types. + +## Architectural limitations + +Fusion plays two critical roles in the execution of a Nextflow task. Each impacts error reporting differently: + +- **As a filesystem**: Fusion operates as a FUSE filesystem to provide transparent access to cloud storage. When processes execute I/O operations (e.g., `open()`, `read()`, `write()`, `unlink()`), they interact with Fusion through the standard POSIX filesystem interface. + +- **As a container entrypoint**: Fusion acts as the container's entrypoint, wrapping the execution of the Nextflow task. When a container starts, Fusion initializes first, mounts the filesystem, and then launches the actual task command. Upon task completion, both Fusion and the task must communicate their status through a single exit code, which is constrained to 256 possible values (`0`-`255`). 
This creates a ambiguity as Fusion reports its own failures using specific codes (e.g. `174`), but the task process can return any value in this range. Consequently, when a container exits with a specific code, determining whether the failure originated from Fusion or the task is impossible without examining the logs. + +These architectural roles impose constraints on how Fusion can communicate operation errors as well as its own failures: + +- **Limitations in POSIX error codes**: As a filesystem, Fusion communicates I/O operation failures only through standard POSIX error codes (`ENOENT`, `EACCES`, `EIO`, etc.). The filesystem interface provides no mechanism to return rich error context—it cannot explain why an operation failed beyond a generic error code. As a result, Fusion cannot tell the user process through the filesystem interface and returns only a generic I/O error (`EIO`). + +- **User process controls output**: When Fusion returns a POSIX error code, the process (not Fusion) determines what to display to the user. Fusion cannot control this output. + +- **Ambiguity of exit codes**: As the container entrypoint, the container's final exit code can originate from either Fusion itself (e.g., exit code `174`) or from the task's actual command. When a container exits with a failure code, there's no immediate way to determine the source of the failure without examining logs. + +- **Mixed log outputs**: Fusion also emits errors to stdout, where they typically mix with the task output. They are indistinguishable from a task's own errors when your review the task's logs. + +## Error paths + +Fusion is a FUSE filesystem that bridges applications and cloud object stores. Errors originate from multiple layers and propagate through the filesystem components in three main paths: + +1. 
**Cloud → Storage Backend → FUSE Layer → Kernel → Application** + + - Storage backends catch and normalize cloud errors (network timeouts, auth failures, rate limits) using the `clouderr` package + - Storage backends return normalized cloud errors (with provider-agnostic categories) or internal errors (`ErrNotFound`, `ErrReadOnly`, etc.) + - The FUSE layer maps both cloud errors and internal errors to FUSE status codes (`ENOENT`, `EACCES`, `EREMOTEIO`, `EIO`) + - The kernel translates FUSE status to errno values for the application + - Fusion logs cloud errors with structured details (provider, error code, HTTP status, request ID) + +1. **Failures during start up/shut down → Exit Code** + + - Startup: Configuration errors, missing credentials, or mount failures terminate Fusion immediately + - Shutdown: Async uploads or consolidation of pending operations + - Failures surface as exit code `174` (Fusion I/O error) or `1` (fatal error) + +1. **Background Operations → Logs** + + - Async uploads during normal operation, cache eviction, and snapshot operations log errors but may not surface them to applications + - Errors are reported in Fusion (see [Understanding Fusion logs](#understanding-fusion-logs)) + +## Triaging errors + +When troubleshooting Fusion errors: + +1. Check the exit code: + - Check the process exit code (`$?`) to understand if Fusion terminated normally (`0`), encountered an I/O error (`174`), or had a command issue (`127`). +1. Look at FUSE status in the logs: + - If a filesystem operation failed, use the logs to identify the FUSE status code (e.g., `ENOENT`, `EREMOTEIO`, `EIO`) returned to the application. +1. 
Check for cloud error fields: + - If you see `EREMOTEIO` or cloud-related failures, identify the specific cloud error fields in the logs: + - `provider` + - `provider_code` + - `provider_http_status` + - `provider_request_id` + + :::note + The field `error_code` indicates Fusion's internal categorization of the cloud error normalized across providers (e.g., `ResourceNotFound`, `Forbidden`, `RateLimited`). + ::: + +1. Identify the mapped internal error: + - The FUSE status code maps back to either a cloud error category or a specific internal error (e.g., `EACCES` indicates an authentication problem, `EREMOTEIO` indicates a cloud backend issue). Check the Fusion logs for more details on the error that triggered the FUSE status code (see [Understanding Fusion logs](#understanding-fusion-logs)). + +:::tip +Enable `debug` logging to for the full log: + +```bash +export FUSION_LOG_LEVEL=debug +``` + +::: + +## Exit codes + +Fusion binaries return specific exit codes to indicate the outcome of execution. + +:::tip +For exit codes `175` and `176`, see [Fusion Snapshots](./fusion-snapshots.md). +::: + +### Fusion binary + +| Exit code | Constant | Description | +|-----------|----------|-------------| +| `0` | - | Success, normal completion. | +| `1` | - | Fatal error during startup (via `log.Fatal()`). | +| `127` | - | Command not found (`.command.sh` missing). Triggers automatic retry up to `FUSION_MAX_MOUNT_RETRIES` times. | +| `174` | `ErrorExitCode` | Fusion I/O error, application-level input/output error. | + +:::note +`log.Fatal()` calls during startup produce exit code `1`. See [Fatal error messages](#fatal-error-messages) for the specific messages that trigger this exit. +::: + +The `sysexits.h` standard uses exit code 74 for "input/output error" and reserves 150-199 for application use. In Fusion's context, 174 means "application input/output error". 
+ +| Scenario | Log cue | Suggested next step | +|----------|----------------|---------------------| +| Failed to start FUSE process in background | `on FUSE process` | Check FUSE/kernel support. Verify `/dev/fuse` exists. | +| Failed to send SIGTERM to FUSE process | `on FUSE sigterm send` | Check kernel logs (`dmesg`) for crashed processes. | +| Failed to wait for FUSE process termination | `on FUSE stop wait` | Check for zombie processes. Review kernel signal handling. | +| Error during filesystem shutdown | `on file system shutdown` | Check Fusion logs for pending upload errors. See [Understanding Fusion logs](#understanding-fusion-logs). | +| Error during filesystem unmount | `on file system unmount` | Run `fusermount -u /fusion` or `umount -l /fusion` manually. | +| Failed read/write path validation | `check-rw` or `check-ro` | Verify cloud credentials and bucket permissions. | + +### GPU tracer binary + +| Exit code | Meaning | When | +|-----------|---------|------| +| `0` | Success | Normal completion (GPU detected or not) | +| `1` | Error | Failed to start GPU monitoring | +| `2` | Invalid input | Missing PID, invalid PID format, or PID `<= 0` | + +## FUSE status codes + +Fusion maps internal errors to standard FUSE status codes returned to the operating system. These are the [errno](https://man7.org/linux/man-pages/man3/errno.3.html) values applications receive when filesystem operations fail. + +:::note +For a complete list of errno values and their meanings, see the [Linux errno man page](https://man7.org/linux/man-pages/man3/errno.3.html) or run `errno -l` on a Linux system. 
+:::
+
+### Returned status codes
+
+Fusion's filesystem implementation actively returns these status codes:
+
+| FUSE status | Errno | Description | Common causes in Fusion |
+|------------------|-------|---------------------------|-------------------------|
+| `fuse.OK` | 0 | Success | Operation completed successfully |
+| `fuse.ENOENT` | 2 | No such file or directory | File/entry not found in cache or remote store; cloud provider ResourceNotFound/ContainerNotFound errors |
+| `fuse.EINTR` | 4 | Interrupted system call | Context cancelled |
+| `fuse.EIO` | 5 | I/O error | General I/O errors, internal failures, remote store errors, unknown non-cloud errors |
+| `fuse.EACCES` | 13 | Permission denied | Write attempt to read-only path; cloud provider Unauthenticated/InvalidCredentials/Forbidden/AccountError errors |
+| `fuse.EBUSY` | 16 | Device or resource busy | Cloud provider RateLimited/Busy/ResourceArchived errors |
+| `fuse.EEXIST` | 17 | File exists | Cloud provider Conflict errors (e.g., resource already exists) |
+| `fuse.EINVAL` | 22 | Invalid argument | Invalid parameters (e.g., readlink on non-symlink) |
+| `fuse.EROFS` | 30 | Read-only file system | Attempt to modify read-only object |
+| `fuse.ERANGE` | 34 | Result too large | Buffer too small for xattr value |
+| `fuse.ENOSYS` | 38 | Function not implemented | Operation not wired in Fusion's FUSE layer |
+| `fuse.ENOATTR` | 93 | No such attribute | Extended attribute not found |
+| `fuse.ENOTSUP` | 95 | Operation not supported | Operation explicitly rejected (e.g., hard links) |
+| `fuse.ETIMEDOUT` | 110 | Connection timed out | Context deadline exceeded |
+| `fuse.EREMOTEIO` | 121 | Remote I/O error | Cloud provider errors (QuotaExceeded, unknown cloud errors) |

From 47e0a48181b4d6e3f4dac1b6a5cb618e371acd4a Mon Sep 17 00:00:00 2001
From: Llewellyn vd Berg <113503285+llewellyn-sl@users.noreply.github.com>
Date: Wed, 14 Jan 2026 11:11:36 +0200
Subject: [PATCH 3/9] Remove related docs links from page bottom

---
 fusion_docs/troubleshooting/error-codes-exit-messages.md | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md
index 215cc4c1e..5e9bf9b6a 100644
--- a/fusion_docs/troubleshooting/error-codes-exit-messages.md
+++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md
@@ -436,9 +436,3 @@ case $exit_code in
 ;;
 esac
 ```
-
-## Related documentation
-
-- [Fusion overview](https://docs.seqera.io/fusion/)
-- [Fusion configuration](https://docs.seqera.io/fusion/configuration/)
-- [Fusion Snapshots](./fusion-snapshots.md)

From 2263543c7ef9dd49a6316fa27bca5b8d30ea120e Mon Sep 17 00:00:00 2001
From: Llewellyn vd Berg <113503285+llewellyn-sl@users.noreply.github.com>
Date: Wed, 14 Jan 2026 11:12:19 +0200
Subject: [PATCH 4/9] =?UTF-8?q?Delete=20first=20draft=20=E2=80=94=20change?=
 =?UTF-8?q?s=20incorporated?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../error-codes-exit-messages_old.md | 151 ------------------
 1 file changed, 151 deletions(-)
 delete mode 100644 fusion_docs/troubleshooting/error-codes-exit-messages_old.md

diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages_old.md b/fusion_docs/troubleshooting/error-codes-exit-messages_old.md
deleted file mode 100644
index 1a18e7e32..000000000
--- a/fusion_docs/troubleshooting/error-codes-exit-messages_old.md
+++ /dev/null
@@ -1,151 +0,0 @@
----
-title: Error codes and exit messages
-description: "Reference for Fusion error codes, exit codes, and error messages"
-date created: "2025-01-12"
-last updated: "2025-01-12"
-tags: [errors, error-codes, exit-codes, fuse, logging, fusion]
----
-
-This page describes Fusion's error reporting system, including exit codes, FUSE status codes (errno values), cloud provider error categories, and internal error types.
- -## Architectural limitations - -Fusion plays two critical roles in the execution of a Nextflow task. Each impacts error reporting differently: - -- **As a filesystem**: Fusion operates as a FUSE filesystem to provide transparent access to cloud storage. When processes execute I/O operations (e.g., `open()`, `read()`, `write()`, `unlink()`), they interact with Fusion through the standard POSIX filesystem interface. - -- **As a container entrypoint**: Fusion acts as the container's entrypoint, wrapping the execution of the Nextflow task. When a container starts, Fusion initializes first, mounts the filesystem, and then launches the actual task command. Upon task completion, both Fusion and the task must communicate their status through a single exit code, which is constrained to 256 possible values (`0`-`255`). This creates a ambiguity as Fusion reports its own failures using specific codes (e.g. `174`), but the task process can return any value in this range. Consequently, when a container exits with a specific code, determining whether the failure originated from Fusion or the task is impossible without examining the logs. - -These architectural roles impose constraints on how Fusion can communicate operation errors as well as its own failures: - -- **Limitations in POSIX error codes**: As a filesystem, Fusion communicates I/O operation failures only through standard POSIX error codes (`ENOENT`, `EACCES`, `EIO`, etc.). The filesystem interface provides no mechanism to return rich error context—it cannot explain why an operation failed beyond a generic error code. As a result, Fusion cannot tell the user process through the filesystem interface and returns only a generic I/O error (`EIO`). - -- **User process controls output**: When Fusion returns a POSIX error code, the process (not Fusion) determines what to display to the user. Fusion cannot control this output. 
- -- **Ambiguity of exit codes**: As the container entrypoint, the container's final exit code can originate from either Fusion itself (e.g., exit code `174`) or from the task's actual command. When a container exits with a failure code, there's no immediate way to determine the source of the failure without examining logs. - -- **Mixed log outputs**: Fusion also emits errors to stdout, where they typically mix with the task output. They are indistinguishable from a task's own errors when your review the task's logs. - -## Error paths - -Fusion is a FUSE filesystem that bridges applications and cloud object stores. Errors originate from multiple layers and propagate through the filesystem components in three main paths: - -1. **Cloud → Storage Backend → FUSE Layer → Kernel → Application** - - - Storage backends catch and normalize cloud errors (network timeouts, auth failures, rate limits) using the `clouderr` package - - Storage backends return normalized cloud errors (with provider-agnostic categories) or internal errors (`ErrNotFound`, `ErrReadOnly`, etc.) - - The FUSE layer maps both cloud errors and internal errors to FUSE status codes (`ENOENT`, `EACCES`, `EREMOTEIO`, `EIO`) - - The kernel translates FUSE status to errno values for the application - - Fusion logs cloud errors with structured details (provider, error code, HTTP status, request ID) - -1. **Failures during start up/shut down → Exit Code** - - - Startup: Configuration errors, missing credentials, or mount failures terminate Fusion immediately - - Shutdown: Async uploads or consolidation of pending operations - - Failures surface as exit code `174` (Fusion I/O error) or `1` (fatal error) - -1. 
**Background Operations → Logs** - - - Async uploads during normal operation, cache eviction, and snapshot operations log errors but may not surface them to applications - - Errors are reported in Fusion (see [Understanding Fusion logs](#understanding-fusion-logs)) - -## Triaging errors - -When troubleshooting Fusion errors: - -1. Check the exit code: - - Check the process exit code (`$?`) to understand if Fusion terminated normally (`0`), encountered an I/O error (`174`), or had a command issue (`127`). -1. Look at FUSE status in the logs: - - If a filesystem operation failed, use the logs to identify the FUSE status code (e.g., `ENOENT`, `EREMOTEIO`, `EIO`) returned to the application. -1. Check for cloud error fields: - - If you see `EREMOTEIO` or cloud-related failures, identify the specific cloud error fields in the logs: - - `provider` - - `provider_code` - - `provider_http_status` - - `provider_request_id` - - :::note - The field `error_code` indicates Fusion's internal categorization of the cloud error normalized across providers (e.g., `ResourceNotFound`, `Forbidden`, `RateLimited`). - ::: - -1. Identify the mapped internal error: - - The FUSE status code maps back to either a cloud error category or a specific internal error (e.g., `EACCES` indicates an authentication problem, `EREMOTEIO` indicates a cloud backend issue). Check the Fusion logs for more details on the error that triggered the FUSE status code (see [Understanding Fusion logs](#understanding-fusion-logs)). - -:::tip -Enable `debug` logging to for the full log: - -```bash -export FUSION_LOG_LEVEL=debug -``` - -::: - -## Exit codes - -Fusion binaries return specific exit codes to indicate the outcome of execution. - -:::tip -For exit codes `175` and `176`, see [Fusion Snapshots](./fusion-snapshots.md). -::: - -### Fusion binary - -| Exit code | Constant | Description | -|-----------|----------|-------------| -| `0` | - | Success, normal completion. 
| -| `1` | - | Fatal error during startup (via `log.Fatal()`). | -| `127` | - | Command not found (`.command.sh` missing). Triggers automatic retry up to `FUSION_MAX_MOUNT_RETRIES` times. | -| `174` | `ErrorExitCode` | Fusion I/O error, application-level input/output error. | - -:::note -`log.Fatal()` calls during startup produce exit code `1`. See [Fatal error messages](#fatal-error-messages) for the specific messages that trigger this exit. -::: - -The `sysexits.h` standard uses exit code 74 for "input/output error" and reserves 150-199 for application use. In Fusion's context, 174 means "application input/output error". - -| Scenario | Log cue | Suggested next step | -|----------|----------------|---------------------| -| Failed to start FUSE process in background | `on FUSE process` | Check FUSE/kernel support. Verify `/dev/fuse` exists. | -| Failed to send SIGTERM to FUSE process | `on FUSE sigterm send` | Check kernel logs (`dmesg`) for crashed processes. | -| Failed to wait for FUSE process termination | `on FUSE stop wait` | Check for zombie processes. Review kernel signal handling. | -| Error during filesystem shutdown | `on file system shutdown` | Check Fusion logs for pending upload errors. See [Understanding Fusion logs](#understanding-fusion-logs). | -| Error during filesystem unmount | `on file system unmount` | Run `fusermount -u /fusion` or `umount -l /fusion` manually. | -| Failed read/write path validation | `check-rw` or `check-ro` | Verify cloud credentials and bucket permissions. | - -### GPU tracer binary - -| Exit code | Meaning | When | -|-----------|---------|------| -| `0` | Success | Normal completion (GPU detected or not) | -| `1` | Error | Failed to start GPU monitoring | -| `2` | Invalid input | Missing PID, invalid PID format, or PID `<= 0` | - -## FUSE status codes - -Fusion maps internal errors to standard FUSE status codes returned to the operating system. 
 These are the [errno](https://man7.org/linux/man-pages/man3/errno.3.html) values applications receive when filesystem operations fail.
-
-:::note
-For a complete list of errno values and their meanings, see the [Linux errno man page](https://man7.org/linux/man-pages/man3/errno.3.html) or run `errno -l` on a Linux system.
-:::
-
-### Returned status codes
-
-Fusion's filesystem implementation actively returns these status codes:
-
-| FUSE status | Errno | Description | Common causes in Fusion |
-|------------------|-------|---------------------------|-------------------------|
-| `fuse.OK` | 0 | Success | Operation completed successfully |
-| `fuse.ENOENT` | 2 | No such file or directory | File/entry not found in cache or remote store; cloud provider ResourceNotFound/ContainerNotFound errors |
-| `fuse.EINTR` | 4 | Interrupted system call | Context cancelled |
-| `fuse.EIO` | 5 | I/O error | General I/O errors, internal failures, remote store errors, unknown non-cloud errors |
-| `fuse.EACCES` | 13 | Permission denied | Write attempt to read-only path; cloud provider Unauthenticated/InvalidCredentials/Forbidden/AccountError errors |
-| `fuse.EBUSY` | 16 | Device or resource busy | Cloud provider RateLimited/Busy/ResourceArchived errors |
-| `fuse.EEXIST` | 17 | File exists | Cloud provider Conflict errors (e.g., resource already exists) |
-| `fuse.EINVAL` | 22 | Invalid argument | Invalid parameters (e.g., readlink on non-symlink) |
-| `fuse.EROFS` | 30 | Read-only file system | Attempt to modify read-only object |
-| `fuse.ERANGE` | 34 | Result too large | Buffer too small for xattr value |
-| `fuse.ENOSYS` | 38 | Function not implemented | Operation not wired in Fusion's FUSE layer |
-| `fuse.ENOATTR` | 93 | No such attribute | Extended attribute not found |
-| `fuse.ENOTSUP` | 95 | Operation not supported | Operation explicitly rejected (e.g., hard links) |
-| `fuse.ETIMEDOUT` | 110 | Connection timed out | Context deadline exceeded |
-| `fuse.EREMOTEIO` | 121 | Remote I/O error | Cloud provider errors (QuotaExceeded, unknown cloud errors) |

From c6a91f6afd0ebcf34069e2da6b0ed2e6d67fff29 Mon Sep 17 00:00:00 2001
From: Justine Geffen
Date: Tue, 20 Jan 2026 15:51:26 +0200
Subject: [PATCH 5/9] Update error codes documentation and last updated date

Updated last updated date and removed architectural limitations section.

Signed-off-by: Justine Geffen
---
 .../error-codes-exit-messages.md | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md
index 5e9bf9b6a..43f4db311 100644
--- a/fusion_docs/troubleshooting/error-codes-exit-messages.md
+++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md
@@ -2,30 +2,12 @@
 title: Error codes and exit messages
 description: "Reference for Fusion error codes, exit codes, and error messages"
 date created: "2025-01-12"
-last updated: "2025-01-12"
+last updated: "2025-01-20"
 tags: [errors, error-codes, exit-codes, fuse, logging, fusion]
 ---

 This page describes Fusion's error reporting system, including exit codes, FUSE status codes (errno values), cloud provider error categories, and internal error types.

-## Architectural limitations
-
-Fusion plays two critical roles in the execution of a Nextflow task. Each impacts error reporting differently:
-
-- **As a filesystem**: Fusion operates as a FUSE filesystem to provide transparent access to cloud storage. When processes execute I/O operations (e.g., `open()`, `read()`, `write()`, `unlink()`), they interact with Fusion through the standard POSIX filesystem interface.
-
-- **As a container entrypoint**: Fusion acts as the container's entrypoint, wrapping the execution of the Nextflow task. When a container starts, Fusion initializes first, mounts the filesystem, and then launches the actual task command. Upon task completion, both Fusion and the task must communicate their status through a single exit code, which is constrained to 256 possible values (`0`-`255`). This creates an ambiguity as Fusion reports its own failures using specific codes (e.g. `174`), but the task process can return any value in this range. Consequently, when a container exits with a specific code, determining whether the failure originated from Fusion or the task is impossible without examining the logs.
-
-These architectural roles impose constraints on how Fusion can communicate operation errors as well as its own failures:
-
-- **Limitations in POSIX error codes**: As a filesystem, Fusion communicates I/O operation failures only through standard POSIX error codes (`ENOENT`, `EACCES`, `EIO`, etc.). The filesystem interface provides no mechanism to return rich error context—it cannot explain why an operation failed beyond a generic error code. As a result, Fusion cannot provide detailed error information to the user process through the filesystem interface and returns only a generic I/O error (`EIO`).
-
-- **User process controls output**: When Fusion returns a POSIX error code, the process (not Fusion) determines what to display to the user. Fusion cannot control this output.
-
-- **Ambiguity of exit codes**: As the container entrypoint, the container's final exit code can originate from either Fusion itself (e.g., exit code `174`) or from the task's actual command. When a container exits with a failure code, there's no immediate way to determine the source of the failure without examining logs.
-
-- **Mixed log outputs**: Fusion also emits errors to stdout, where they typically mix with the task output. They are indistinguishable from a task's own errors when you review the task's logs.
-
 ## Error paths

 Fusion is a FUSE filesystem that bridges applications and cloud object stores. Errors originate from multiple layers and propagate through the filesystem components in three main paths:

From f087c26ff6e5397a124193a92cff4e85f0ca883f Mon Sep 17 00:00:00 2001
From: Justine Geffen
Date: Tue, 20 Jan 2026 15:51:37 +0200
Subject: [PATCH 6/9] Apply suggestion from @justinegeffen

Signed-off-by: Justine Geffen
---
 fusion_docs/troubleshooting/error-codes-exit-messages.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md
index 43f4db311..05f3e36d8 100644
--- a/fusion_docs/troubleshooting/error-codes-exit-messages.md
+++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md
@@ -12,7 +12,7 @@ This page describes Fusion's error reporting system, including exit codes, FUSE

 Fusion is a FUSE filesystem that bridges applications and cloud object stores. Errors originate from multiple layers and propagate through the filesystem components in three main paths:

-1. **Cloud → Storage Backend → FUSE Layer → Kernel → Application**
+1. **Cloud > Storage Backend > FUSE Layer > Kernel > Application**

    - Storage backends catch and normalize cloud errors (network timeouts, auth failures, rate limits) using the `clouderr` package
    - Storage backends return normalized cloud errors (with provider-agnostic categories) or internal errors (`ErrNotFound`, `ErrReadOnly`, etc.)

From b13d64c5cdc8c5647131f0745420a7d2a166b41c Mon Sep 17 00:00:00 2001
From: Justine Geffen
Date: Wed, 21 Jan 2026 00:50:32 +0200
Subject: [PATCH 7/9] Update fusion_docs/troubleshooting/error-codes-exit-messages.md

Co-authored-by: Alberto Miranda
Signed-off-by: Justine Geffen
---
 fusion_docs/troubleshooting/error-codes-exit-messages.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md
index 05f3e36d8..3ded1461c 100644
--- a/fusion_docs/troubleshooting/error-codes-exit-messages.md
+++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md
@@ -14,7 +14,7 @@ Fusion is a FUSE filesystem that bridges applications and cloud object stores. E

 1. **Cloud > Storage Backend > FUSE Layer > Kernel > Application**

-   - Storage backends catch and normalize cloud errors (network timeouts, auth failures, rate limits) using the `clouderr` package
+   - Errors from the cloud provider (e.g. network timeouts, auth failures, rate limits) are captured by the Storage backend, which normalizes them into provider-agnostic categories (see #cloud-provider-error-categories).
    - Storage backends return normalized cloud errors (with provider-agnostic categories) or internal errors (`ErrNotFound`, `ErrReadOnly`, etc.)
    - The FUSE layer maps both cloud errors and internal errors to FUSE status codes (`ENOENT`, `EACCES`, `EREMOTEIO`, `EIO`)
    - The kernel translates FUSE status to errno values for the application

From a2212754724a66abb0b08bc49a3d8f8f2226b65c Mon Sep 17 00:00:00 2001
From: Justine Geffen
Date: Wed, 21 Jan 2026 00:53:38 +0200
Subject: [PATCH 8/9] Apply suggestion from @alberto-miranda

Co-authored-by: Alberto Miranda
Signed-off-by: Justine Geffen
---
 fusion_docs/troubleshooting/error-codes-exit-messages.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md
index 3ded1461c..e8aeeaa82 100644
--- a/fusion_docs/troubleshooting/error-codes-exit-messages.md
+++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md
@@ -10,7 +10,7 @@ This page describes Fusion's error reporting system, including exit codes, FUSE

 ## Error paths

-Fusion is a FUSE filesystem that bridges applications and cloud object stores. Errors originate from multiple layers and propagate through the filesystem components in three main paths:
+Fusion is a FUSE filesystem that bridges applications and cloud object stores. As such, errors may originate from multiple layers, but will propagate through the filesystem components following three major paths:

 1. **Cloud > Storage Backend > FUSE Layer > Kernel > Application**

From 911dd2f7cdf6f31e7ffa56c4da6ce3c20a0a6e1e Mon Sep 17 00:00:00 2001
From: Justine Geffen
Date: Wed, 21 Jan 2026 13:36:22 +0200
Subject: [PATCH 9/9] Update fusion_docs/troubleshooting/error-codes-exit-messages.md

Co-authored-by: Cristian Ramon-Cortes
Signed-off-by: Justine Geffen
---
 fusion_docs/troubleshooting/error-codes-exit-messages.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fusion_docs/troubleshooting/error-codes-exit-messages.md b/fusion_docs/troubleshooting/error-codes-exit-messages.md
index e8aeeaa82..8c86db2e7 100644
--- a/fusion_docs/troubleshooting/error-codes-exit-messages.md
+++ b/fusion_docs/troubleshooting/error-codes-exit-messages.md
@@ -16,7 +16,7 @@ Fusion is a FUSE filesystem that bridges applications and cloud object stores. A

    - Errors from the cloud provider (e.g. network timeouts, auth failures, rate limits) are captured by the Storage backend, which normalizes them into provider-agnostic categories (see #cloud-provider-error-categories).
    - Storage backends return normalized cloud errors (with provider-agnostic categories) or internal errors (`ErrNotFound`, `ErrReadOnly`, etc.)
-   - The FUSE layer maps both cloud errors and internal errors to FUSE status codes (`ENOENT`, `EACCES`, `EREMOTEIO`, `EIO`)
+   - The FUSE layer maps both cloud errors and internal errors to FUSE status codes (e.g., `ENOENT`, `EACCES`, `EREMOTEIO`, `EIO`)
    - The kernel translates FUSE status to errno values for the application
    - Fusion logs cloud errors with structured details (provider, error code, HTTP status, request ID)
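
Reviewer note: the first error path in this series (cloud error, normalized category, FUSE status, errno) can be illustrated with a small lookup. The sketch below is not Fusion's code. The category names and Linux errno numbers are copied from the "Common causes in Fusion" column of the status-code table in this series, and `to_errno` and `CATEGORY_TO_ERRNO` are hypothetical names invented for the illustration.

```python
import os

# Linux errno numbers as listed in the status-code table in this series.
ENOENT, EIO, EACCES, EBUSY, EEXIST, EREMOTEIO = 2, 5, 13, 16, 17, 121

# Hypothetical mapping: category names are taken from the table's wording,
# not from Fusion's source, so the real identifiers may differ.
CATEGORY_TO_ERRNO = {
    "ResourceNotFound": ENOENT,
    "ContainerNotFound": ENOENT,
    "Unauthenticated": EACCES,
    "InvalidCredentials": EACCES,
    "Forbidden": EACCES,
    "AccountError": EACCES,
    "RateLimited": EBUSY,
    "Busy": EBUSY,
    "ResourceArchived": EBUSY,
    "Conflict": EEXIST,
    "QuotaExceeded": EREMOTEIO,
}


def to_errno(category, is_cloud_error=True):
    """Return the errno an application would see for a normalized category."""
    if category in CATEGORY_TO_ERRNO:
        return CATEGORY_TO_ERRNO[category]
    # Per the table: unknown cloud errors surface as EREMOTEIO,
    # unknown non-cloud errors as the generic EIO.
    return EREMOTEIO if is_cloud_error else EIO


if __name__ == "__main__":
    for name in ("Forbidden", "RateLimited", "SomethingNew"):
        code = to_errno(name)
        # os.strerror gives the message a typical tool would print for this errno.
        print(name, code, os.strerror(code))
```

The lookup is many-to-one, so the application cannot recover the original category from the errno alone. That detail is only available in the structured log fields (provider, error code, HTTP status, request ID).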