-
Notifications
You must be signed in to change notification settings - Fork 15
feature: Add diagnose command #336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@microsoft-github-policy-service agree company="Microsoft" |
|
/AzurePipelines run [GITHUB]-trident-pr |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/AzurePipelines run [GITHUB]-trident-pr |
|
Commenter does not have sufficient privileges for PR 336 in repo microsoft/trident |
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/AzurePipelines run [GITHUB]-trident-pr |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
i think it'd be great to invoke trident/tools/storm/helpers/ab_update.go Line 395 in ce9da55
something like: out, err := utils.InvokeTrident(h.args.Env, client, h.args.EnvVars, "diagnose")
if err != nil {
return fmt.Errorf("failed to invoke Trident diagnose: %w", err)
}
logrus.Infof("Trident service is in expected state")
return nil, nilthen we'd get at least a little testing for it. it'd be even better to scp the diagnostics to the artifacts folder or even validate the contents of the diagnostics, but i'd be happy with something simple first :) |
|
/AzurePipelines run [GITHUB]-trident-pr |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/AzurePipelines run [GITHUB]-trident-pr |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
would be nice to have a docs/Explanation/Diagnostics.md file, maybe it references https://microsoft.github.io/trident/docs/How-To-Guides/View-Trident's-Background-Log to explain some of the collected diagnostics? maybe suggest using
also add note about diagnostics here: https://microsoft.github.io/trident/docs/Trident/How-Do-I-Interact-With-Trident |
c52d608 to
163a337
Compare
|
@bfjelds added an e2e test and docs. Promoting from draft PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a diagnose command to Trident that generates comprehensive diagnostic bundles for troubleshooting. The command collects system information, logs, metrics, and datastores into a compressed tarball (.tar.zst) that can be shared for support purposes.
Key Changes
- New
trident diagnoseCLI command that creates compressed diagnostic bundles containing system info, logs, metrics, and datastore files - End-to-end test in Storm framework that validates the diagnostic bundle contents
- Comprehensive documentation including a new How-To guide and updates to tutorials
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/trident/src/diagnostics.rs | Core diagnostics implementation including report collection, bundle creation, and support for historical logs |
| crates/trident/src/cli.rs | Added Diagnose command to CLI with output path parameter |
| crates/trident/src/main.rs | Integrated diagnose command into main execution flow |
| crates/trident/src/lib.rs | Added public diagnose method and made diagnostics module and TEMPORARY_DATASTORE_PATH public |
| crates/trident/src/logging/tracestream.rs | Made PLATFORM_INFO publicly accessible for diagnostics collection |
| crates/trident_api/src/error.rs | Added DiagnosticBundleGeneration error variant |
| tools/storm/utils/ssh/client/client.go | Added CopyRemoteFileToLocal function for downloading diagnostic bundles in tests |
| tools/storm/helpers/ab_update.go | Added checkDiagnostics test case that validates bundle contents and structure |
| docs/How-To-Guides/Diagnostics.md | Comprehensive guide explaining bundle generation, structure, and use cases |
| docs/Tutorials/Trident-Hello-World.md | Added troubleshooting section with diagnose command example |
| docs/Tutorials/Performing-an-AB-Update.md | Added troubleshooting section with diagnose command example |
| docs/Reference/Trident-CLI.md | Added complete CLI reference for diagnose command |
| docs/Trident/How-Do-I-Interact-With-Trident.md | Added diagnose command to the list of available CLI commands |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
docs/How-To-Guides/Diagnostics.md
Outdated
| ``` | ||
| trident-diagnostics/ | ||
| ├── report.json # Diagnostic report with metadata | ||
| └── logs/ | ||
| ├── trident-full.log # Current Trident execution log | ||
| ├── trident-metrics.jsonl # Current Trident metrics | ||
| ├── historical/ # Logs from past servicing | ||
| │ ├── trident-<servicing_state>-<timestamp>.log | ||
| │ ├── trident-metrics-<servicing_state>-<timestamp>.log | ||
| │ └── ... | ||
| ├── datastore.sqlite # Default datastore | ||
| ├── datastore-tmp.sqlite # Temporary datastore (if applicable) | ||
| └── datastore-configured.sqlite # Configured datastore (if applicable) | ||
| ``` |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bundle structure documentation shows datastore files as being inside the logs/ directory, but the implementation (in diagnostics.rs lines 276-292) places them at the root level of the bundle. The structure should be:
trident-diagnostics/
├── report.json
├── datastore.sqlite
├── datastore-tmp.sqlite
├── datastore-configured.sqlite
└── logs/
├── trident-full.log
├── trident-metrics.jsonl
└── historical/
└── ...
| ) -> Result<PathBuf, Error> { | ||
| let mut collected_files = Vec::new(); | ||
| let file = osutils::files::create_file(output_path)?; | ||
| let encoder = zstd::Encoder::new(file, 0)?; |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using compression level 0 for zstd means using the default compression level (which is typically level 3). Consider adding a comment to clarify this is intentional, or explicitly setting a specific level (e.g., 3 for default) for better code clarity. Alternatively, consider using a higher level like 10 for better compression at the cost of slightly slower compression time, since diagnostics bundles are typically created infrequently and smaller file sizes are beneficial for transmission.
| let encoder = zstd::Encoder::new(file, 0)?; | |
| // Use compression level 10 for zstd: higher compression for diagnostics bundles, which are created infrequently. | |
| let encoder = zstd::Encoder::new(file, 10)?; |
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
crates/trident/src/diagnostics.rs
Outdated
| if let Ok(entries) = fs::read_dir(log_dir) { | ||
| for entry in entries.flatten() { | ||
| if let Ok(file_name) = entry.file_name().into_string() { | ||
| if file_name.starts_with("trident-") && file_name.ends_with(".log") { |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file extension filter only captures .log files, but historical metrics files use the .jsonl extension. This means historical metrics files (e.g., trident-metrics-<timestamp>.jsonl) will not be collected. Update the condition to also include .jsonl files: file_name.ends_with(\".log\") || file_name.ends_with(\".jsonl\")
| if file_name.starts_with("trident-") && file_name.ends_with(".log") { | |
| if file_name.starts_with("trident-") && (file_name.ends_with(".log") || file_name.ends_with(".jsonl")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ``` | ||
| Options: | ||
| -o, --output <OUTPUT> | ||
| Path where the support bundle will be saved | ||
| -v, --verbosity <VERBOSITY> | ||
| Logging verbosity [OFF, ERROR, WARN, INFO, DEBUG, TRACE] | ||
| [default: DEBUG] | ||
| ``` |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CLI reference documentation is incomplete. The --full and --selinux flags are defined in the CLI code but not documented in this reference. These optional flags should be documented in the "Argument summary" section and have their own "Argument Details" subsections explaining what they do.
| /// This command collects diagnostic information including logs, and | ||
| /// system information. The output is packaged as a compressed tarball |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment has inconsistent spacing. There's an extra space before "system information" on line 147. It should be "including logs, and system information" or better yet, "including logs and system information" (removing the comma).
crates/trident/src/diagnostics.rs
Outdated
| fn get_virtualization_info() -> (bool, String) { | ||
| if let Ok(content) = fs::read_to_string(DMI_SYS_VENDOR_FILE) { | ||
| let vendor = content.trim().to_lowercase(); | ||
| if vendor.contains("qemu") { | ||
| return (true, "qemu".to_string()); | ||
| } | ||
| if let Ok(product) = fs::read_to_string(DMI_PRODUCT_NAME_FILE) { | ||
| let product = product.trim().to_lowercase(); | ||
| if vendor.contains("microsoft corporation") && product.contains("virtual machine") { | ||
| return (true, "hyperv".to_string()); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| (false, "none detected".to_string()) | ||
| } |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The virtualization detection logic is incomplete. It only checks for QEMU and Hyper-V, but common virtualization platforms like VMware, VirtualBox, Xen, and KVM (when not using QEMU) are not detected. Consider adding detection for additional platforms or using a more comprehensive method like checking /sys/class/dmi/id/product_name for "VMware", "VirtualBox", "Xen", etc.
| /// Path where the support bundle will be saved | ||
| #[clap(short, long)] | ||
| output: PathBuf, | ||
| /// Include full system journal and dmesg output |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions "Include full system journal and dmesg output" but the implementation only collects the journal via collect_full_journal(). There's no dmesg collection in the code. Either update the comment to remove "and dmesg output" or add dmesg collection to the implementation.
| func CopyRemoteFileToLocal(client *ssh.Client, remotePath string, localFilePath string) error { | ||
| sftpClient, err := sftp.NewSftpSudoClient(client) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to create SFTP client: %w", err) | ||
| } | ||
| defer sftpClient.Close() | ||
|
|
||
| remoteFile, err := sftpClient.Open(remotePath) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to open remote file %s: %w", remotePath, err) | ||
| } | ||
| defer remoteFile.Close() | ||
|
|
||
| localFile, err := os.Create(localFilePath) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to create local file: %w", err) | ||
| } | ||
| defer localFile.Close() | ||
|
|
||
| if _, err := io.Copy(localFile, remoteFile); err != nil { | ||
| return fmt.Errorf("failed to copy file to local system: %w", err) | ||
| } | ||
|
|
||
| return nil |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the io.Copy operation fails, the partially written local file is not cleaned up. Consider removing the local file on error to avoid leaving incomplete files on disk. You could either manually delete the file in an error path, or track the success and defer a conditional cleanup.
| bundlePath := "/tmp/bundle" | ||
| if env == stormenv.TridentEnvironmentContainer { | ||
| bundlePath = "/host/tmp/bundle" | ||
| } |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bundle path is missing the .tar.zst file extension. The diagnose command in the code expects an output path that includes the extension (as shown in documentation and Rust tests), but this code passes /tmp/bundle without an extension. This will likely create a file without the standard extension, which could confuse users and tools. Consider changing to /tmp/bundle.tar.zst.
| ```bash | ||
| sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst | ||
| ``` |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These examples recommend running sudo trident diagnose with --output /tmp/..., but the diagnostics bundle includes sensitive data (datastores, system journals, SELinux logs, configuration) and is written with default file permissions, so on a multi-user system another local user may be able to read the bundle directly from /tmp. To avoid unintentionally exposing system data locally, consider recommending an output directory that is only accessible to root and/or documenting that the bundle file’s permissions should be restricted (for example by using a non-world-readable directory or a restrictive umask).
| ```bash | ||
| ssh -i $HOME/.ssh/id_rsa tutorial-user@$TARGET_MACHINE_IP sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst | ||
| ``` |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tutorial suggests generating a diagnostics bundle with sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst, but the bundle contains sensitive data (Trident datastore, logs, system configuration) and is created with default file permissions, so any other local user may be able to read it from /tmp on a multi-user system. To reduce the risk of local data exposure, consider recommending an output path under a root-only directory (e.g., /root/...) or otherwise documenting that the bundle file should have restrictive permissions instead of being written in a world-accessible location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
crates/trident/src/diagnostics.rs
Outdated
| pub struct DiagnosticsReport { | ||
| /// Timestamp when the report was generated | ||
| pub timestamp: String, | ||
| /// Trident version | ||
| pub version: String, | ||
| /// Host description (VM/baremetal) | ||
| pub host_description: HostDescription, | ||
| /// Host status from the datastore | ||
| pub host_status: Option<HostStatus>, | ||
| /// Collected files metadata | ||
| pub collected_files: Option<Vec<FileMetadata>>, | ||
| /// Collection failures that occurred during diagnostics gathering | ||
| pub collection_failures: Vec<CollectionFailure>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i think none of these structs or fields need to be pub
crates/trident/src/diagnostics.rs
Outdated
| .as_ref() | ||
| .and_then(|hs| hs.spec.trident.datastore_path.parent()) | ||
| { | ||
| if let Ok(entries) = fs::read_dir(log_dir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no error handling here
🔍 Description
This PR adds a diagnose subcommand to Trident (#285).
The command creates a tarball containing:
🤔 Rationale
This PR aims to simplify troubleshooting by providing one command that aggregates relevant debugging info into a single tarball.
📝 Checks
📌 Follow-ups
TODO (before promoting from draft):
🗒️ Notes