
Conversation

@nbojanic

🔍 Description

This PR adds a diagnose subcommand to Trident (#285).

The command creates a tarball containing:

  • host status
  • whether Trident is running in a container
  • whether Trident is running in a VM
  • platform info (OS release, kernel version, CPUs, memory)
  • disk information (lsblk)
  • log files
  • metrics
  • datastores

🤔 Rationale

This PR aims to simplify troubleshooting by providing one command that aggregates relevant debugging info into a single tarball.

📝 Checks

📌 Follow-ups

TODO (before promoting from draft):

  • check on warnings in osutils crate
  • add an end-to-end test for trident diagnose

🗒️ Notes

@nbojanic
Author

@microsoft-github-policy-service agree company="Microsoft"

@frhuelsz
Contributor

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@nbojanic
Author

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Commenter does not have sufficient privileges for PR 336 in repo microsoft/trident

@bfjelds
Member

bfjelds commented Nov 17, 2025

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds
Member

bfjelds commented Nov 17, 2025

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds
Member

bfjelds commented Nov 18, 2025

i think it'd be great to invoke trident diagnose in the e2e ab-update tests here:

logrus.Infof("Trident service is in expected state")

something like:

_, err := utils.InvokeTrident(h.args.Env, client, h.args.EnvVars, "diagnose")
if err != nil {
    return nil, fmt.Errorf("failed to invoke Trident diagnose: %w", err)
}

logrus.Infof("Trident service is in expected state")
return nil, nil

then we'd get at least a little testing for it. it'd be even better to scp the diagnostics to the artifacts folder or even validate the contents of the diagnostics, but i'd be happy with something simple first :)

@bfjelds
Member

bfjelds commented Nov 18, 2025

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Britel
Collaborator

Britel commented Nov 18, 2025

/AzurePipelines run [GITHUB]-trident-pr

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds
Member

bfjelds commented Nov 19, 2025

would be nice to have a docs/Explanation/Diagnostics.md file, maybe it references https://microsoft.github.io/trident/docs/How-To-Guides/View-Trident's-Background-Log to explain some of the collected diagnostics?

maybe suggest using trident diagnostics for troubleshooting in these docs:

also add note about diagnostics here: https://microsoft.github.io/trident/docs/Trident/How-Do-I-Interact-With-Trident

@nbojanic nbojanic changed the title from "feature: Add diagnostics command" to "feature: Add diagnose command" on Nov 19, 2025
@nbojanic nbojanic force-pushed the user/nbojanic/diagnostics branch from c52d608 to 163a337 on November 20, 2025 03:05
@nbojanic
Author

@bfjelds added an e2e test and docs. Promoting from draft PR.

@nbojanic nbojanic marked this pull request as ready for review November 20, 2025 03:08
@nbojanic nbojanic requested a review from a team as a code owner November 20, 2025 03:08
Copilot AI review requested due to automatic review settings November 20, 2025 03:08
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds a diagnose command to Trident that generates comprehensive diagnostic bundles for troubleshooting. The command collects system information, logs, metrics, and datastores into a compressed tarball (.tar.zst) that can be shared for support purposes.

Key Changes

  • New trident diagnose CLI command that creates compressed diagnostic bundles containing system info, logs, metrics, and datastore files
  • End-to-end test in Storm framework that validates the diagnostic bundle contents
  • Comprehensive documentation including a new How-To guide and updates to tutorials

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| crates/trident/src/diagnostics.rs | Core diagnostics implementation including report collection, bundle creation, and support for historical logs |
| crates/trident/src/cli.rs | Added Diagnose command to CLI with output path parameter |
| crates/trident/src/main.rs | Integrated diagnose command into main execution flow |
| crates/trident/src/lib.rs | Added public diagnose method and made diagnostics module and TEMPORARY_DATASTORE_PATH public |
| crates/trident/src/logging/tracestream.rs | Made PLATFORM_INFO publicly accessible for diagnostics collection |
| crates/trident_api/src/error.rs | Added DiagnosticBundleGeneration error variant |
| tools/storm/utils/ssh/client/client.go | Added CopyRemoteFileToLocal function for downloading diagnostic bundles in tests |
| tools/storm/helpers/ab_update.go | Added checkDiagnostics test case that validates bundle contents and structure |
| docs/How-To-Guides/Diagnostics.md | Comprehensive guide explaining bundle generation, structure, and use cases |
| docs/Tutorials/Trident-Hello-World.md | Added troubleshooting section with diagnose command example |
| docs/Tutorials/Performing-an-AB-Update.md | Added troubleshooting section with diagnose command example |
| docs/Reference/Trident-CLI.md | Added complete CLI reference for diagnose command |
| docs/Trident/How-Do-I-Interact-With-Trident.md | Added diagnose command to the list of available CLI commands |


Comment on lines 43 to 56
```
trident-diagnostics/
├── report.json                     # Diagnostic report with metadata
└── logs/
    ├── trident-full.log            # Current Trident execution log
    ├── trident-metrics.jsonl       # Current Trident metrics
    ├── historical/                 # Logs from past servicing
    │   ├── trident-<servicing_state>-<timestamp>.log
    │   ├── trident-metrics-<servicing_state>-<timestamp>.log
    │   └── ...
    ├── datastore.sqlite            # Default datastore
    ├── datastore-tmp.sqlite        # Temporary datastore (if applicable)
    └── datastore-configured.sqlite # Configured datastore (if applicable)
```

Copilot AI Nov 20, 2025


The bundle structure documentation shows datastore files as being inside the logs/ directory, but the implementation (in diagnostics.rs lines 276-292) places them at the root level of the bundle. The structure should be:

trident-diagnostics/
├── report.json
├── datastore.sqlite
├── datastore-tmp.sqlite
├── datastore-configured.sqlite
└── logs/
    ├── trident-full.log
    ├── trident-metrics.jsonl
    └── historical/
        └── ...

) -> Result<PathBuf, Error> {
    let mut collected_files = Vec::new();
    let file = osutils::files::create_file(output_path)?;
    let encoder = zstd::Encoder::new(file, 0)?;

Copilot AI Nov 20, 2025


[nitpick] Using compression level 0 for zstd means using the default compression level (which is typically level 3). Consider adding a comment to clarify this is intentional, or explicitly setting a specific level (e.g., 3 for default) for better code clarity. Alternatively, consider using a higher level like 10 for better compression at the cost of slightly slower compression time, since diagnostics bundles are typically created infrequently and smaller file sizes are beneficial for transmission.

Suggested change
let encoder = zstd::Encoder::new(file, 0)?;
// Use compression level 10 for zstd: higher compression for diagnostics bundles, which are created infrequently.
let encoder = zstd::Encoder::new(file, 10)?;

@alejandro-microsoft
Contributor

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.



if let Ok(entries) = fs::read_dir(log_dir) {
    for entry in entries.flatten() {
        if let Ok(file_name) = entry.file_name().into_string() {
            if file_name.starts_with("trident-") && file_name.ends_with(".log") {

Copilot AI Nov 21, 2025


The file extension filter only captures .log files, but historical metrics files use the .jsonl extension. This means historical metrics files (e.g., trident-metrics-<timestamp>.jsonl) will not be collected. Update the condition to also include .jsonl files: file_name.ends_with(".log") || file_name.ends_with(".jsonl")

Suggested change
if file_name.starts_with("trident-") && file_name.ends_with(".log") {
if file_name.starts_with("trident-") && (file_name.ends_with(".log") || file_name.ends_with(".jsonl")) {

@frhuelsz frhuelsz marked this pull request as draft November 26, 2025 23:20
@nbojanic nbojanic marked this pull request as ready for review December 16, 2025 19:24
Copilot AI review requested due to automatic review settings December 16, 2025 19:24
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 8 comments.



Comment on lines 445 to 452
```
Options:
  -o, --output <OUTPUT>
          Path where the support bundle will be saved
  -v, --verbosity <VERBOSITY>
          Logging verbosity [OFF, ERROR, WARN, INFO, DEBUG, TRACE]
          [default: DEBUG]
```

Copilot AI Dec 16, 2025


The CLI reference documentation is incomplete. The --full and --selinux flags are defined in the CLI code but not documented in this reference. These optional flags should be documented in the "Argument summary" section and have their own "Argument Details" subsections explaining what they do.

Comment on lines +146 to +147
/// This command collects diagnostic information including logs, and
/// system information. The output is packaged as a compressed tarball

Copilot AI Dec 16, 2025


The comment has inconsistent spacing. There's an extra space before "system information" on line 147. It should be "including logs, and system information" or better yet, "including logs and system information" (removing the comma).

Comment on lines 275 to 290
fn get_virtualization_info() -> (bool, String) {
    if let Ok(content) = fs::read_to_string(DMI_SYS_VENDOR_FILE) {
        let vendor = content.trim().to_lowercase();
        if vendor.contains("qemu") {
            return (true, "qemu".to_string());
        }
        if let Ok(product) = fs::read_to_string(DMI_PRODUCT_NAME_FILE) {
            let product = product.trim().to_lowercase();
            if vendor.contains("microsoft corporation") && product.contains("virtual machine") {
                return (true, "hyperv".to_string());
            }
        }
    }

    (false, "none detected".to_string())
}

Copilot AI Dec 16, 2025


The virtualization detection logic is incomplete. It only checks for QEMU and Hyper-V, but common virtualization platforms like VMware, VirtualBox, Xen, and KVM (when not using QEMU) are not detected. Consider adding detection for additional platforms or using a more comprehensive method like checking /sys/class/dmi/id/product_name for "VMware", "VirtualBox", "Xen", etc.
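
For illustration, a minimal sketch of the broader DMI-based check suggested above. The constants mirror the ones quoted from diagnostics.rs, but the extra vendor/product strings (VMware, VirtualBox, Xen) are assumptions for the sketch, not values taken from this PR:

```rust
use std::fs;

// Assumed paths, matching the DMI_* constants used in diagnostics.rs.
const DMI_SYS_VENDOR_FILE: &str = "/sys/class/dmi/id/sys_vendor";
const DMI_PRODUCT_NAME_FILE: &str = "/sys/class/dmi/id/product_name";

/// Sketch: detect additional hypervisors from the DMI vendor/product strings.
fn get_virtualization_info() -> (bool, String) {
    let vendor = fs::read_to_string(DMI_SYS_VENDOR_FILE)
        .map(|s| s.trim().to_lowercase())
        .unwrap_or_default();
    let product = fs::read_to_string(DMI_PRODUCT_NAME_FILE)
        .map(|s| s.trim().to_lowercase())
        .unwrap_or_default();

    if vendor.contains("qemu") || product.contains("kvm") {
        return (true, "qemu/kvm".to_string());
    }
    if vendor.contains("microsoft corporation") && product.contains("virtual machine") {
        return (true, "hyperv".to_string());
    }
    if vendor.contains("vmware") || product.contains("vmware") {
        return (true, "vmware".to_string());
    }
    if vendor.contains("innotek") || product.contains("virtualbox") {
        return (true, "virtualbox".to_string());
    }
    if vendor.contains("xen") || product.contains("hvm domu") {
        return (true, "xen".to_string());
    }

    (false, "none detected".to_string())
}

fn main() {
    let (is_vm, platform) = get_virtualization_info();
    println!("virtualized: {is_vm}, platform: {platform}");
}
```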

    /// Path where the support bundle will be saved
    #[clap(short, long)]
    output: PathBuf,
    /// Include full system journal and dmesg output

Copilot AI Dec 16, 2025


The comment mentions "Include full system journal and dmesg output" but the implementation only collects the journal via collect_full_journal(). There's no dmesg collection in the code. Either update the comment to remove "and dmesg output" or add dmesg collection to the implementation.
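
If dmesg collection is the intended behavior, a minimal sketch of capturing it alongside the journal might look like the following; the dmesg.log file name, the staging-directory approach, and the error handling style are assumptions, not this PR's implementation:

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Sketch: dump the kernel ring buffer into the bundle staging directory.
/// Reading the ring buffer typically requires root.
fn collect_dmesg(staging_dir: &Path) -> io::Result<()> {
    let output = Command::new("dmesg").output()?;
    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            "dmesg exited with a non-zero status",
        ));
    }
    fs::write(staging_dir.join("dmesg.log"), &output.stdout)
}

fn main() -> io::Result<()> {
    collect_dmesg(Path::new("/tmp"))
}
```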

Comment on lines +121 to +144
func CopyRemoteFileToLocal(client *ssh.Client, remotePath string, localFilePath string) error {
    sftpClient, err := sftp.NewSftpSudoClient(client)
    if err != nil {
        return fmt.Errorf("failed to create SFTP client: %w", err)
    }
    defer sftpClient.Close()

    remoteFile, err := sftpClient.Open(remotePath)
    if err != nil {
        return fmt.Errorf("failed to open remote file %s: %w", remotePath, err)
    }
    defer remoteFile.Close()

    localFile, err := os.Create(localFilePath)
    if err != nil {
        return fmt.Errorf("failed to create local file: %w", err)
    }
    defer localFile.Close()

    if _, err := io.Copy(localFile, remoteFile); err != nil {
        return fmt.Errorf("failed to copy file to local system: %w", err)
    }

    return nil

Copilot AI Dec 16, 2025


If the io.Copy operation fails, the partially written local file is not cleaned up. Consider removing the local file on error to avoid leaving incomplete files on disk. You could either manually delete the file in an error path, or track the success and defer a conditional cleanup.

Comment on lines +117 to +120
bundlePath := "/tmp/bundle"
if env == stormenv.TridentEnvironmentContainer {
bundlePath = "/host/tmp/bundle"
}

Copilot AI Dec 16, 2025


The bundle path is missing the .tar.zst file extension. The diagnose command in the code expects an output path that includes the extension (as shown in documentation and Rust tests), but this code passes /tmp/bundle without an extension. This will likely create a file without the standard extension, which could confuse users and tools. Consider changing to /tmp/bundle.tar.zst.

Comment on lines +9 to +11
```bash
sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst
```

Copilot AI Dec 16, 2025


These examples recommend running sudo trident diagnose with --output /tmp/..., but the diagnostics bundle includes sensitive data (datastores, system journals, SELinux logs, configuration) and is written with default file permissions, so on a multi-user system another local user may be able to read the bundle directly from /tmp. To avoid unintentionally exposing system data locally, consider recommending an output directory that is only accessible to root and/or documenting that the bundle file’s permissions should be restricted (for example by using a non-world-readable directory or a restrictive umask).
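
One possible mitigation on the implementation side (not what this PR does; just a sketch under the assumption that owner-only permissions are acceptable): create the bundle file with mode 0o600 so the output directory matters less.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Sketch: open the bundle output file with mode 0o600 so only the creating
/// user (typically root) can read it, regardless of which directory it lives in.
fn create_restricted_bundle_file(path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .mode(0o600)
        .open(path)
}

fn main() -> io::Result<()> {
    let _bundle = create_restricted_bundle_file(Path::new("/tmp/trident-diagnostics.tar.zst"))?;
    Ok(())
}
```

With something like this in place, the docs could keep recommending /tmp while noting that the bundle itself is not world-readable.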

Comment on lines +313 to +315
```bash
ssh -i $HOME/.ssh/id_rsa tutorial-user@$TARGET_MACHINE_IP sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst
```

Copilot AI Dec 16, 2025


This tutorial suggests generating a diagnostics bundle with sudo trident diagnose --output /tmp/trident-diagnostics.tar.zst, but the bundle contains sensitive data (Trident datastore, logs, system configuration) and is created with default file permissions, so any other local user may be able to read it from /tmp on a multi-user system. To reduce the risk of local data exposure, consider recommending an output path under a root-only directory (e.g., /root/...) or otherwise documenting that the bundle file should have restrictive permissions instead of being written in a world-accessible location.

Copilot AI review requested due to automatic review settings December 17, 2025 22:22
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.



@frhuelsz
Contributor

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

frhuelsz previously approved these changes Dec 18, 2025
Comment on lines 31 to 44
pub struct DiagnosticsReport {
    /// Timestamp when the report was generated
    pub timestamp: String,
    /// Trident version
    pub version: String,
    /// Host description (VM/baremetal)
    pub host_description: HostDescription,
    /// Host status from the datastore
    pub host_status: Option<HostStatus>,
    /// Collected files metadata
    pub collected_files: Option<Vec<FileMetadata>>,
    /// Collection failures that occurred during diagnostics gathering
    pub collection_failures: Vec<CollectionFailure>,
}
Contributor


nit: i think none of these structs or fields need to be pub

    .as_ref()
    .and_then(|hs| hs.spec.trident.datastore_path.parent())
{
    if let Ok(entries) = fs::read_dir(log_dir) {
Contributor


no error handling here
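
For illustration, a self-contained sketch of one way to surface the failure instead of silently skipping the directory; the plain-string failure list and the example path are stand-ins, since the PR's CollectionFailure type and log directory layout may differ:

```rust
use std::fs;
use std::path::Path;

/// Sketch: record read_dir failures instead of dropping them on the floor.
fn collect_log_names(log_dir: &Path, failures: &mut Vec<String>) -> Vec<String> {
    match fs::read_dir(log_dir) {
        Ok(entries) => entries
            .flatten()
            .filter_map(|entry| entry.file_name().into_string().ok())
            .filter(|name| name.starts_with("trident-"))
            .collect(),
        Err(e) => {
            failures.push(format!(
                "failed to read log directory {}: {e}",
                log_dir.display()
            ));
            Vec::new()
        }
    }
}

fn main() {
    let mut failures = Vec::new();
    let logs = collect_log_names(Path::new("/var/lib/trident"), &mut failures);
    println!("collected {} log files, {} failures", logs.len(), failures.len());
}
```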
