Skip to content

Comments

feat: Add TraceLens integration for trace analysis with MLflow upload#439

Open
gphuang wants to merge 60 commits intomainfrom
feat/12-enable-tracelens-analysis
Open

feat: Add TraceLens integration for trace analysis with MLflow upload#439
gphuang wants to merge 60 commits intomainfrom
feat/12-enable-tracelens-analysis

Conversation

@gphuang
Copy link

@gphuang gphuang commented Dec 18, 2025

feat: Add TraceLens integration for trace analysis with MLflow upload

Adds TraceLens trace analysis capability to automatically generate performance
reports from PyTorch profiler traces and upload them to MLflow.

Addresses review feedback and adds tests for TraceLens report generation and MLflow artifact upload. Keeps MLflow opt-in, makes local-only TraceLens work without MLflow, and tightens safety/docs.

Features

  • Generate TraceLens analysis reports (XLSX, CSV formats)
  • Auto-install TraceLens from GitHub if not present
  • Upload reports to MLflow artifacts/trace_analysis/

Config Options

  • mlflow_upload_tracelens_report: false # Enable TraceLens analysis
  • mlflow_tracelens_output_format: all # all, xlsx, or csv

Example

Code
`
export CONFIG_NAME="deepseek_v2_lite-FP8-pretrain"

export EXP="examples/megatron/configs/MI300X/${CONFIG_NAME}.yaml"

export MLFLOW_RUN_NAME="${CONFIG_NAME}.single-node.baseline"

bash ./examples/run_pretrain.sh
--train_iters=5
--profile_step_start=2
--profile_step_end=4
--profile_ranks=ALL
--mlflow_run_name=${MLFLOW_RUN_NAME}
--mlflow_experiment_name=/Performance-data/Megatron-LM/primus-test
--mlflow_upload_performance_metrics=True
--mlflow_upload_tracelens_report=True
`

Output
Single node run
Multi (2) node

- Add TraceLens trace analysis report generation (XLSX, CSV formats)
- Add mlflow_upload_tracelens_report config option (default: false)
- Add mlflow_tracelens_ranks, mlflow_tracelens_max_reports options
- Add mlflow_tracelens_output_format option (all, xlsx, csv)
- Auto-install TraceLens from GitHub if not present
- Upload analysis reports to MLflow artifacts/trace_analysis/
Copilot AI review requested due to automatic review settings December 18, 2025 09:07
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds TraceLens integration to automatically generate performance analysis reports from PyTorch profiler traces and upload them to MLflow. TraceLens is auto-installed from GitHub if not present, and users can configure rank filtering, report limits, and output formats (XLSX/CSV/HTML).

Key changes:

  • New module for MLflow artifact management with TraceLens integration
  • Automatic TraceLens installation from GitHub with fallback CSV generation
  • Configuration options to control trace analysis (ranks, max reports, output formats)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.

File Description
primus/backends/megatron/training/mlflow_artifacts.py New 725-line module implementing trace/log file uploads and TraceLens report generation with fallback CSV summary
primus/backends/megatron/training/global_vars.py Adds import and wrapper function upload_mlflow_artifacts to expose artifact upload functionality
primus/modules/trainer/megatron/trainer.py Calls upload_mlflow_artifacts before ending MLflow run with configuration parameters from args
primus/configs/modules/megatron/primus_megatron_module.yaml Adds 6 new configuration options for controlling trace/log uploads and TraceLens analysis
Comments suppressed due to low confidence (2)

primus/backends/megatron/training/mlflow_artifacts.py:382

  • Variable dfs is not used.
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

primus/backends/megatron/training/mlflow_artifacts.py:370

  • This assignment to 'dfs' is unnecessary as it is redefined before this value is used.
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings December 18, 2025 09:25
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings December 18, 2025 10:31
@gphuang gphuang force-pushed the feat/12-enable-tracelens-analysis branch from df2e40a to 2861bdf Compare December 18, 2025 10:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…config parser

Addresses Copilot review comment: if mlflow_tracelens_ranks is configured
as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would
receive a string instead of a list, causing _filter_traces_by_rank to
silently filter out all trace files.

Added ast.literal_eval() conversion in:
- generate_tracelens_reports()
- upload_tracelens_reports_to_mlflow()

Falls back to None (process all ranks) with a warning if parsing fails.
When output_format='all', previously the trace file was parsed twice:
- Once for XLSX generation
- Once for CSV generation

Now when format is 'all', we call generate_perf_report_pytorch once
with both output_xlsx_path and output_csvs_dir parameters, parsing
the trace file only once and generating both formats from the same data.

This improves performance significantly for the common use case of
generating both report formats.
Copilot AI review requested due to automatic review settings December 18, 2025 12:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

After TraceLens reports are successfully uploaded to MLflow, the local
tracelens_reports directory is automatically cleaned up to save disk space.

This addresses the issue of temporary directories not being cleaned up
after artifact upload. The reports remain accessible in MLflow while
freeing up local storage.

Other directories checked:
- tensorboard_dir: Contains original trace files, NOT temporary
- exp_root_path/logs: Contains original log files, NOT temporary
- tracelens_reports: Processed reports uploaded to MLflow, safe to cleanup
Added mlflow_tracelens_cleanup_after_upload parameter to control whether
local TraceLens reports are removed after upload to MLflow.

Default: True (cleanup to save disk space)
Set to False to keep reports locally for inspection/debugging

Changes:
- Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow()
- Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow()
- Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true)
- Updated trainer to pass through the parameter

Use cases:
- True (default): Production runs, save disk space
- False: Development/debugging, keep local copies for inspection
Copilot AI review requested due to automatic review settings December 18, 2025 13:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

gphuang and others added 2 commits February 11, 2026 08:57
Normalize TraceLens output formats, improve CSV handling, add auto-install
controls and timeouts, and harden MLflow/ROCm metric handling and docs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid duplicate local generation in distributed runs, align default
output_format with xlsx, and downgrade to CSV when openpyxl is unavailable.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings February 11, 2026 09:25
Update docs to state the last rank (writer) performs TraceLens artifact uploads
in distributed runs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.

gphuang and others added 2 commits February 11, 2026 09:45
Align mlflow_setup.py docstring with the actual default of 'xlsx'.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings February 11, 2026 09:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

gphuang and others added 2 commits February 11, 2026 13:18
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings February 11, 2026 13:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Co-authored-by: Cursor <cursoragent@cursor.com>
@gphuang
Copy link
Author

gphuang commented Feb 18, 2026

@wenxie-amd @Xiaoming-AMD @limou102 Could you please review?

Copilot AI review requested due to automatic review settings February 20, 2026 07:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

primus/configs/modules/megatron/primus_megatron_module.yaml:36

  • The comment text on line 35-36 spans multiple physical lines in the YAML file. Consider breaking this into separate comment lines (each starting with #) for better readability and to avoid potential YAML parsing issues in some parsers, e.g.:
# TraceLens report format: xlsx (default; single parse, fastest), csv, or all
# (xlsx+csv; parses each trace twice so ~2x processing time; use only when both formats are needed)
# TraceLens report format: xlsx (default; single parse, fastest), csv, or all (xlsx+csv;
# parses each trace twice so ~2x processing time; use only when both formats are needed)

primus/modules/trainer/megatron/trainer.py:2228

  • The all_gather operation at line 2228 introduces a synchronization point that occurs on every log_interval when mlflow_upload_performance_metrics is enabled. This could impact training performance, especially with frequent logging. The config documentation mentions this (line 66), but consider also adding a code comment here explaining the performance trade-off and why all ranks must participate regardless of whether they have mlflow_writer.
                    dist.all_gather(gathered_utils, util_tensor)

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings February 23, 2026 07:47
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment on lines +241 to +247
subprocess.run(
[sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
check=True,
)
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pip install openpyxl subprocess call has no timeout and only catches CalledProcessError. In offline / stalled pip scenarios this can hang training shutdown indefinitely; add a timeout and handle TimeoutExpired (similar to the TraceLens install path).

Copilot uses AI. Check for mistakes.
Comment on lines +553 to +557
# Only ensure openpyxl when XLSX output is requested (avoids pip install in CSV-only or restricted envs)
if output_format in ("xlsx", "all"):
if not _ensure_openpyxl_installed():
warning_rank_0("[TraceLens] openpyxl unavailable; downgrading output_format to 'csv'.")
output_format = "csv"
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mlflow_tracelens_auto_install controls whether the code should attempt runtime installs, but XLSX generation will still attempt to install openpyxl via _ensure_openpyxl_installed(). This makes "auto_install=false" an incomplete opt-out; consider gating the openpyxl install attempt behind the same flag (or a separate explicit flag) and falling back to CSV when installs are disabled.

Suggested change
# Only ensure openpyxl when XLSX output is requested (avoids pip install in CSV-only or restricted envs)
if output_format in ("xlsx", "all"):
if not _ensure_openpyxl_installed():
warning_rank_0("[TraceLens] openpyxl unavailable; downgrading output_format to 'csv'.")
output_format = "csv"
# Only ensure openpyxl when XLSX output is requested (avoids pip install in CSV-only or restricted envs).
# Respect mlflow_tracelens_auto_install to prevent unwanted runtime installations.
if output_format in ("xlsx", "all"):
if not mlflow_tracelens_auto_install:
warning_rank_0(
"[TraceLens] mlflow_tracelens_auto_install is False; "
"falling back to CSV output to avoid installing openpyxl."
)
output_format = "csv"
elif not _ensure_openpyxl_installed():
warning_rank_0(
"[TraceLens] openpyxl unavailable; downgrading output_format to 'csv'."
)
output_format = "csv"

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +92
should_generate_locally = is_rank_zero and (not mlflow_expected or not is_distributed)
if should_generate_locally and generate_tracelens_report and tensorboard_dir and exp_root_path:
generate_tracelens_reports_locally(
tensorboard_dir=tensorboard_dir,
exp_root_path=exp_root_path,
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branching logic for local-only TraceLens generation when mlflow_writer is None (rank selection + mlflow_expected/is_distributed gating) isn’t covered by unit tests. Adding tests for the key scenarios (distributed vs single-rank; mlflow_run_name set/unset; generate_tracelens_report true/false) would help prevent regressions.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants