Skip to content

Conversation

@bangtianliu
Copy link

@bangtianliu bangtianliu commented Jan 26, 2026

This PR adds the support for tuning specs provided by tuner to enable the performance improvement through tuning.

Local testing shows ~8.2% performance improvement on bf16 matrix multiplication (matmul -M 8192 -N 2048 -K 4096 --transA --a_type bf16 --b_type bf16 --out_type bf16) when using tuned transform dialect specs.

Assisted-by: Claude

@bangtianliu bangtianliu marked this pull request as draft January 26, 2026 07:15
@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch from a7292b7 to aeb017b Compare January 26, 2026 07:17
@bangtianliu
Copy link
Author

The used td spec produced by tuner:

module attributes {iree_codegen.tuning_spec_with_default_entrypoint, transform.with_named_sequence} {
  transform.named_sequence @apply_op_config(%arg0: !transform.any_op {transform.readonly}, %arg1: !transform.any_param {transform.readonly}) {
    transform.annotate %arg0 "compilation_info" = %arg1 : !transform.any_op, !transform.any_param
    transform.yield 
  }
  transform.named_sequence @match_main$async_dispatch_0_matmul_8192x2048x4096_bf16xbf16xf32(%arg0: !transform.any_op {transform.readonly}) -> (!transform.any_op, !transform.any_param) {
    %batch_dims, %m_dims, %n_dims, %k_dims = transform.iree.match.contraction %arg0, lhs_type = bf16, rhs_type = bf16, output_type = f32, indexing_maps = [affine_map<(d0, d1, d2) -> (d2, d0)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>] : !transform.any_op -> !transform.param<i64>
    transform.iree.match.dims_equal %batch_dims, [] : !transform.param<i64>
    transform.iree.match.dims_equal %m_dims, [8192] : !transform.param<i64>
    transform.iree.match.dims_equal %n_dims, [2048] : !transform.param<i64>
    transform.iree.match.dims_equal %k_dims, [4096] : !transform.param<i64>
    %0 = transform.param.constant #iree_codegen.compilation_info<lowering_config = #iree_gpu.lowering_config<{mma_kind = #iree_gpu.mma_layout<WMMAR4_F32_16x16x16_BF16>, promote_operands = [0, 1], reduction = [0, 0, 1], subgroup = [4, 4, 0], subgroup_basis = [[4, 2, 1], [0, 1, 2]], workgroup = [256, 128, 0]}>, translation_info = <pipeline = LLVMGPUTileAndFuse workgroup_size = [256, 1, 1] subgroup_size = 32, {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_num_stages = 2>, llvm_func_attrs = {"amdgpu-waves-per-eu" = "2"}}>> -> !transform.any_param
    transform.yield %arg0, %0 : !transform.any_op, !transform.any_param
  }
  transform.named_sequence @__kernel_config(%arg0: !transform.any_op {transform.consumed}) -> !transform.any_op attributes {iree_codegen.tuning_spec_entrypoint} {
    %updated_root = transform.foreach_match in %arg0 
        @match_main$async_dispatch_0_matmul_8192x2048x4096_bf16xbf16xf32 -> @apply_op_config : (!transform.any_op) -> !transform.any_op
    transform.yield %updated_root : !transform.any_op
  }
}

Results (AMD gfx1201):

Configuration Mean Latency Min Latency Max Latency Improvement
Baseline (no tuning spec) 1775.95 μs 1751.00 μs 1839.00 μs -
With tuning spec 1630.05 μs 1612.00 μs 1682.00 μs 8.2% faster

@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch 3 times, most recently from db0456f to a974e07 Compare January 26, 2026 22:39
Copy link
Member

@sjain-stanford sjain-stanford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in draft but I just wanted to leave a high level "directionary" feedback before you go down this path further. At this stage while the tuning integration is in a POC, we don't want to bake the flags into fusilli/backend/*. You could instead have a catch-all variable for any additional flags to be specified (e.g. through env variables / CLI or so). This let's us evaluate forward looking things without baking the design in too early making it hard to refactor to a more optimal integration later.

@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch 6 times, most recently from d9fe6db to a3e39a2 Compare January 26, 2026 23:40
@bangtianliu bangtianliu marked this pull request as ready for review January 26, 2026 23:40
@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch 5 times, most recently from 484f316 to 60069c4 Compare January 27, 2026 18:11
Copy link
Member

@kuhar kuhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have a generic escape hatch for passing extra compiler flags? In other projects we use --Xiree-compile='....' or --Xiree_compile='...': https://github.com/search?q=repo%3Anod-ai%2Firee-kernel-benchmark%20xiree&type=code , similar to -Xclang and others: https://clang.llvm.org/docs/ClangCommandLineReference.html#compilation-options

@bangtianliu
Copy link
Author

Would it make sense to have a generic escape hatch for passing extra compiler flags? In other projects we use --Xiree-compile='....' or --Xiree_compile='...': https://github.com/search?q=repo%3Anod-ai%2Firee-kernel-benchmark%20xiree&type=code

I think it's a good suggestion. The generic --Xiree-compile='...' escape hatch would be more flexible.

Actually, looking at the current hardcoded flags in backend.h, some like --iree-dispatch-creation-enable-split-reduction might be better as user-configurable options rather than always-on. The --Xiree-compile approach would let users experiment with enabling/disabling these optimizations.

@sjain-stanford what do you think? we could also introduce --Xiree-compile in a separate PR to keep this PR focused.

@sjain-stanford
Copy link
Member

Yeah that's what I was initially thinking - a fully general escape hatch for compiler flags. This would be mainly used for tuning for now, but could be extended for debug purposes. Thanks @kuhar for the suggestion.

@sjain-stanford
Copy link
Member

I think using a single environment variable to accept all extra flags seems reasonable - @kuhar do you see concerns with this?
Then we can just document that one variable in the README and not go into specifics for tuning etc. I'd still update the README in the nod-ai/fusilli-benchmarks repo with the exact flag to set so folks who run benchmarks will know how to use it.

@bangtianliu bangtianliu requested a review from kuhar January 28, 2026 16:10
Comment on lines +235 to +243
std::vector<std::string> flags;
std::istringstream iss(flagsStr);
std::string token;

while (iss >> std::quoted(token)) {
if (!token.empty()) {
flags.push_back(token);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we trying to parse the flags? Is there no API to pass a string to the compiler? I'm asking because I'm not sure we can meaningfully split the flags (for example with spaces inside).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we trying to parse the flags? Is there no API to pass a string to the compiler? I'm asking because I'm not sure we can meaningfully split the flags (for example with spaces inside).

Here is the existing fusilli implmentation

inline std::span<const std::string> getBackendFlags(Backend backend) {
static std::once_flag initFlag;
static std::unordered_map<Backend, std::vector<std::string>> kBackendFlags;
std::call_once(initFlag, []() {
std::vector<std::string> cpuFlags = {
"--iree-hal-target-backends=llvm-cpu",
"--iree-llvmcpu-target-cpu=host",
};
// Specify a HIP target for AMD GPU. First attempts to get the SKU name
// (e.g., `mi300x`) via `amd-smi` for optimal compiler tuning, then falls
// back to architecture (e.g., `gfx942`) via `rocm_agent_enumerator`.
// See:
// https://iree.dev/guides/deployment-configurations/gpu-rocm/#choosing-hip-targets
auto hipTarget = getIreeHipTargetForAmdgpu();
std::vector<std::string> amdGpuFlags = {
// clang-format off
"--iree-hal-target-backends=rocm",
std::format("--iree-hip-target={}", hipTarget),
"--iree-opt-level=O3",
"--iree-preprocessing-pass-pipeline=builtin.module(util.func(iree-preprocessing-convert-conv-filter-to-channels-last))",
"--iree-flow-enable-pad-handling",
"--iree-global-opt-propagate-transposes-through-conv",
"--iree-global-opt-enable-sink-transpose-through-pad",
"--iree-dispatch-creation-enable-fuse-padding-into-linalg-consumer-ops",
"--iree-dispatch-creation-enable-aggressive-reshape-movement",
"--iree-dispatch-creation-enable-split-reduction",
// clang-format on
};
kBackendFlags[Backend::CPU] = std::move(cpuFlags);
kBackendFlags[Backend::AMDGPU] = std::move(amdGpuFlags);
});
return kBackendFlags.at(backend);
}

Both the CLI and C API compilation backends expect individual flag strings in a vector, not a single concatenated string.

CLI Backend (CompileCommand::build()):

std::vector<std::string> args = {...};
for (const auto &flag : getBackendFlags()) {
    args.push_back(escapeArgument(flag));  // Each flag is separate
}

C API Backend (CompileSession::addFlag()):

ErrorObject addFlag(const std::string &flag) {
     ireeCompilerSessionSetFlags(session, 1, &flagCStr);  // Takes single flag
 }

Since getBackendFlags() returns std::vectorstd::string, we need to split the environment variable string into individual flags.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can meaningfully split the flags (for example with spaces inside)

The std::quoted manipulator handles typical IREE compiler flags in --flag=value format, covering common cases. However, it relies on users to properly quote flag values that contain spaces.

Copy link
Member

@kuhar kuhar Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about things like --pass-pipeline=(...) that may contain both spaces and options with quotes inside. But maybe what you have is good enough for the time being. We should add more tests though -- ideally unittests for parsing only. @sjain-stanford does fusilli support units tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we solved the same issue in the tuner was through using flag file where each line is treated as a separate argument so that we don't have to do any parsing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you rely on CLI flags only, -Xiree-compile will already split the flags for you. I think we are only having this issue because we are adding env vars on top of this?

Copy link
Author

@bangtianliu bangtianliu Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, -Xiree-compile is used for run_benchmark.py. Currently we have this issue with FUSILLI_EXTRA_COMPILER_FLAGS, which is currently only used for passing td spec. We can have a separate a PR about using flag file and then move all the flags

    std::vector<std::string> amdGpuFlags = {
        // clang-format off
                "--iree-hal-target-backends=rocm",
                std::format("--iree-hip-target={}", hipTarget),
                "--iree-opt-level=O3",
                "--iree-preprocessing-pass-pipeline=builtin.module(util.func(iree-preprocessing-convert-conv-filter-to-channels-last))",
                "--iree-flow-enable-pad-handling",
                "--iree-global-opt-propagate-transposes-through-conv",
                "--iree-global-opt-enable-sink-transpose-through-pad",
                "--iree-dispatch-creation-enable-fuse-padding-into-linalg-consumer-ops",
                "--iree-dispatch-creation-enable-aggressive-reshape-movement",
                "--iree-dispatch-creation-enable-split-reduction",
        // clang-format on
    };

to the flag file.

Copy link
Member

@sjain-stanford sjain-stanford Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we treat the extra flags as a single string, that'd only work for CLI (provided escapeArgument doesn't mangle it). I think the CAPI approach to adding flags is more restrictive in that ireeCompilerSessionSetFlags expects single flags. So parsing + splitting is needed.

That said, @kuhar 's concern is valid - parsing is fragile and assumes users to properly quote flags with spaces.

I'm fine with this under the following assumptions:

  • we don't plan on using this "extra flags" escape hatch in user-facing production use-cases and it's more for experimental use by fusilli developers and superusers who understand how to pass extra flags
  • we document the restrictions for passing extra flags clearly (both README as well as checks in the code)
  • add unit tests for parseCompilerFlags for all tricky use-cases and make sure it handles incorrectly formatted strings and propagates the errors up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have a separate a PR about using flag file and then move all the flags to the flag file.

I'd rather not have the baked-in / permanent flags exposed for manipulation by the end-users through a flag file, so my suggestion is let's not jump the gun on this yet. Ideally we test the extra flags flow with the current env variable setup a bit more to surface pain points / gaps.

Signed-off-by: Bangtian Liu <liubangtian@gmail.com>
@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch from 9d79198 to 91f77c4 Compare January 28, 2026 17:52
@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch from 04f6fb5 to efc27e9 Compare January 28, 2026 18:53
Signed-off-by: Bangtian Liu <liubangtian@gmail.com>
@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch from efc27e9 to a51ee85 Compare January 28, 2026 18:57
REQUIRE(result[2] == "--iree-hip-target=mi300x");
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add some case(s) with single-quoted contents?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

// Tests for parseCompilerFlags
//===----------------------------------------------------------------------===//

TEST_CASE("parseCompilerFlags", "[backend][flags]") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +293 to +295
// Add extra flags to both CPU and AMDGPU backends.
addExtraFlags(cpuFlags);
addExtraFlags(amdGpuFlags);
Copy link
Member

@sjain-stanford sjain-stanford Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd asked this elsewhere, but what happens if users specify a GPU specific flag (say --iree-hip-target..) through the FUSILLI_EXTRA_COMPILER_FLAGS? It get's added to both cpuFlags and amdGpuFlags - would it crash the compiler when the selected backend is CPU, or would it silently ignore it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I don't understand the distinction -- I don't think the compiler c api has separate entry points for these

Copy link
Author

@bangtianliu bangtianliu Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then how about adding extra flags to AMDGPU backend only?

Copy link
Member

@sjain-stanford sjain-stanford Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the compiler c api has separate entry points for these

OK then it's probably fine to say this is user responsibility (to make sure the flags provided match the backend selected).

I don't understand the distinction

The reason we have a map that stores the cpuFlags and amdGpuFlags separately is to allow multiple sessions+graphs to co-exist. So say there's a machine with AMDGPU, users might create graph 1 on the cpu handle and graph 2 on the gpu handle. In such cases, they'd want to make sure the extra flags are not commingled for both backends.

Copy link
Member

@sjain-stanford sjain-stanford Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It get's added to both cpuFlags and amdGpuFlags - would it crash the compiler when the selected backend is CPU, or would it silently ignore it?

I thought about this. Since the compiler is basically cross-compiling for whatever target is specified, it likely won't crash the compilation itself and a vmfb should get generated. The issue is when the vmfb is executed on a backend that it was not compiled for (through flags targeting a different backend). So again it becomes a user responsibility to make sure they specify extra flags pertaining to the correct backend.

I think we can leave it as is for now and if this becomes a prevalent pattern we could revisit with alternate approaches (e.g. use backend specific env variable or read extra flags from backend specific files).

Signed-off-by: Bangtian Liu <liubangtian@gmail.com>
@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch from 7b6b4ac to f6700a2 Compare January 28, 2026 21:22
Signed-off-by: Bangtian Liu <liubangtian@gmail.com>
@bangtianliu bangtianliu force-pushed the bangtian/add_tuning_spec_support branch from a7995bb to 2f71414 Compare January 28, 2026 22:46
@bangtianliu
Copy link
Author

bangtianliu commented Jan 28, 2026

I've updated the tests to follow the established pattern in the codebase. After examining other test files like test_conv_attributes.cpp, I found that the codebase uses simple REQUIRE with == for comparing vectors, rather than REQUIRE_THAT with matchers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants