Trtllm-gen Fp8 MoE Autotunner #1494
base: main
Conversation
Summary of Changes
Hello @aleozlx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an autotuner specifically designed for FP8 Mixture-of-Experts (MoE) operations within the `trtllm-gen` framework. It extends existing autotuning capabilities by enabling automatic selection of optimal kernel configurations for various FP8 MoE variants, including both per-tensor and block-scale quantization. The changes involve significant refactoring of the autotuner's core API to support more complex dynamic tensor specifications and custom initialization during profiling, alongside modifications to the underlying C++ kernels to allow external configuration control. A new benchmark script has been added to validate the performance improvements across different quantization modes.
Highlights
- FP8 MoE Autotuning: Implemented a dedicated `MoERunner` for FP8 Mixture-of-Experts (MoE) operations, enabling automatic selection of optimal kernel configurations based on input shapes and data types.
- Flexible Autotuner API: Enhanced the core autotuner framework to support more complex tuning scenarios, including linking multiple dynamic tensor dimensions and providing custom tensor initialization during profiling (see the sketch below).
- C++ Kernel Integration: Modified C++ MoE kernels to accept a `config_index` parameter, allowing the Python autotuner to directly control kernel selection. New C++ functions were added to expose valid kernel configurations.
- Comprehensive Benchmarking and Validation: Introduced a new benchmark script to evaluate the performance gains of the autotuner across different FP8 and FP4 MoE quantization modes, demonstrating its effectiveness.
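To make the "Flexible Autotuner API" bullet concrete, here is a rough, hypothetical sketch of how a linked dynamic dimension might be declared, pieced together from the `DynamicTensorSpec` usages visible later in this diff. The import path and the exact constructor signature are assumptions, not confirmed API.

```python
# Hypothetical sketch only: the import path and constructor signature are assumed
# from the DynamicTensorSpec usages shown in this PR's diff, not confirmed API.
from flashinfer.autotuner import (  # assumed module path
    DynamicTensorSpec,
    get_last_power_of_2_num_tokens_buckets,
    last_positive_power_of_2,
)

tune_max_num_tokens = 8192
spec = DynamicTensorSpec(
    (0, 1, 2, 3, 4, 5),  # indices of the input tensors that share the dynamic dim
    (0, 0, 0, 0, 0, 0),  # the dimension (num_tokens) linked across those tensors
    get_last_power_of_2_num_tokens_buckets(tune_max_num_tokens, 1),  # tuning buckets
    lambda x: min(last_positive_power_of_2(x), tune_max_num_tokens),  # runtime shape -> bucket
)
```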
Code Review
This pull request introduces an autotuner for FP8 MoE kernels and refactors the autotuning framework to be more general and extensible. The changes are well-structured, adding support for configurable kernel selection at runtime to optimize performance. My feedback focuses on improving maintainability by addressing code duplication and enhancing readability by replacing magic numbers with named constants.
```cpp
int64_t trtllm_get_default_moe_configs(int64_t const tile_tokens_dim, int64_t const dtype_act_,
                                       int64_t const dtype_weights_, bool const useDeepSeekFp8,
                                       int64_t const top_k, int64_t const hidden_size,
                                       int64_t const intermediate_size,
                                       int64_t const num_local_experts, int64_t const num_tokens) {
  auto dtype_act = static_cast<btg::Dtype>(dtype_act_);
  auto dtype_weights = static_cast<btg::Dtype>(dtype_weights_);
  tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner moe_runner(
      dtype_act, dtype_weights, useDeepSeekFp8, (int32_t)tile_tokens_dim,
      tensorrt_llm::kernels::ActType::SwiGlu, /*useShuffledMatrixA*/ true);
  return moe_runner.getDefaultValidConfigIndex(top_k, hidden_size, intermediate_size,
                                               num_local_experts, num_tokens);
}

std::vector<int64_t> trtllm_get_valid_moe_configs(
    int64_t const tile_tokens_dim, int64_t const dtype_act_, int64_t const dtype_weights_,
    bool const useDeepSeekFp8, int64_t const top_k, int64_t const hidden_size,
    int64_t const intermediate_size, int64_t const num_local_experts, int64_t const num_tokens) {
  auto dtype_act = static_cast<btg::Dtype>(dtype_act_);
  auto dtype_weights = static_cast<btg::Dtype>(dtype_weights_);
  tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner moe_runner(
      dtype_act, dtype_weights, useDeepSeekFp8, (int32_t)tile_tokens_dim,
      tensorrt_llm::kernels::ActType::SwiGlu, /*useShuffledMatrixA*/ true);
  return moe_runner.getValidConfigIndices(top_k, hidden_size, intermediate_size, num_local_experts,
                                          num_tokens);
}
```
The functions `trtllm_get_default_moe_configs` and `trtllm_get_valid_moe_configs` contain duplicated code for creating the `tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner`. To improve maintainability and reduce redundancy, consider extracting the runner creation logic into a common helper function. This helper could then be called by both functions.
```python
def __hash__(self) -> int:
    # FIXME: currently not hashing tensor_initializers
    return hash(
        (
            self.input_idx,
            self.dim_idx,
            # For gen_tuning_buckets, only hash if it's a tuple, otherwise hash its id
            self.gen_tuning_buckets
            if isinstance(self.gen_tuning_buckets, tuple)
            else id(self.gen_tuning_buckets),
            id(self.map_to_tuning_buckets),
        )
    )
```
The `DynamicTensorSpec` class has been significantly improved to handle multiple tensors and custom initializers. However, the custom `__hash__` method does not include `tensor_initializers`. While the `FIXME` comment acknowledges this, it's worth highlighting that this could lead to incorrect caching behavior if different initializers are used with otherwise identical specs. If `tensor_initializers` can affect performance or correctness, they should be part of the hash key. If they are complex to hash, you could consider hashing a representation of them, like their `__name__` or `id()`.
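For illustration, here is a minimal sketch of one way the initializers could be folded into the hash, assuming `tensor_initializers` is an iterable of callables (its exact container type is not visible in this diff). This mirrors the existing field names but is not the PR's implementation.

```python
# Minimal sketch, not the PR's implementation. Assumes tensor_initializers is an
# iterable of callables; a stable name is preferred, with id() as a fallback so
# lambdas still hash.
def _initializer_key(fn):
    return getattr(fn, "__qualname__", None) or id(fn)

def __hash__(self) -> int:
    return hash(
        (
            self.input_idx,
            self.dim_idx,
            self.gen_tuning_buckets
            if isinstance(self.gen_tuning_buckets, tuple)
            else id(self.gen_tuning_buckets),
            id(self.map_to_tuning_buckets),
            tuple(_initializer_key(f) for f in self.tensor_initializers),
        )
    )
```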
```python
if dtype in [
    DtypeTrtllmGen.MxE4m3,
    DtypeTrtllmGen.E2m1,
    DtypeTrtllmGen.MxE2m1,
    DtypeTrtllmGen.MxE4m3,
]:
    return True
else:
    return False
```
The list of dtypes contains a duplicate `DtypeTrtllmGen.MxE4m3`. Using a `set` for the membership check is more idiomatic and efficient, and it also naturally handles the duplicate entry. The entire `if/else` block can also be simplified to a single return statement:

```python
return dtype in {
    DtypeTrtllmGen.MxE4m3,
    DtypeTrtllmGen.E2m1,
    DtypeTrtllmGen.MxE2m1,
}
```
```python
def get_tile_tokens_dim(num_tokens, num_experts, top_k):
    # Factor to account for the imbalance of the experts.
    # factor equals to the
    # max_real_num_tokens_per_expert / perfect_num_tokens_per_expert
    # - 1.0 means perfect expert distribution.
    # - > 1.0 means some experts have more
    #   tokens than the perfect distribution.
    # - < 1.0 does not make sense.
    imbalance_factor = 1.3
    # Calculate the number of tokens per expert
    # assuming perfect distribution.
    num_tokens_per_expert = (num_tokens * top_k) // num_experts
    # Apply the imbalance factor.
    num_tokens_per_expert = int(num_tokens_per_expert * imbalance_factor)
    # And pad the number to the next power of 2.
    tile_tokens_dim = next_positive_power_of_2(num_tokens_per_expert)
    # Cap to 8-64 tokens per CTA tile
    # as it's the range supported by the kernel.
    tile_tokens_dim = min(max(tile_tokens_dim, 8), 64)
    return tile_tokens_dim
```
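As a quick sanity check of the logic above, here is a small, self-contained walk-through under assumed inputs; `next_positive_power_of_2` is re-implemented here purely for illustration and may differ from the real helper.

```python
# Illustrative stand-in for the real helper (assumed to round up to a power of 2).
def next_positive_power_of_2(x: int) -> int:
    return 1 if x <= 1 else 1 << (x - 1).bit_length()

# Assumed inputs: 256 tokens routed to 8 experts with top_k = 2.
num_tokens, num_experts, top_k = 256, 8, 2
per_expert = (num_tokens * top_k) // num_experts   # 64 with perfect balance
per_expert = int(per_expert * 1.3)                 # 83 after the imbalance factor
tile = min(max(next_positive_power_of_2(per_expert), 8), 64)
print(tile)  # 64: padded up to 128, then capped to the kernel-supported 8..64 range
```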
```cpp
TypeCmp compactTmp;
memcpy(&compactTmp, &valueBits, sizeof(valueBits));
```
The `.to(torch.bfloat16)` cast breaks the DeepSeekV3 routing. After removing it, there is no need to limit the number of tokens to [8...1024].
```python
        else:
            # FP8 per tensor scale
            return moe_op.trtllm_fp8_per_tensor_scale_moe(
                routing_logits.to(torch.bfloat16),
```
```diff
-                routing_logits.to(torch.bfloat16),
+                routing_logits,
```
```python
        else:
            # FP4 operations
            return moe_op.trtllm_fp4_block_scale_moe(
                routing_logits.to(torch.bfloat16),
```
```diff
-                routing_logits.to(torch.bfloat16),
+                routing_logits,
```
```python
        DynamicTensorSpec(
            (0, 1, 2, 3, 4, 5),
            (0, 0, 0, 0, 0, 0),
            get_last_power_of_2_num_tokens_buckets(tune_max_num_tokens, 8),
```
```diff
-            get_last_power_of_2_num_tokens_buckets(tune_max_num_tokens, 8),
+            get_last_power_of_2_num_tokens_buckets(tune_max_num_tokens, 1),
```
```python
        DynamicTensorSpec(
            (0, 1, 2, 3, 4),
            (0, 0, 0, 0, 0),
            get_last_power_of_2_num_tokens_buckets(tune_max_num_tokens, 8),
```
```diff
-            get_last_power_of_2_num_tokens_buckets(tune_max_num_tokens, 8),
+            get_last_power_of_2_num_tokens_buckets(tune_max_num_tokens, 1),
```
```python
            get_last_power_of_2_num_tokens_buckets(1024, 8),
            lambda x: min(last_positive_power_of_2(x), 1024),
```
```diff
-            get_last_power_of_2_num_tokens_buckets(1024, 8),
-            lambda x: min(last_positive_power_of_2(x), 1024),
+            get_last_power_of_2_num_tokens_buckets(8192, 1),
+            lambda x: min(last_positive_power_of_2(x), 8192),
```
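Neither bucket helper is shown in this diff, so for context here is a hedged sketch of what they plausibly compute; the real implementations in flashinfer may differ. It illustrates why widening the range from (1024, 8) to (8192, 1) enlarges the set of token-count buckets the tuner profiles.

```python
# Illustrative stand-ins only; the real helpers live elsewhere in flashinfer and
# may differ. The assumption is that get_last_power_of_2_num_tokens_buckets
# yields descending powers of two within [min_num_tokens, max_num_tokens].
def last_positive_power_of_2(x: int) -> int:
    return 1 << (max(x, 1).bit_length() - 1)

def get_last_power_of_2_num_tokens_buckets(max_num_tokens: int, min_num_tokens: int = 1):
    buckets, bucket = [], last_positive_power_of_2(max_num_tokens)
    while bucket >= min_num_tokens:
        buckets.append(bucket)
        bucket //= 2
    return tuple(buckets)

print(get_last_power_of_2_num_tokens_buckets(1024, 8))  # (1024, 512, ..., 8)
print(get_last_power_of_2_num_tokens_buckets(8192, 1))  # (8192, 4096, ..., 1)
```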
```python
        enable_pdl: Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
        tune_max_num_tokens: Maximum number of tokens for tuning. (default: 1024)
```
```diff
-        tune_max_num_tokens: Maximum number of tokens for tuning. (default: 1024)
+        tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 8192)
```
```diff
@@ -1367,6 +1935,7 @@ def trtllm_fp4_block_scale_moe(
     routing_method_type: int = 0,
     do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
+    tune_max_num_tokens: int = 1024,
```
```diff
-    tune_max_num_tokens: int = 1024,
+    tune_max_num_tokens: int = 8192,
```
```diff
@@ -1481,6 +2052,7 @@ def trtllm_fp4_block_scale_routed_moe(
     routing_method_type: int = 0,
     do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
+    tune_max_num_tokens: int = 1024,
```
```diff
-    tune_max_num_tokens: int = 1024,
+    tune_max_num_tokens: int = 8192,
```
```diff
@@ -1410,6 +1979,7 @@ def trtllm_fp4_block_scale_moe(
             - 3: Llama4 (Top1 -> Sigmoid)
             - 4: RenormalizeNaive (Softmax -> TopK -> Renormalize)
         do_finalize (bool): Whether to finalize the output (default: False)
+        tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 1024)
```
```diff
-        tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 1024)
+        tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 8192)
```
```diff
@@ -1526,6 +2098,7 @@ def trtllm_fp4_block_scale_routed_moe(
             - 3: Llama4 (Top1 -> Sigmoid)
             - 4: RenormalizeNaive (Softmax -> TopK -> Renormalize)
         do_finalize (bool): Whether to finalize the output (default: False)
+        tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 1024)
```
```diff
-        tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 1024)
+        tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 8192)
```
Thanks for the follow-up PR!
Do we have any updates on this PR? |
📌 Description
Followed #1475, but for FP8.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- All tests are passing (`unittest`, etc.).

Reviewer Notes