benchmark: add allreduce_fusion benchmark #1450
base: main
Conversation
Summary of Changes
Hello @yyihuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I am adding a new benchmark to assess the performance and correctness of the `trtllm_allreduce_fusion` functionality. This benchmark is crucial for validating the efficiency of various all-reduce fusion patterns, including those with integrated residual RMS normalization and different quantization levels (FP8 and FP4). It will run across diverse distributed configurations and data types, providing valuable insights into the operation's behavior under various conditions.
Highlights
- **New Benchmark Addition:** I have introduced a new benchmark file, `benchmarks/bench_trtllm_allreduce_fusion.py`, to evaluate the performance and correctness of the `trtllm_allreduce_fusion` operation.
- **Comprehensive Fusion Pattern Testing:** The benchmark covers a wide range of `AllReduceFusionPattern` types, including standard all-reduce, residual RMS normalization, and various FP8/FP4 quantization schemes, ensuring comprehensive testing of the fusion capabilities.
- **Distributed and Parameterized Testing:** The benchmark is designed to run in a distributed environment, testing different `world_size` (2, 4, 8), `dtype` (float16, bfloat16), and `hidden_dim` (1024 to 8192) configurations, leveraging `torch.distributed` for multi-GPU setups.
- **CUDA Graph Integration:** To ensure accurate and repeatable performance measurements, the benchmark incorporates CUDA graph capture and replay for the `comm.trtllm_allreduce_fusion` calls.
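The capture-and-replay pattern described in the last highlight can be sketched roughly as follows. This is a minimal sketch, not the PR's code: `run_op` is a hypothetical zero-argument callable that launches the fused all-reduce kernel on the current stream using pre-allocated tensors (CUDA graphs require static input/output buffers).

```python
import torch


def capture_graph(run_op, warmup_iters=3):
    """Capture a CUDA graph around `run_op` for repeatable benchmarking.

    `run_op` is a hypothetical zero-arg callable issuing the kernel with
    pre-allocated (static) tensors; the real benchmark would close over the
    actual comm.trtllm_allreduce_fusion arguments.
    """
    # Warm up on a side stream so allocator state stabilizes before capture.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(warmup_iters):
            run_op()
    torch.cuda.current_stream().wait_stream(side_stream)

    # Record a single invocation into the graph; measure with graph.replay().
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        run_op()
    return graph
```

Replaying the captured graph in a timing loop avoids per-launch CPU overhead, which is why the benchmark measures `g.replay()` rather than eager kernel launches.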
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
- Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request adds a new benchmark for `allreduce_fusion`. My review focuses on improving the clarity and correctness of the new benchmark script. I've identified several issues, including misleading variable and function names, the lack of actual performance measurement, a mismatched resource cleanup function, and unresolved TODOs. Addressing these points will make the benchmark more robust and useful.
```python
if test_passed:
    print(
        f"test RANK {rank}: token{token_num}-hidden_dim{hidden_dim}-dtype{dtype}-pattern{pattern_code}-layout{swizzled_layout_code}-pdl{launch_with_pdl} passed"
    )
else:
    print(
        f"test RANK {rank}: token{token_num}-hidden_dim{hidden_dim}-dtype{dtype}-pattern{pattern_code}-layout{swizzled_layout_code}-pdl{launch_with_pdl} failed"
    )
```
The `test_passed` variable (defined on line 88) is always `True`, making the `else` block unreachable. This logic is misleading in a benchmark and should be removed along with the variable definition.

Additionally, a benchmark should ideally measure and report performance. Consider adding timing using `torch.cuda.Event` around the `g.replay()` call to make this benchmark more useful.
```python
print(
    f"Benchmark run finished for RANK {rank}: token{token_num}-hidden_dim{hidden_dim}-dtype{dtype}-pattern{pattern_code}-layout{swizzled_layout_code}-pdl{launch_with_pdl}"
)
```
```python
SCALE_FACTOR_RANGE = (-1, 1)


def _run_correctness_worker(world_size, rank, dtype, hidden_dim, distributed_init_port):
```
The function name `_run_correctness_worker` is misleading for a benchmark file. It suggests a correctness test, but there are no assertions. Please rename it to something like `_run_benchmark_worker` to reflect its purpose.
```diff
-def _run_correctness_worker(world_size, rank, dtype, hidden_dim, distributed_init_port):
+def _run_benchmark_worker(world_size, rank, dtype, hidden_dim, distributed_init_port):
```
```python
    swizzled_layout_code
    == comm.FP4QuantizationSFLayout.SWIZZLED
):
    # TODO(Yingyi): check this
```
```python
finally:
    dist.barrier(group=group)

    comm.trtllm_destroy_ipc_workspace_for_all_reduce(ipc_handles, group=group)
```
The workspace was created with `trtllm_create_ipc_workspace_for_all_reduce_fusion`, but is being destroyed with `trtllm_destroy_ipc_workspace_for_all_reduce`. For consistency and to prevent potential issues if the implementations diverge, you should use the corresponding `trtllm_destroy_ipc_workspace_for_all_reduce_fusion` function.
```python
comm.trtllm_destroy_ipc_workspace_for_all_reduce_fusion(ipc_handles, group=group)
```
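One way to make this class of create/destroy mismatch impossible is to pair the two helpers in a single guard. The following generic context manager is illustrative and not part of the PR; in the benchmark, `create` and `destroy` would be the matching `_fusion` helpers from `flashinfer.comm`.

```python
from contextlib import contextmanager


@contextmanager
def paired_workspace(create, destroy):
    """Yield a workspace handle and guarantee the matching destroy runs.

    `create` and `destroy` are caller-supplied callables (hypothetical here);
    pairing them at one call site prevents mixing, e.g., the
    *_for_all_reduce_fusion create helper with the non-fusion destroy helper.
    """
    handle = create()
    try:
        yield handle
    finally:
        destroy(handle)  # always torn down with the helper it was paired with
```

Usage would then read `with paired_workspace(create_fn, destroy_fn) as ipc_handles: ...`, so a reviewer can verify the pairing in one place.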
Description

Related Issues

Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks with `pre-commit run --all-files` and fixed any reported issues.

Tests

- [ ] Tests have been added or updated as needed (e.g., `unittest`, etc.).

Reviewer Notes