benchmark: add allreduce_fusion benchmark #1450
base: main
Conversation
Summary of Changes
Hello @yyihuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I am adding a new benchmark to assess the performance and correctness of the `trtllm_allreduce_fusion` functionality. This benchmark is crucial for validating the efficiency of various all-reduce fusion patterns, including those with integrated residual RMS normalization and different quantization levels (FP8 and FP4). It will run across diverse distributed configurations and data types, providing valuable insights into the operation's behavior under various conditions.
Highlights
- **New Benchmark Addition:** I have introduced a new benchmark file, `benchmarks/bench_trtllm_allreduce_fusion.py`, to evaluate the performance and correctness of the `trtllm_allreduce_fusion` operation.
- **Comprehensive Fusion Pattern Testing:** The benchmark covers a wide range of `AllReduceFusionPattern` types, including standard all-reduce, residual RMS normalization, and various FP8/FP4 quantization schemes, ensuring comprehensive testing of the fusion capabilities.
- **Distributed and Parameterized Testing:** The benchmark is designed to run in a distributed environment, testing different `world_size` (2, 4, 8), `dtype` (float16, bfloat16), and `hidden_dim` (1024 to 8192) configurations, leveraging `torch.distributed` for multi-GPU setups.
- **CUDA Graph Integration:** To ensure accurate and repeatable performance measurements, the benchmark incorporates CUDA graph capture and replay for the `comm.trtllm_allreduce_fusion` calls.
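The capture-and-replay pattern described in the last highlight can be sketched roughly as follows. This is a minimal sketch, not the PR's code: `run_op` is a hypothetical zero-argument callable that launches the fused all-reduce kernel on the current stream using pre-allocated tensors (CUDA graphs require static input/output buffers).

```python
import torch


def capture_graph(run_op, warmup_iters=3):
    """Capture a CUDA graph around `run_op` for repeatable benchmarking.

    `run_op` is a hypothetical zero-arg callable issuing the kernel with
    pre-allocated (static) tensors; the real benchmark would close over the
    actual comm.trtllm_allreduce_fusion arguments.
    """
    # Warm up on a side stream so allocator state stabilizes before capture.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(warmup_iters):
            run_op()
    torch.cuda.current_stream().wait_stream(side_stream)

    # Record a single invocation into the graph; measure with graph.replay().
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        run_op()
    return graph
```

Replaying the captured graph in a timing loop avoids per-launch CPU overhead, which is why the benchmark measures `g.replay()` rather than eager kernel launches.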
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
- Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request adds a new benchmark for `allreduce_fusion`. My review focuses on improving the clarity and correctness of the new benchmark script. I've identified several issues, including misleading variable and function names, the lack of actual performance measurement, a mismatched resource cleanup function, and unresolved TODOs. Addressing these points will make the benchmark more robust and useful.
```python
if test_passed:
    print(
        f"test RANK {rank}: token{token_num}-hidden_dim{hidden_dim}-dtype{dtype}-pattern{pattern_code}-layout{swizzled_layout_code}-pdl{launch_with_pdl} passed"
    )
else:
    print(
        f"test RANK {rank}: token{token_num}-hidden_dim{hidden_dim}-dtype{dtype}-pattern{pattern_code}-layout{swizzled_layout_code}-pdl{launch_with_pdl} failed"
    )
```
The `test_passed` variable (defined on line 88) is always `True`, making the `else` block unreachable. This logic is misleading in a benchmark and should be removed along with the variable definition.

Additionally, a benchmark should ideally measure and report performance. Consider adding timing using `torch.cuda.Event` around the `g.replay()` call to make this benchmark more useful.
```python
print(
    f"Benchmark run finished for RANK {rank}: token{token_num}-hidden_dim{hidden_dim}-dtype{dtype}-pattern{pattern_code}-layout{swizzled_layout_code}-pdl{launch_with_pdl}"
)
```
```python
SCALE_FACTOR_RANGE = (-1, 1)


def _run_correctness_worker(world_size, rank, dtype, hidden_dim, distributed_init_port):
```
The function name `_run_correctness_worker` is misleading for a benchmark file. It suggests a correctness test, but there are no assertions. Please rename it to something like `_run_benchmark_worker` to reflect its purpose.
```diff
-def _run_correctness_worker(world_size, rank, dtype, hidden_dim, distributed_init_port):
+def _run_benchmark_worker(world_size, rank, dtype, hidden_dim, distributed_init_port):
```
```python
    swizzled_layout_code
    == comm.FP4QuantizationSFLayout.SWIZZLED
):
    # TODO(Yingyi): check this
```
```python
finally:
    dist.barrier(group=group)

    comm.trtllm_destroy_ipc_workspace_for_all_reduce(ipc_handles, group=group)
```
The workspace was created with `trtllm_create_ipc_workspace_for_all_reduce_fusion`, but is being destroyed with `trtllm_destroy_ipc_workspace_for_all_reduce`. For consistency and to prevent potential issues if the implementations diverge, you should use the corresponding `trtllm_destroy_ipc_workspace_for_all_reduce_fusion` function.
```python
comm.trtllm_destroy_ipc_workspace_for_all_reduce_fusion(ipc_handles, group=group)
```
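One way to make this class of create/destroy mismatch impossible is to pair the two helpers in a single guard. The following generic context manager is illustrative and not part of the PR; in the benchmark, `create` and `destroy` would be the matching `_fusion` helpers from `flashinfer.comm`.

```python
from contextlib import contextmanager


@contextmanager
def paired_workspace(create, destroy):
    """Yield a workspace handle and guarantee the matching destroy runs.

    `create` and `destroy` are caller-supplied callables (hypothetical here);
    pairing them at one call site prevents mixing, e.g., the
    *_for_all_reduce_fusion create helper with the non-fusion destroy helper.
    """
    handle = create()
    try:
        yield handle
    finally:
        destroy(handle)  # always torn down with the helper it was paired with
```

Usage would then read `with paired_workspace(create_fn, destroy_fn) as ipc_handles: ...`, so a reviewer can verify the pairing in one place.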
Description

Related Issues

Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks with `pre-commit run --all-files` and fixed any reported issues.

Tests

- [ ] Tests have been added or updated as needed (e.g., `unittest`, etc.).

Reviewer Notes