The CPU threading model is distinct from the GPU thread-group model: GPU
shared memory is not shared beyond a single thread group.
Whenever nested parallelism is enabled in the Mullapudi2016
auto-scheduler, always map parallelizable loop dimensions to
`gpu_block`. This can be done by splitting the dimension by a factor of
1, e.g. `f.split(z, zo, zi, 1)`, and marking the outer variable as a GPU
block dimension, as sketched below.
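
As a rough illustration (not taken from this change), the manual equivalent of that schedule looks like the sketch below; the Func `f`, the Vars, and the mapping of the inner dimensions to GPU threads are assumptions made for the example:

```c++
#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f");
    Var x("x"), y("y"), z("z");
    f(x, y, z) = x + y + z;

    // Split z by a factor of 1 so the outer variable zo carries all of
    // z's iterations, then map zo to a GPU block dimension. The inner
    // dimensions are mapped to GPU threads here purely for illustration.
    Var zo("zo"), zi("zi");
    f.split(z, zo, zi, 1)
     .gpu_blocks(zo)
     .gpu_threads(x, y);

    // Print the resulting loop nest; no GPU is needed for this step.
    f.print_loop_nest();
    return 0;
}
```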
This makes the auto-scheduler's per-warp `last_level_cache` estimates
more robust against variations in the nested parallelism.
In the folder `*/apps/`, remove all manual overrides of
`last_level_cache_size` and use the default estimate of 47 kB per thread
group, as in the sketch below.
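
A minimal sketch of what removing the override means for an app's scheduling code, assuming the `AutoschedulerParams`/`apply_autoscheduler` API; the pipeline `p`, the target, and the plugin name are assumptions for illustration:

```c++
#include "Halide.h"
using namespace Halide;

void autoschedule(const Pipeline &p, const Target &gpu_target) {
    // Load the Mullapudi2016 autoscheduler plugin (library name/path
    // depends on how Halide was built).
    load_plugin("autoschedule_mullapudi2016");

    // Before: a manual override of the last-level cache estimate, e.g.
    // AutoschedulerParams params("Mullapudi2016",
    //                            {{"last_level_cache_size", "16777216"}});

    // After: no override; the autoscheduler falls back to its default
    // estimate of roughly 47 kB per GPU thread group.
    AutoschedulerParams params("Mullapudi2016");

    p.apply_autoscheduler(gpu_target, params);
}
```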