Releases: celerity/celerity-runtime

v0.7.0 - Glorious Ginger

18 Aug 15:06

This Celerity release introduces multiple improvements to runtime performance, developer experience, and compatibility. This version requires C++20, and upgrading may also require minor adjustments in buffer access handling and usage of deprecated features.

HIGHLIGHTS

  • Celerity can now be built without MPI for single-node, multi-device setups.
    A single process can manage multiple devices without spawning extra MPI ranks.
  • Tracy integration has been improved, providing clearer warnings for uninitialized reads and better executor starvation reporting.
  • Substantial performance optimizations, including per-device submission threads, thread pinning, and reduced MPI transfer overhead.
  • celerity::distr_queue has been replaced by celerity::queue.
    Multiple instances of celerity::queue are now supported, with behavior more closely aligned with SYCL.
  • Runtime shutdown can now be explicitly controlled via celerity::shutdown().
    This complements celerity::init() for finer control over the runtime lifecycle.
  • Celerity now uses and requires C++20.

Changelog

This release includes changes that may require adjustments when upgrading:

  • Celerity now requires C++20.
  • celerity::distr_queue has been replaced by celerity::queue.
    Multiple instances of celerity::queue are now supported, with behavior more closely aligned with SYCL.
  • Buffer access handling has been refactored: celerity::access_mode is now a dedicated enum.
    Using sycl::access_mode on Celerity buffers is no longer supported.
  • Coordinate-list constructors of access::neighborhood have been deprecated in favor of the range overload.
  • We recommend performing a clean build when updating Celerity to ensure all updated submodule dependencies are properly propagated.
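
The upgrade notes above could translate into a change like the following. This is a minimal migration sketch, assuming Celerity 0.7.0 and a supported SYCL implementation with unnamed kernel lambdas are available; the buffer size, variable names and kernel body are illustrative only.

```cpp
#include <celerity.h>

int main() {
    // Previously: celerity::distr_queue q;
    celerity::queue q;

    // Illustrative one-dimensional buffer of 1024 floats.
    celerity::buffer<float, 1> buf(celerity::range<1>(1024));

    q.submit([&](celerity::handler& cgh) {
        // The access-mode tag comes from the celerity namespace;
        // sycl::access_mode is no longer accepted for Celerity buffers.
        celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
                               celerity::write_only, celerity::no_init};
        cgh.parallel_for(celerity::range<1>(1024), [=](celerity::item<1> item) {
            acc[item] = static_cast<float>(item[0]);
        });
    });

    // Blocks until all submitted work has finished. Celerity warns if this
    // is called excessively, so it appears only once at the end here.
    q.wait();
    return 0;
}
```

Code that still uses celerity::distr_queue should continue to compile for now, since the type is deprecated rather than removed.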

We recommend using the following SYCL versions with this release:

  • DPC++: ad494e9d or newer
  • AdaptiveCpp (formerly hipSYCL): v24.06
  • SimSYCL: master

See our platform support guide for a complete list of all officially supported configurations.

Added

  • Support builds for single-node multi-device setups without MPI by specifying -DCELERITY_ENABLE_MPI=0 in CMake (#282)
  • Add celerity::once tag type for host tasks (equivalent to range<0>{}) as a replacement for on_master_node (#282)
  • Replace celerity::distr_queue with celerity::queue, which permits multiple instances and aligns closer with SYCL (#283)
  • The runtime can be explicitly shut down using celerity::shutdown(), complementing celerity::init() (#283)
  • handler::parallel_for(size_t, [size_t,] ...) now acts as a shorthand for parallel_for(range<1>, [id<1>,] ...) (#288)
  • Experimental support for the AdaptiveCpp generic single-pass compiler (#294)
  • Constructor overloads to the access::neighborhood range mapper for reads in 3/5/7-point stencil codes (#292)
  • The SYCL backend now uses per-device submission threads to dispatch commands for better performance.
    This new behaviour is enabled by default, and can be disabled via CELERITY_BACKEND_DEVICE_SUBMISSION_THREADS (#303)
  • Celerity now has a thread pinning mechanism to control how threads are pinned to CPU cores.
    This can be controlled via the CELERITY_THREAD_PINNING environment variable (#309)

Changed

  • Update Tracy dependency to v0.11.1 (#281)
  • Update libenvpp dependency to 1.5 (#312)
  • Update fmt dependency to 11.1.2 (#328)
  • Update spdlog dependency to HEAD > 1.15.0 (#328)
  • Celerity now requires C++20 (#291)
  • Automatic runtime shutdown, which was previously triggered by the last queue / buffer / host object going out of scope,
    is now postponed until process termination (atexit()). This allows multiple non-overlapping sections of Celerity code
    to execute in the same process (#283)
  • Celerity warns on excessive calls to queue::wait() or distr_queue::slow_full_sync() in a long-running program.
    This operation has a much more pronounced performance penalty than its SYCL counterpart (#283)
  • On systems that do not support device-to-device copies, data is now staged in linearized buffers for better performance (#287)
  • Removed the flush_async workaround for newer ACPP versions, keeping compatibility with older versions (#333)
  • The access::neighborhood built-in range mapper now receives a range instead of a coordinate list (#292)
  • Overhauled the installation and configuration documentation (#309)
  • Celerity will now queue up several command groups in order to combine allocations and elide resize operations.
    This behavior can be influenced using the new experimental::set_lookahead and experimental::flush APIs (#298)
  • Reduced small host-buffer allocations in MPI transfers by accumulating touched boxes during anticipate() (#313)
  • Celerity internals are no longer exposed to users through installed headers (#308)
  • Buffer access_mode is now a dedicated celerity::access_mode enum instead of an alias of sycl::access_mode, simplifying
    the include tree and removing namespace ambiguity. sycl::access_mode can no longer be used with Celerity buffers. (#315)
  • Uninitialized read warnings now provide more helpful information (#321)
  • Improved Tracy integration for executor starvation. Celerity now also prints a warning when execution time exceeds a
    given percentage threshold, indicating that the application might be scheduler-bound (#322)
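
The postponed shutdown described in the list above can be pictured with a small standard-library sketch. It is not Celerity's implementation: every name in the sketch namespace is a hypothetical stand-in, and the only point is that destroying the last queue-like handle leaves the runtime alive, with teardown registered once via std::atexit.

```cpp
#include <cstdlib>

// Hypothetical stand-ins for illustration only; none of these are Celerity types.
namespace sketch {

struct runtime_state {
    bool running = false;
    int startups = 0;
    int shutdowns = 0;
};

inline runtime_state& state() {
    static runtime_state s;
    return s;
}

// Runs at process termination (registered via std::atexit) or when called explicitly.
inline void shutdown() {
    auto& s = state();
    if(!s.running) return; // idempotent: a second call is a no-op
    s.running = false;
    ++s.shutdowns;
}

// Lazily starts the runtime; the exit handler is registered only on the first start.
inline void ensure_started() {
    auto& s = state();
    if(s.running) return;
    if(s.startups == 0) std::atexit(shutdown);
    s.running = true;
    ++s.startups;
}

// A queue-like handle whose destructor deliberately does not stop the runtime.
struct queue {
    queue() { ensure_started(); }
};

// Runs two non-overlapping sections and reports whether the runtime
// stayed alive across both without being restarted.
inline bool run_two_sections() {
    { queue q1; }
    const bool alive_between = state().running;
    { queue q2; }
    return alive_between && state().running && state().startups == 1;
}

} // namespace sketch
```

In this sketch an explicit sketch::shutdown() call plays the role that celerity::shutdown() plays in the release notes: it tears the runtime down early, and the handler registered with std::atexit then has nothing left to do.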

Fixed

  • Host-initialized buffers will not read from user-provided memory after the last reference to the buffer has been dropped (#283)
  • Fix a build issue on macOS where moving a std::function did not clear the source, causing failing test cases (#285)
  • Fix a path hint for finding AdaptiveCpp when using an installed Celerity (#286)
  • Fix a race condition in unit tests by updating last_epoch_reached before signalling the epoch promise, ensuring proper synchronization (#307)
  • Fix a build issue with (rare) configurations which enable both Tracy and OOB-checks (#331)

Deprecated

  • celerity::distr_queue is deprecated in favor of celerity::queue (#283)
  • The coordinate-list constructors of access::neighborhood are deprecated in favor of the range overload (#292)

Internal

  • Command graphs generate a single "fat" push command instead of a separate push for each write and target node. (#290)
  • Event polling now only happens for instructions that are actively executing (#293)
  • Task management now uses epoch-based structures, removes the ring buffer size limit, and handles tasks via
    stable pointers, simplifying scheduler and application thread interactions (#295)
  • Command graph now uses command instead of abstract_command, moves CDAG-related pruning to the scheduler,
    and maintains command pointers in the CDAG generator (#297)
  • buffer_access_map now works in terms of consumed and produced regions instead of access modes.
    This includes various related improvements to task requirements, execution ranges, and graph printing (#300)
  • Use region_map::update_box instead of update_region where applicable (#302)
  • Improved "system" benchmarks to better capture effects that are highly significant in real-world workloads (#304)
  • Unified thread code, with a single source of truth for thread names and Tracy thread ordering (#310)
  • Optimize perform_task_buffer_accesses to skip redundant last-writers updates and transpose loops,
    yielding minor performance improvements in scheduler-bound workloads (#317)
  • The SimSYCL workaround for thread safety has been removed (#318)
  • Prevent unbounded growth in receive_arbiter by caching active transfers (#319)
  • Centralize definition of Tracy colors (#320)
  • Change split functions to work on box instead of chunk (#323)
  • Align await-pushes with pushes by computing the union of regions for remote chunks executed on the same node (#324)
  • Celerity now uses SYCL_IS_* macros instead of defined(__SYCL_COMPILER_VERSION) for checking the SYCL version (#329)
  • Removed internal branches on CELERITY_FEATURE_UNNAMED_KERNELS, which now only exists for backwards compatibility in
    applications (#329)

v0.6.0 - Fantastic Fennel

12 Aug 12:56

This release includes major overhauls to many of Celerity's core internals, improving performance and debuggability and laying the groundwork for future optimizations.

HIGHLIGHTS

  • Celerity now supports SimSYCL, a SYCL implementation focused on debugging and verification (#238).
  • Multiple devices can now be managed by a single Celerity process, which allows for more efficient device-to-device communication (#265).
  • The Celerity runtime can now be configured to log detailed tracing events for the Tracy hybrid profiler (#267).
  • Reductions are now supported across all SYCL implementations (#265).
  • The new experimental::hints::oversubscribe hint can be used to improve computation-communication overlapping (#249).
  • API documentation is now available, generated by 🥬doc.

Changelog

This release includes changes that may require adjustments when upgrading:

  • A single Celerity process can now manage multiple devices.
    This means that on a cluster with 4 GPUs per node, only a single MPI rank needs to be spawned per node.
  • The previous behavior of having a separate process per device is still supported but discouraged, as it incurs additional overhead.
  • It is no longer possible to assign a device to a Celerity process using the CELERITY_DEVICES environment variable.
    Please use vendor-specific mechanisms (such as CUDA_VISIBLE_DEVICES) for limiting the set of visible devices instead.
  • We recommend performing a clean build when updating Celerity so that updated submodule dependencies are properly propagated.

We recommend using the following SYCL versions with this release:

  • DPC++: 89327e0a or newer
  • AdaptiveCpp (formerly hipSYCL): v24.06
  • SimSYCL: master

See our platform support guide for a complete list of all officially supported configurations.

Added

  • Add support for SimSYCL as a SYCL implementation (#238)
  • Extend compiler support to GCC (optionally with sanitizers) and C++20 code bases (#238)
  • celerity::experimental::hints::oversubscribe can be passed to a command group to increase split granularity and improve computation-communication overlap (#249)
  • Reductions are now unconditionally supported on all SYCL implementations (#265)
  • Add support for profiling with Tracy, via CELERITY_TRACY_SUPPORT and environment variable CELERITY_TRACY (#267)
  • The active SYCL implementation can now be queried via CELERITY_SYCL_IS_* macros (#277)

Changed

  • All low-level host / device operations such as memory allocations, copies, and kernel launches are now represented in the single Instruction Graph for improved asynchronicity (#249)
  • Celerity can now maintain multiple disjoint backing allocations per buffer, so disjoint accesses to the same buffer do not trigger bounding-box allocations (#249)
  • The previous implicit size limit of 128 GiB on buffer transfers is lifted (#249, #252)
  • Celerity now manages multiple devices per node / MPI rank. This significantly reduces overhead in multi-GPU setups (#265)
  • Runtime lifetime is extended until destruction of the last queue, buffer, or host object (#265)
  • Host object instances are now destroyed from a runtime background thread instead of the application thread (#265)
  • Collective host tasks in the same collective group continue to execute on the same communicator, but not necessarily on the same background thread anymore (#265)
  • Updated the internal libenvpp dependency to 1.4.1 and use its new features (#271)
  • Celerity's compile-time feature flags and options are now written to version.h instead of being passed on the command line (#277)

Fixed

  • Scheduler tracking structures are now garbage-collected after buffers and host objects go out of scope (#246)
  • The previous requirement to order accessors by access mode is lifted (#265)
  • SYCL reductions to which only some Celerity nodes contribute partial results would read uninitialized data (#265)

Removed

  • Celerity no longer attempts to spill device allocations to the host if resizing buffers fails due to an out-of-memory condition (#265)
  • The CELERITY_DEVICES environment variable is removed in favor of platform-specific visibility specifiers such as CUDA_VISIBLE_DEVICES (#265)
  • The obsolete experimental::user_benchmarker infrastructure has been removed (#268).

v0.5.0 - Enchanting Elderberry

21 Dec 14:36

Right on time for the holidays we bring you a new major release with several new features, quality of life improvements and debugging facilities.

Thanks to everybody who contributed to this release: @fknorr, @GagaLP, @PeterTh, @psalz!

HIGHLIGHTS

  • The distr_queue::fence and buffer_snapshot APIs introduced in Celerity 0.4.0 are now stable (#225).
  • In some situations it may be necessary to prevent kernels from being split in a certain way (for example to prevent overlapping writes); this can now be achieved using the new experimental::constrain_split API (#212).
  • Speaking of splits, the new experimental::hint API can be used to control how a kernel is split across worker nodes (#227).
  • Celerity now warns at runtime when a task declares reads from uninitialized buffers or writes with overlapping ranges between nodes (#224).
  • The accessor out-of-bounds detection first introduced in Celerity 0.4.0 now also supports host tasks (#211).

Changelog

We recommend using the following SYCL versions with this release:

  • DPC++: 61e51015 or newer
  • hipSYCL: d2bd9fc7 or newer

Added

  • Add new environment variable CELERITY_PRINT_GRAPHS to control whether task and command graphs are logged (#197, #236)
  • Introduce new experimental for_each_item utility to iterate over a celerity range (#199)
  • Add new environment variables CELERITY_HORIZON_STEP and CELERITY_HORIZON_MAX_PARALLELISM to control Horizon generation (#199)
  • Add support for out-of-bounds checking for host accessors (also enabled via CELERITY_ACCESSOR_BOUNDARY_CHECK) (#211)
  • Add new debug::set_task_name utility for naming tasks to aid debugging (#213)
  • Add new experimental::constrain_split API to limit how a kernel can be split (#212)
  • Add GDB pretty-printers for common Celerity types (#207)
  • distr_queue::fence and buffer_snapshot are now stable, subsuming the experimental:: APIs of the same name (#225)
  • Celerity now warns at runtime when a task declares reads from uninitialized buffers or writes with overlapping ranges between nodes (#224)
  • Introduce new experimental::hint API for providing the runtime with additional information on how to execute a task (#227)
  • Introduce new experimental::hints::split_1d and experimental::hints::split_2d task hints for controlling how a task is split into chunks (#227)
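
To illustrate what limiting a split means, the following standalone sketch divides a one-dimensional range so that every chunk size is a multiple of a given constraint. It is not Celerity's splitting code: the chunk type and split_1d_constrained function are hypothetical, and the sketch assumes the global size is evenly divisible by the constraint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunk description: a contiguous sub-range of the global range.
struct chunk {
    std::size_t offset;
    std::size_t size;
};

// Splits [0, global_size) into at most num_chunks chunks whose sizes are
// multiples of `constraint`. Assumes constraint > 0, num_chunks > 0 and
// global_size % constraint == 0 (assumptions of this sketch).
std::vector<chunk> split_1d_constrained(std::size_t global_size, std::size_t constraint, std::size_t num_chunks) {
    assert(constraint > 0 && num_chunks > 0 && global_size % constraint == 0);
    const std::size_t blocks = global_size / constraint; // indivisible units
    const std::size_t n = std::min(num_chunks, blocks);  // never emit empty chunks
    std::vector<chunk> result;
    std::size_t offset = 0;
    for(std::size_t i = 0; i < n; ++i) {
        // Spread the remainder blocks over the first chunks.
        const std::size_t chunk_blocks = blocks / n + (i < blocks % n ? 1 : 0);
        result.push_back({offset, chunk_blocks * constraint});
        offset += chunk_blocks * constraint;
    }
    return result;
}
```

With a constraint of 64, a range of 1024 items split three ways yields chunks of 384, 320 and 320 items, so no chunk boundary falls inside a block of 64.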

Changed

  • Horizons can now also be triggered by graph breadth. This improves performance in some scenarios, and prevents programs with many independent tasks from running out of task queue space (#199)

Fixed

  • In edge cases, command graph generation would fail to generate await-push commands when re-distributing reduction results (#223)
  • Command graph generation was missing an anti-dependency between push-commands of partial reduction results and the final reduction command (#223)
  • Don't create multiple smaller push-commands instead of a single large one in some rare situations (#229)
  • Unit tests that inspect logs contained a race that would cause spurious failures (#234)

Internal

  • Improve command graph testing infrastructure (#198)
  • Overhaul internal grid region and box representation, remove AllScale dependency (#204)

v0.4.1

08 Sep 10:56

This is a small bugfix release primarily restoring Celerity's compatibility with the most recent versions of hipSYCL and DPC++.

Changelog

We recommend using the following SYCL versions with this release:

  • DPC++: 61e51015 or newer
  • hipSYCL: d2bd9fc7 or newer

See our platform support guide for a complete list of all officially supported configurations.

Fixed

  • Fix the behavior of dry runs (CELERITY_DRY_RUN_NODES) in the presence of fences or graph horizons (#196, 069f502)
  • Compatibility with recent hipSYCL >= d2bd9fc7 (#200, b174df7)
  • Compatibility with recent versions of Intel oneAPI and Arc-series dedicated GPUs (requires deactivating mimalloc, #203, c151962)
  • Work around a bug in DPC++ that breaks selection of the non-default device (#210, 2b652f8)

Removed

  • Remove outdated workarounds for unsupported SYCL versions (#200, 85b7479)

v0.4.0 - Delightful Daikon

13 Jul 17:46

We are back with a major release that touches all aspects of Celerity, bringing considerable improvements to its APIs, usability and performance.

Thanks to everybody who contributed to this release: @almightyvats @BlackMark29A @facuMH @fknorr @PeterTh @psalz!

HIGHLIGHTS

  • Celerity 0.4.0 uses a fully distributed scheduling model replacing the old master-worker approach. This improves the scheduling complexity of applications with all-to-all communication from O(N^2) to O(N), solving a central scaling bottleneck for many Celerity applications (#186).
  • Objects shared between multiple host_tasks, such as file handles for I/O operations, can now be explicitly managed by the runtime through a new experimental declarative API: A host_object encapsulates arbitrary host-side objects, while side_effects are used to read and/or mutate them, analogously to buffer and accessor. Embracing this new pattern will guarantee correct lifetimes and synchronization around these objects. (#68).
  • The new experimental fence API allows accessing buffer and host-object data from the main thread without manual synchronization and reimagines SYCL's host accessors in a way that is more compatible with Celerity's asynchronous execution model (#151).
  • The new CMake option CELERITY_ACCESSOR_BOUNDARY_CHECK can be set to enable out-of-bounds buffer access detection at runtime inside device kernels to detect errors such as incorrectly-specified range-mappers, at the cost of some runtime overhead. This check is enabled by default for debug builds of Celerity (#178).
  • Celerity now expects buffers (and the new host-objects) to be captured by reference into command group functions, where it previously required by-value captures. This is in accordance with SYCL 2020 and removes one common source of user errors (#173).
  • Last but not least, several significant performance improvements make Celerity even more competitive for real-world HPC applications (#100, #111, #112, #115, #133, #137, #138, #145, #184).

Changelog

We recommend using the following SYCL versions with this release:

  • DPC++: 61e51015 or newer
  • hipSYCL: 24980221 or newer

See our platform support guide for a complete list of all officially supported configurations.

Added

  • Introduce new experimental host_object and side_effect APIs to express non-buffer dependencies between host tasks (#68, 7a5326a)
  • Add new CELERITY_GRAPH_PRINT_MAX_VERTS config option (#80, d3dd722)
  • Named threads for better debugging (#98, 25d769d, #131, ff5fbec)
  • Add support for passing device selectors to distr_queue constructor (#113, 556b6f2)
  • Add new CELERITY_DRY_RUN_NODES environment variable to simulate the scheduling of an application on a large number of nodes (without execution or data transfers) (#125, 299ebbf)
  • Add ability to name buffers for debugging (#132, 1076522)
  • Introduce experimental fence API for accessing buffer and host-object data from the main thread (#151, 6b803f8)
  • Introduce backend system for vendor-specific code paths (#162, 750f32a)
  • Add CELERITY_USE_MIMALLOC CMake configuration option to use the mimalloc allocator (enabled by default) (#170, 234e3d2)
  • Support 0-dimensional buffers, accessors and kernels (#163, 0685d94)
  • Introduce new diagnostics utility for detecting erroneous reference captures into kernel functions, as well as unused accessors (#173, ff7ed02)
  • Introduce CELERITY_ACCESSOR_BOUNDARY_CHECK CMake option to detect out-of-bounds buffer accesses inside device kernels (enabled by default for debug builds) (#178, 2c738c8)
  • Print more helpful error message when buffer allocations exceed available device memory (#179, 79f97c2)

Changed

  • Update spdlog to 1.9.2 (#80, a178828)
  • Overhaul logging mechanism (#80, 1b19bfc)
  • Improve graph dependency tracking performance (#100, c9dab18)
  • Improve task lookup performance (#112, 5139256)
  • Introduce epochs as a mechanism for in-graph synchronization (#86, 61dd07e)
  • Miscellaneous performance improvements (#115, 9a099d2, #137, b0254fd, #138, 02258c0, #145, f0b53ce)
  • Improve scheduler performance by reducing lock contention (#111, 4547b5f)
  • Improve graph generation and printing performance (#133, 8122798)
  • Use libenvpp to validate all CELERITY_* environment variables (#158, b2ced9b)
  • Use native ("USM") pointers instead of SYCL buffers for backing buffer allocations (#162, 44497b3)
  • Implement range and id types instead of aliasing SYCL types (#163, 0685d94)
  • Disallow in-source builds (#176, 0a96d15)
  • Lift restrictions on reductions for DPC++ (#175, efff21b)
  • Remove multi-pass mechanism to allow reference capture of buffers and host-objects into command group functions, in alignment with the SYCL 2020 API (#173, 0a743c7)
  • Drastically improve performance of buffer data location tracking (#184, adff79e)
  • Switch to distributed scheduling model (#186, 0970bff)

Deprecated

  • Passing sycl::device to distr_queue constructor (use a device selector instead) (#113, 556b6f2)
  • Capturing buffers and host objects by value into command group functions (capture by reference instead) (#173, 0a743c7)
  • allow_by_ref is no longer required to capture references into command group functions (#173, 0a743c7)

Removed

  • Removed support for ComputeCpp (discontinued) (#167, 68367dd)
  • Removed deprecated host_memory_layout (use buffer_allocation_window instead) (#187, f5e6510)
  • Removed deprecated kernel dimension template parameter on one_to_one, fixed and all range mappers (#187, 40a12a4)
  • Kernels can no longer receive sycl::item (use celerity::item instead); this was already broken in 0.3.2 (#163, 67ccacc)

Fixed

  • Improve performance for buffer transfers on IBM Spectrum MPI (#114, c60527f)
  • Increase size limit on individual buffer transfer operations from 2 GiB to 128 GiB (#153, 972682f)
  • Fix race between creating collective groups and submitting host tasks (#152, 0a4fca5)
  • Align read-accessor operator[] with SYCL 2020 spec by returning const-reference instead of value (#156, 5011ded)

Internal

v0.3.2

17 Feb 11:27

This release fixes several bugs discovered since the v0.3.1 release. It also improves SYCL backend support and adds minor debugging features.

Added

  • Add support for ComputeCpp 2.7.0 and 2.8.0 with stable and experimental compilers. (2831b2a)
  • Add support for using local memory with ComputeCpp. (8e2fce4)
  • Print Celerity version upon runtime startup. (0681c16)
  • Print warning when too few logical cores are available. (113e688)

Fixed

  • Fix race condition around reference-capture in matmul example. (76f49c9)
  • Reduce hardware requirements for maximum work-group size in tests. (008a868, f0cf3f4)
  • Update Catch2 submodule to v2.13.8 as a bugfix. (26ca089)
  • Do not create empty chunks when splitting tasks with a small execution range in dimension 0. (15fa929)
  • Correctly handle empty buffers and buffer requirements with empty ranges. (ad99522)
  • Suppress unhelpful deprecation warnings around sycl::atomic from DPC++. (39dacdf)
  • Throw when submitting compute tasks with an empty execution range instead of accepting SYCL misbehavior. (baa242a)

v0.3.1

04 Jan 19:00

This release contains several fixes for bugs discovered since the v0.3.0 release.

Fixed

  • Remove blanket-statement error message upon early buffer deallocation, which in many cases is now legal. (6851145)
  • Properly apply horizons to collective host task order-dependencies. (4488724)
  • Avoid race between horizon task generation and horizon command execution. (f670868)
  • Fix data race in task_manager::notify_horizon_executed (only in debug builds). (f641bcb)
  • Don't rely on static destruction order for user_benchmarker. (d1c9e51)
  • Restructure wave_sim example to avoid host side race condition for certain --sample-rate configurations. (d226b95)
  • Hard-code paths for CMake dependencies in installed Celerity config to avoid mismatches. (4e88657)

v0.3.0 - Crunchy Celery

16 Nov 22:35

The 0.3.0 release brings with it many new features and improvements, including several changes that further align Celerity with the new SYCL 2020 API.

See the changelog for a full summary of all changes.

HIGHLIGHTS

  • Intel's DPC++ joins hipSYCL and ComputeCpp as the third SYCL implementation that is officially supported by Celerity.
  • Celerity now has support for SYCL 2020-style scalar reductions! See our documentation for more information on how to get started.
  • Changes in the SYCL 2020 API allow us to finally support nd-range parallel_for alongside local_accessor and group functions.
  • We are introducing aliases or custom implementations into the celerity namespace for most supported SYCL features (e.g. celerity::access_mode, celerity::item and so on). This means that Celerity code no longer has to mix names from the sycl and celerity namespaces. While still possible for the most part, we recommend sticking to celerity::!
  • Accessors can now also be created using SYCL 2020-style CTAD constructors: celerity::accessor acc{my_buffer, cgh, celerity::access::one_to_one{}, celerity::write_only}.

Note: We recommend using the following SYCL versions with this release:

  • ComputeCpp: 2.6.0 or newer
  • DPC++: 7735139 or newer
  • hipSYCL: 7b00e2e or newer

v0.2.1

09 Sep 15:57

This release contains a few fixes for bugs introduced in version 0.2.0.

Fixed

  • Re-enable ComputeCpp workaround for explicit copy operations. (f58b146)
  • Fix compilation on Windows by avoiding the TRUE literal as enum value. (8de922e)
  • Fix compilation with Boost < 1.67 by using backwards compatible header. (a51c98a)

Minor Version, Major Update

04 Sep 15:44

It's been a while since our last release, but we have been busy!

See the changelog for a full summary of all changes.

HIGHLIGHTS

  • The somewhat clunky with_master_access has been replaced by a much more powerful host_task API, which can schedule host-side code both on the master node and on worker nodes. It also features experimental support for integration with collective operations, such as parallel HDF5 I/O.
  • Celerity buffers are now fully virtualized, meaning that each worker can take full advantage of the available device memory for its local workload.

See our docs on how to get started with Celerity (available either as markdown or on our website).

Note: If you are using hipSYCL, make sure to use the newest version from the develop branch.