@colluca colluca commented Dec 17, 2025

This PR includes all the developments made for the MLSys paper: 1) the hardware extensions required to support (performant) multicast and reduction, 2) software benchmarks and tests, and 3) an experiment framework for Picobello, derived from Snitch's, used to develop the experiments for the paper.

In detail:

  • Bump Snitch w/ support for reduction and DCA.
  • Bump AXI to enable rerouting all collective communications outside the cluster.
  • Bump FlooNoC w/ support for VCs, reductions and DCA.
  • Bump common_cells, iDMA and LLVM toolchain to align with new versions in previous IPs.
  • Update Snitch's configuration, with clearer fields for enabling narrow and wide collectives and setting the AWUSER width.
  • Pass cluster base offset to Snitch, needed internally to calculate end address of its address space.
  • Add path to experiments in PYTHONPATH, alternative to creating a proper Picobello Python package.
  • Install Snitch Python package in editable mode (useful when Snitch is cloned with Bender and being modified, and because why not?).
  • Simplify build of Picobello's software tests by reusing Snitch's Make rules.
  • Enable running simulations in directories other than PB_ROOT (this is IMO the best method for ensuring all simulation artifacts are collected under a different directory).
  • Create a proper Picobello accelerator runtime/library. src/ contains potentially reusable (across 2D tile-based accelerators) sources, impl/ contains a Picobello-specific implementation of the runtime/library (providing a Picobello-specific HAL and stitching together a Picobello-specific selection of the reusable sources, including Snitch's).
  • Implement a communicator-based (inspired by MPI) collective communication API (sync) for 2D tile-based accelerators.
  • Extend the team API.
  • Add barrier_benchmark.c, reduction_benchmark.c and dma_multicast_v2.c benchmarks.
  • Test that multiple outstanding barriers that overlap in their participating clusters function correctly (overlapping_barriers.c). As expected, this test failed before this feature was supported.
  • Stress test row, column and global barriers happening in rapid succession (parallel_row_col_barriers.c).
  • Alias Snitch targets for visual trace generation (remove sn- prefix).
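
For reviewers unfamiliar with the communicator idea, here is a minimal host-side sketch of how row, column and global communicators over a 2D cluster grid can be represented, and what "overlapping in their participating clusters" means. It is not the API added in this PR: the 4x4 grid size, the `comm_t` type and all `comm_*` function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 4x4 cluster grid; the actual Picobello dimensions may differ. */
#define GRID_ROWS 4u
#define GRID_COLS 4u

/* Hypothetical communicator: one bit per participating cluster. */
typedef struct {
    uint32_t members;
} comm_t;

/* Bit position of the cluster at (row, col), in row-major order. */
static inline uint32_t cluster_bit(uint32_t row, uint32_t col) {
    assert(row < GRID_ROWS && col < GRID_COLS);
    return UINT32_C(1) << (row * GRID_COLS + col);
}

/* Communicator spanning all clusters in one row. */
static inline comm_t comm_row(uint32_t row) {
    comm_t c = {0};
    for (uint32_t col = 0; col < GRID_COLS; col++)
        c.members |= cluster_bit(row, col);
    return c;
}

/* Communicator spanning all clusters in one column. */
static inline comm_t comm_col(uint32_t col) {
    comm_t c = {0};
    for (uint32_t row = 0; row < GRID_ROWS; row++)
        c.members |= cluster_bit(row, col);
    return c;
}

/* Communicator spanning the whole grid. */
static inline comm_t comm_global(void) {
    comm_t c = {0};
    for (uint32_t row = 0; row < GRID_ROWS; row++)
        c.members |= comm_row(row).members;
    return c;
}

/* True if the cluster at (row, col) participates in the communicator. */
static inline bool comm_contains(comm_t c, uint32_t row, uint32_t col) {
    return (c.members & cluster_bit(row, col)) != 0;
}

/* True if two communicators share at least one cluster. */
static inline bool comm_overlap(comm_t a, comm_t b) {
    return (a.members & b.members) != 0;
}

/* Number of participating clusters. */
static inline uint32_t comm_size(comm_t c) {
    uint32_t n = 0;
    for (uint32_t m = c.members; m != 0; m &= m - 1)
        n++;
    return n;
}
```

In this sketch, `comm_row(1)` and `comm_col(2)` overlap in exactly one cluster, (1, 2), which is the kind of situation that overlapping_barriers.c exercises, while two distinct rows never overlap.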

TODOs

  • Wait for VCs and reduction to be merged in FlooNoC (then reset this PR to Picobello's link-exploration branch)
  • Wait for dual PC on eject port to be re-implemented, and merged in FlooNoC
  • Auto-generate NoC configuration for reductions
  • Bump to proper peakrdl-rawheader release
  • Rebase this PR on Picobello's main
  • What to do with reduction_benchmark_hyperbank.c?
  • What to do with summa_gemm.c and gemm_2d.c implementations?

colluca and others added 30 commits September 26, 2025 18:57