Why Hespas

One Workload Representation, Multiple Backends

ML performance prediction is a multi-dimensional, cross-stack problem. Empirical evaluation across all combinations of models, compilers, and hardware is too costly. Meanwhile, the current simulation space is fragmented:

Different simulators operate at different fidelity levels
Different hardware architectures require different tools
Different workload abstractions make cross-validation difficult

This means workloads are reimplemented or approximated per tool, which makes comparison across hardware architectures less meaningful.

Hespas takes a different approach: a single StableHLO workload representation that works across a range of backends.

../_images/stablehlo_across_frameworks.svg — StableHLO: stable, framework and hardware agnostic workload representation

A StableHLO workload exported from a production ML framework such as JAX captures the actual computation graph, including collective communication for distributed training. From this single representation, Hespas can:

Profile on real hardware — compile and execute via XLA or IREE to obtain measured runtimes on GPUs.
Estimate analytically — apply a roofline model for fast, hardware-free exploration.
Simulate — feed into architectural simulators like COCOSSim or ONNXim for detailed bottleneck analysis.

The workload stays the same across all fidelity levels.
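
To make the analytical path more concrete, the following is a minimal sketch of a roofline bound for a single matrix multiplication. It is not Hespas's estimator: the `Device` class, the `roofline_time` and `matmul_cost` helpers, and the hardware numbers are all hypothetical and chosen only for illustration. The sketch assumes each operand is read once and the result written once, and takes the estimated runtime as the larger of the compute time and the memory-traffic time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical hardware description; the numbers used below are illustrative."""
    peak_flops: float     # peak compute throughput in FLOP/s
    mem_bandwidth: float  # peak memory bandwidth in bytes/s


def roofline_time(flops: float, bytes_moved: float, dev: Device) -> float:
    """Lower-bound runtime in seconds: the slower of compute and memory traffic."""
    compute_time = flops / dev.peak_flops
    memory_time = bytes_moved / dev.mem_bandwidth
    return max(compute_time, memory_time)


def matmul_cost(m: int, k: int, n: int, dtype_bytes: int = 2) -> tuple[float, float]:
    """FLOPs and minimum bytes moved for C[m, n] = A[m, k] @ B[k, n]."""
    flops = 2.0 * m * k * n  # one multiply and one add per multiply-accumulate
    # Assumes A and B are each read once and C is written once.
    bytes_moved = float(dtype_bytes * (m * k + k * n + m * n))
    return flops, bytes_moved


# Made-up accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
dev = Device(peak_flops=100e12, mem_bandwidth=1e12)
big_flops, big_bytes = matmul_cost(4096, 4096, 4096)
estimate = roofline_time(big_flops, big_bytes, dev)
print(f"estimated lower bound: {estimate * 1e6:.1f} us")
```

Because the bound depends only on operation counts and two hardware parameters, it can be evaluated for many candidate devices without running anything, which is what makes this fidelity level suitable for fast exploration.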

StableHLO

../_images/workload_repr_ecosystem.png — Workload abstraction levels

Unlike configuration-based descriptions that target narrow workload classes, or trace-based formats that require prior execution, StableHLO offers several advantages:

A real, compiler-compatible IR exported from production frameworks
Explicit collective communication operators for distributed training
Compiler optimizations can be applied before estimation
Framework-agnostic: works with JAX, PyTorch/XLA, and other OpenXLA frontends
Ahead-of-time: the workload can be obtained without access to target hardware

Hespas currently uses JAX as the primary frontend for exporting StableHLO workloads. JAX is particularly well suited for this purpose:

SPMD-first by default — distributed parallelism is a core part of the programming model
One global, static program — the entire computation is expressed as a single program
Static parallelism via XLA and StableHLO — parallelization decisions are made at compile time, producing a deterministic workload representation that is easy to analyze
TPU support — enables cross-architecture studies alongside GPUs
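
As a rough illustration of the ahead-of-time property, the sketch below lowers a toy JAX function to StableHLO text using only abstract shapes and dtypes. It uses JAX's general lowering API and is not necessarily how Hespas exports workloads; the `mlp_layer` function and the tensor shapes are hypothetical.

```python
import jax
import jax.numpy as jnp


def mlp_layer(x, w, b):
    """Toy workload: one dense layer followed by a tanh activation."""
    return jnp.tanh(jnp.dot(x, w) + b)


# Abstract inputs: only shapes and dtypes are needed, so no concrete data
# and no target accelerator are required at this point.
x = jax.ShapeDtypeStruct((8, 128), jnp.float32)
w = jax.ShapeDtypeStruct((128, 64), jnp.float32)
b = jax.ShapeDtypeStruct((64,), jnp.float32)

# Trace and lower the function without compiling or executing it.
lowered = jax.jit(mlp_layer).lower(x, w, b)

# Textual MLIR of the lowered module (StableHLO dialect in recent JAX releases).
stablehlo_text = lowered.as_text()
print(stablehlo_text)
```

The printed module is a static description of the computation, so the same text can in principle be handed to different backends without re-tracing the model.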