Halide does something similar, but with C++ as its DSL language: https://halide-lang.org/
The Exo docs mention Halide as an example of a language similar to Exo, but one that is "lowering-based" while Exo is "rewrite-based." This seems to mean that Halide is more of a DSL where what you want is specified at a higher level, while Exo is a set of transformations you can apply to an existing kernel of code (though at least in the examples the kernel is written in Python and C is then generated from it after the transformations are applied). Do you know what the relative strengths of these two approaches are?
Also, are there any languages in this vein that take more of a declarative (as opposed to imperative) approach?
I have mostly a passing interest in the topic, but from what I understand Halide first lowers to an intermediate language and then does the scheduling and optimisation, while Exo does the optimisations directly on the Python code. Also, Halide needs you to tell it how to lower the code and how to do the scheduling, which is something Exo can determine itself.
Neat idea, but why is this implemented in Python instead of a more suitable language? Also, there are many DSLs and APIs already for this, some even supported by industries and provided as part of the SDKs for various hardware blobs, so why Exo?
Like, I hate SYCL, Kokkos, and all the other similar things as much as the next guy, but major companies and orgs support and use those already.
I'm not the target audience but the GitHub and website getting started page feel so poorly explained. What the hell is a schedule?
This MIT article covers it a bit more (with a slightly too generic title): "High-performance computing, with much less code" https://news.mit.edu/2025/high-performance-computing-with-mu... (https://news.ycombinator.com/item?id=43357091)
Word "schedule" is already taken for thread scheduling in the kernel, so reuse of that word is confusing. This is a code generator that operates on nested loops - allows to reorder and split them, replace instructions etc. All to maximize performance.
A schedule is the order in which machine instructions get executed.
So, I've done this professionally (written assembler code, and then scheduled it manually to improve performance). Normally you don't need to do that these days, as even mobile CPUs use out-of-order cores which dynamically schedule at runtime.
It's only going to be useful if you're writing code for some machine that doesn't do that (they give examples of TPU etc)
> out-of-order cores which dynamically schedule at runtime.
OOO architectures don't reschedule dynamically - that's impossible - they just have multiple instruction buffers that can issue the instructions. So scheduling is still important for OOO; it's just at the level of the DDG (data dependence graph) instead of the literal linear order in the binary.
Edit: just want to emphasize
> It's only going to be useful if you're writing code for some machine that doesn't do that
There is no architecture for which instruction scheduling isn't crucial.
> There is no architecture for which instruction scheduling isn't crucial.
In my experience doing back-end compiler work, it's definitely last on the list of major concerns. Obviously you can't ignore it, but it's not where any significant gains are coming from, and anything 'fancy' you do there is likely to be picked up by future generations of hardware.
> but it's not where any significant gains are coming from
i have no clue what you're saying here - if your scheduler fucks up your pipelining you're gonna have a bad time (conversely if your scheduler recognizes pipelining opportunities you're gonna have a good time). as usual anyone who says something like "in my experience it's not a big deal" simply does not have enough experience.
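Concretely, the textbook case: a reduction with one accumulator serializes the whole loop, and splitting it into independent chains is what lets the hardware pipeline it. Toy Python sketch of the transformation (the actual win only shows up in compiled code):

    def dot_serial(xs, ys):
        # One accumulator: every multiply-add depends on the previous one,
        # so the dependency chain is as long as the loop and nothing pipelines.
        acc = 0.0
        for x, y in zip(xs, ys):
            acc += x * y
        return acc

    def dot_two_chains(xs, ys):
        # Two independent accumulators: two chains of half the length,
        # which compiled code can issue in an overlapped fashion.
        acc0 = acc1 = 0.0
        for i in range(0, len(xs) - 1, 2):
            acc0 += xs[i] * ys[i]
            acc1 += xs[i + 1] * ys[i + 1]
        if len(xs) % 2:                # tail element
            acc0 += xs[-1] * ys[-1]
        return acc0 + acc1             # same value, up to FP reassociation

    assert dot_serial([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) == \
           dot_two_chains([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])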
Yes, clearly we should have spent more time trying to coax our cpus into fucking up pipelining.
If you're talking about modifying the DDG, I would not call that scheduling. Because then you need to do serious work to prove that your code is actually doing the same thing. But I haven't spent a lot of time in the compiler world, so maybe they do call it that. Perhaps you could give your definition?
Needing to do serious work would correlate with Exo having multiple publications about its design. It's a broader sense of scheduling than "reorder trivially-independent instructions", but, if you remove the "trivially-" part and replace "instructions" with "operations", it's the same concept of doing transformations that only move around when certain things happen.
https://github.com/exo-lang/exo/blob/main/examples/avx2_matm...
I personally am not convinced.
Agreed. It looks like if you need to optimize it would be much easier to just modify the code directly. The result will also be more readable and therefore easier to support in the future.
You guys might want to explain what this is a little better. My first question was “what do they mean by ‘hardware accelerators’?”
It wasn't until I saw the matmul example that I realized this is (probably, it's still unclear) for GPUs.
Probably any hardware that requires you to use some intermediate code. Nowadays probably some TPU or NPU or whatever.
This made me think of what "accelerators" I've come across (just as a regular consumer):
In the late 90s, that's what we called GPUs - "3D accelerators". For a short time, they were cards separate from the actual graphics card, and often would involve routing your VGA cable through that thing. Before it all merged into one device. I was very slightly disappointed as a kid that I narrowly missed that time, but bilinear filtering and higher resolutions and framerates helped me get over the fact I couldn't cram more cool weird tech into my PCI slots.
Then you had sound cards with "audio accelerators" using technologies like EAX. All that eventually migrated back to the CPU I think.
For a while you could buy a "physics accelerator" for PhysX, which was then acquired by Nvidia and moved to the GPU using CUDA. I never had one, but one time I kept around an older GPU after upgrading as a dedicated PhysX processor. Now that's the only way to run older 32-bit games with PhysX turned up, since that's not supported on 5000-series GPUs.
Finally, I got this "Coral TPU", a USB device with 4GB RAM (and I think around 4 TOPS or something?) for very efficient inferencing. There are some open source projects supporting this, like frigate, which lets you process object detection with multiple surveillance camera streams on a raspberry pi. I never really used it though.
And of course now we have NPUs as sub-systems in CPUs and GPUs. I'd love to have a dedicated PCIe card with that, but of course having yet another computer architecture with dozens/hundreds of GB of redundant RAM is kind of a bummer.
A bummer yes, but two NPUs, one on the CPU, another one on the GPU, that just sounds silly.
"[compilation] for productive programming of hardware accelerators"
But 93% of the codebase is Python lol. Whatever you think about Python, it is not a systems programming language. Takeaway: this is not a serious compiler project (and it's clearly not, it's a PhD student project).
Deeper take: this is just a DSL that behind the scenes calls a (SMT) solver for tuning (what they call "scheduling"). There are a million examples of this approach for every architecture under the sun. My org is literally building out the same thing right now. Performance is directly a function of how good your model of the architecture is. At such a high-level it's very likely to produce suboptimal results because you have no control over ISA/intrinsic level decisions. Lower-level implementations are much more robustly "powerful".
Well, this is clearly an attempt at abstracting the kind of low-level stuff you describe. Perhaps it doesn't work (yet), but that shouldn't prevent people from trying? Involving an SMT solver suggests that the solver is doing the heavy lifting, not Python. PhDs often produce inapplicable stuff, but they are also the basis for industry/application R&D, such as what your org is doing... PhDs are the slaves of science. They make stuff happen for peanuts in return and deserve our respect for that, even if what happens is oftentimes a dead-end. It's really sad seeing people shitting on PhDs.
Unfortunately, the comment you are responding to is more correct on this than I think you are. The Python thing was stupid, though - a lot of high-performance code gets written in libraries like numpy (that call C 99.99% of the time) or pytorch (JIT-ed before executing) that keep Python out of the critical path.
The problem with this research is that many similar ideas have been tried before in the contexts of supercomputers or CPU compilers. They ultimately all fail because they end up (1) being more work to program in, and (2) not being any faster because real life happens. Networks drop packets, clock frequencies jitter, and all sorts of non-determinism happens when you have large scale. A static scheduler forces you to stall the whole program for any of these faults. All the gain you got by painstakingly optimizing things goes away.
PhD theses, in a best-case scenario, are the basis for new applications. Most of them amount to nothing. This one belongs on that pile. The sad part about that is that it isn't the student's fault that the professor sent them down a research direction that is guaranteed to amount to nothing of use. This one is on the professor.
I don't think the paper is about statically scheduled architectures. In fact they mention it's for modern accelerators. These switch between threads in a dynamic way rather than stalling. The scheduling being referred to seemed to mean the order in which instructions should be fed to a potentially dynamic scheduler to enable efficient usage of caches etc.
So I'm not sure you can dismiss it as a thesis which will amount to nothing on the basis that static scheduling is a bad idea!
I could easily have missed something though. It's not a particularly clear or succinct write-up and I have only read some of it. If it does say that it only works for strictly deterministic in-order architectures somewhere please can you point out where?
If you read the actual github, linked from the article, you will realize that it is a manual static scheduler. Example here:
https://github.com/exo-lang/exo/blob/main/examples/avx2_matm...
The hardware still does what it does, but this is a tool for manually ordering operations.
I did, in fact, read the materials before dismissing them as useless, since this is very relevant to what I do professionally.
It's me who hadn't read all of the materials. But in fact I think we both agree about what the tool does, which is scheduling an instruction stream ahead of time.
I'm confused because that approach seems identical to all mainstream compilers I know of. GCC/Clang also schedule instructions statically and the right schedule will improve performance. Why won't it work here? What kind of dynamic scheduling do you think it needs in order to be useful? Like a tracing JIT? Or they need to implement the reordering in hardware and reschedule as instructions execute?
The issue is that it takes manual programmer effort for no gain over what a compiler gives you.
The selling point of most of these tools is that you can be smarter than the compiler, and this one (despite the example) is selling static scheduling of compute kernels, too. That would normally be up to the OS/driver with some flexibility.
Most CUDA kernels are hand tuned anyway, so it's not clear this is a lot more effort for the programmer. Most compilers can't perform these kinds of loop transformation and tiling optimizations for you. The CUDA compiler certainly doesn't do this. So it is actually both possible and very worthwhile to try and be smarter than the compiler in this case.
In terms of scheduling compute kernels, the project doesn't remove the ability of the OS/driver to schedule execution. It's only affecting the instruction sequence within the kernel, which is something OS/drivers don't typically control. They retain full control over when the kernel is executed by the hardware.
(PTX is sometimes compiled to SASS by the Nvidia driver rather than ahead of time. That does allow the driver to reschedule instructions, but using this project doesn't prevent it from doing that. The driver will compile PTX emitted from this project in the same way as any other.)
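To make the tiling point concrete, this is the shape of rewrite at issue, sketched in plain Python/NumPy (in a real kernel it would be done on the CUDA/C loop nest, and the payoff is cache and shared-memory reuse):

    import numpy as np

    def matmul_tiled(A, B, T=32):
        # Blocked matmul: walk the iteration space in T x T tiles so each tile
        # of A and B is reused many times while it is still resident in cache
        # or shared memory. Same arithmetic as the naive triple loop, different
        # traversal order -- the sort of rewrite compilers rarely do for you.
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m))
        for i0 in range(0, n, T):
            for j0 in range(0, m, T):
                for p0 in range(0, k, T):
                    C[i0:i0+T, j0:j0+T] += A[i0:i0+T, p0:p0+T] @ B[p0:p0+T, j0:j0+T]
        return C

    A, B = np.random.rand(96, 64), np.random.rand(64, 80)
    assert np.allclose(matmul_tiled(A, B), A @ B)  # same answer, different order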
Yes, micro-optimization with manual instruction scheduling is usually a great idea. No, macro-optimization with manual instruction scheduling is usually a bad idea. That it "doesn't remove" something isn't an argument that this is a good idea when the selling point is that it enables removal.
The "novel" idea here is the macro-optimization, with some overtures about making the micro-optimization easier - as you likely know, this is not true since the complexity here is more about understanding how to do better, not about what language features you use to make those changes.
Your take seems to contradict the article? You say SMT solvers give "no control over ISA/intrinsic level decisions" but their design.md says "user-defined scheduling operations can encapsulate common optimization patterns and hardware-specific transformations". Are they wrong about this? Can you explain why?
any sufficiently powerful compiler is going to run an interpreted language at compile time, and there’s no reason it can’t be Python instead of C++ template metaprograms or CMake
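i.e. the Python only runs at build time and emits the C that does the real work. A toy sketch of that pattern (not Exo's actual pipeline, just the general idea):

    def emit_axpy(unroll: int) -> str:
        # Build-time metaprogram: generate a specialized, unrolled C kernel.
        # Python decides what the C looks like and is gone by runtime.
        body = "\n".join(
            f"        y[i + {u}] += a * x[i + {u}];" for u in range(unroll)
        )
        return (
            "void axpy(int n, float a, const float *x, float *y) {\n"
            f"    for (int i = 0; i + {unroll} <= n; i += {unroll}) {{\n"
            f"{body}\n"
            "    }\n"
            f"    for (int i = n - n % {unroll}; i < n; i++) y[i] += a * x[i];\n"
            "}\n"
        )

    print(emit_axpy(4))  # compile the printed C; no Python on the hot path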
Maybe a takeaway that boils down to “how could anyone ever write a compiler in Python” is the wrong one to have.
Spent 5 minutes browsing but still couldn't figure this out. Would it be able to target FPGA?
Are the transforms one applies with this always correct (if not necessarily optimal)? Or is it possible to introduce a bug which didn’t exist in the original kernel?