Chat with Philippe Tillet, Author of Triton
Summary. Triton is an open-source Python library for writing highly efficient GPU code. It requires no CUDA expertise, and its performance is on par with expert hand-tuned kernel libraries.
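To give a flavor of the programming model, here is a minimal vector-addition kernel in the style of Triton's introductory tutorial. This is our own illustrative sketch, not something from the conversation:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y must be CUDA tensors of the same shape.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Note that the kernel is plain Python: there is no CUDA code, and tiling, masking, and parallelization are expressed at the level of blocks rather than individual threads.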
Below we summarize our exchange with Philippe. This summary is reconstructed from a combination of notes and memory, so it is not a precise transcript of our conversation, but we have tried our best to be faithful.
Philippe offered a few suggestions for building future compilers. The editor of this summary believes that many of these lessons generalize to building any user-facing research tool or product.
- Be the power user. Philippe advises aspiring compiler developers to use their own compiler often, so they can empathize with the pain points users may feel. When developing Triton, he put a lot of emphasis on making the compiler usable out of the box with a simple pip install.
- Find a niche. Triton saw its initial success by excelling at one particular use case: blocksparse matrix multiplication (see the sketch after this list). Philippe believes that if you can do one thing really, really well, that alone can dramatically increase adoption of your compiler.
- Users that push the limit. Philippe mentioned that he particularly enjoyed working at OpenAI, where his colleagues experimented with so many different things that they often stress-tested his compiler to its limit. Having demanding users motivated him to quickly evolve the design and implementation of the compiler.
- Slow growth. Philippe warned us about the overhead of managing an open-source community and advised us not to worry too much about growing the community early on in an open-source project's development.
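For readers unfamiliar with the niche mentioned above: in a blocksparse matrix multiplication, one operand is mostly zero, but its nonzeros come in dense block-tiles, so only the stored tiles need to be multiplied. Below is a naive PyTorch reference of the semantics, purely our own illustration; Triton's kernel instead fuses all of this into a single fast GPU launch:

```python
import torch

def blocksparse_matmul(a_tiles, layout, b, block=32):
    """Multiply a block-sparse matrix A by a dense matrix b.

    A is stored as a stack of dense (block x block) tiles `a_tiles`,
    one per True entry of `layout` (shape: block-rows x block-cols),
    scanned in row-major order.
    """
    R, C = layout.shape
    out = torch.zeros(R * block, b.shape[1], dtype=b.dtype)
    tile = 0
    for r in range(R):
        for c in range(C):
            if layout[r, c]:
                # Only stored tiles contribute; zero blocks are skipped entirely.
                out[r * block:(r + 1) * block] += (
                    a_tiles[tile] @ b[c * block:(c + 1) * block]
                )
                tile += 1
    return out
```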
Q/A Highlights:
- Q: Performance engineering on NVIDIA GPUs appears to involve tricks and knowledge of architectural details that only insiders know. How did you overcome this CUDA moat?
Philippe thinks this is largely a myth. In the early days, the CUDA compiler was not capable enough and writing GPU kernels in lower-level assembly (PTX/SASS) was often necessary, but things have improved drastically in recent years: writing in the high-level CUDA language is sufficient to achieve high performance in most cases. Nowadays, even NVIDIA itself is unlikely to be writing its kernel libraries directly in assembly.
- Q: Many researchers have tried to make attention faster. Is attention done?
Philippe mentioned that he had independently been trying to do what FlashAttention did; when he saw the paper, he dropped everything to recreate it, because it was what he had wanted to build, only better. That said, Philippe believes anything could happen and we don't know for certain.
- Q: What are some hardware architecture modifications that may prove useful for accelerating neural network models?
Philippe doesn't believe in making larger and larger SRAMs to fit all transformer weights inside them. Low-precision arithmetic is clearly good because it scales down memory and compute at the same time without changing arithmetic intensity (see the sketch below). The latest Hopper architecture is becoming increasingly asynchronous, so working on good asynchronous programming models will also be a fruitful direction.
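Our reading of the arithmetic-intensity point, as a back-of-the-envelope sketch on a simple roofline model (the hardware numbers below are hypothetical, not Philippe's):

```python
def time_balance(flops, bytes_moved, peak_flops, bandwidth_bytes):
    """Ratio of compute time to memory time on a simple roofline model."""
    return (flops / peak_flops) / (bytes_moved / bandwidth_bytes)

M = N = K = 4096
flops = 2 * M * N * K            # multiply-adds in a matmul
elems = M * K + K * N + M * N    # elements read/written once

# Hypothetical accelerator: halving precision halves bytes moved and
# (on tensor cores) roughly doubles peak throughput.
fp32 = time_balance(flops, 4 * elems, peak_flops=20e12, bandwidth_bytes=1e12)
fp16 = time_balance(flops, 2 * elems, peak_flops=40e12, bandwidth_bytes=1e12)

# The compute/memory balance is unchanged: both times simply halve, so the
# kernel runs ~2x faster without becoming more memory-bound.
assert abs(fp32 - fp16) < 1e-9
```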
- Q: Is there any chance unstructured sparsity may help accelerate neural networks?
Philippe thinks it may work if the sparse weights can fit inside SRAM. This may be the case for certain vision/speech workloads.
- Q: As the scale of training grows, performance engineering will represent a smaller and smaller fraction of the cost of training an LLM. How does this affect people working on compilers?
Philippe thinks that it is in general hard to find people who have the know-how of compilers and low-level programming. The fact that performance engineering is a shrinking fraction of labor cost makes these skills all the more valuable: companies will not hire a large number of performance engineers, due to the communication overhead, but will instead be willing to spend more to hire the best ones.