TL;DR: The methodology for extracting high performance is changing with AI. The days of relying on hand-built software abstractions are ending, as agentic code generation rapidly replaces them. An automated performance engineering stack has become necessary to adapt to new models, algorithms, and hardware.

Introduction

A few weeks ago, I was invited to speak at the Frontiers of AI Summit, joined by a stellar lineup of founders and researchers across academia and industry. This gave me the opportunity to present Makora’s vision for AI performance engineering, bringing together the different technologies and products that we have been developing over the past two years. The talk was recorded and is publicly available. This blog post summarizes the talk and expands on some of its main points, laying a foundation for Makora’s approach to AI performance engineering.

Software Abstractions → Agentic Code Generation

AI is advancing at a pace that is difficult for traditional software stacks to match. Models are growing larger, workloads are becoming more diverse, and hardware platforms are evolving rapidly across GPUs, accelerators, memory systems, and datacenter-scale interconnects. Yet the performance of AI systems still depends heavily on low-level software: kernels, schedules, memory layouts, communication patterns, and hardware-specific optimizations.

This creates a widening gap. On one side, hardware vendors continue to expose increasingly powerful capabilities. On the other, most developers and even many performance engineers cannot manually keep up with the complexity needed to fully exploit them. The traditional solution to this problem is to build easy-to-program software abstractions and compilers that give access to hardware performance. The central argument of this talk is that the next generation of AI infrastructure will require a new approach: automatic GPU performance engineering powered by agentic code generation and automatic search.

Why is AI Performance Engineering difficult?

The AI performance engineering stack contains multiple layers, all of which are challenging to manually optimize and tune for performance. In the following subsections, we outline the challenges at three layers of the AI software stack: GPU code, serving engines, and algorithms.

GPU Code (Kernels)

At the bottom of the stack, hand-written kernel code is needed to program different types of accelerators, from CPUs to GPUs to specialized AI accelerators. Heterogeneity has become a reality for high-performance AI workloads, making the ability to target multiple hardware types crucial to achieving the best performance/$. However, low-level code is notoriously difficult to write, and the difficulty is compounded by fast hardware release cycles and rapidly changing AI workloads. For example, NVIDIA B200 matrix units can only be accessed using the tcgen05 PTX instructions, which are neither forward nor backward compatible with other NVIDIA GPUs. This makes updating low-level code a large and time-consuming burden that delays the use of new AI chips. A well-known example is the FlashAttention-4 (FA4) kernels, which are central to deploying transformer-based AI models efficiently. FA4 was released months after B200 GPUs had already been installed, causing servers to sit idle for weeks or months.

Our solution is scalable agentic code generation with LLMs, including our own fine-tuned GPT-5 variants [1]. Our product, MakoraGenerate, can be accessed at https://generate.makora.com/, and we have seen multiple successes in rapidly adding support for new models, new hardware, and new algorithms. Furthermore, our automated flow has improved multiple hand-optimized GPU kernels from both NVIDIA and AMD.
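To make the idea of automatic search over generated code concrete, here is a minimal sketch of the selection step that such a flow could end with: candidates are checked against a trusted reference and only the fastest correct one is kept. This is an illustration under our own assumptions, not MakoraGenerate's actual pipeline. The function names (reference_dot, candidate_builtin_sum, candidate_buggy, select_best) are hypothetical, and plain Python functions stand in for generated GPU kernels.

```python
import time


def reference_dot(a, b):
    """Trusted baseline: a plain dot product."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def candidate_builtin_sum(a, b):
    """Hypothetical generated candidate that is correct."""
    return sum(x * y for x, y in zip(a, b))


def candidate_buggy(a, b):
    """Hypothetical generated candidate that is wrong: it drops the last element."""
    return sum(x * y for x, y in zip(a[:-1], b[:-1]))


def is_correct(candidate, reference, inputs, tol=1e-9):
    """Accept a candidate only if it matches the reference on every test input."""
    return all(abs(candidate(*args) - reference(*args)) <= tol for args in inputs)


def median_runtime(fn, args, repeats=5):
    """Median wall-clock time of fn(*args) over a few repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def select_best(candidates, reference, inputs):
    """Return the name of the fastest correct candidate, or None if none pass."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        if not is_correct(fn, reference, inputs):
            continue  # Incorrect candidates are rejected before any timing.
        elapsed = median_runtime(fn, inputs[0])
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


if __name__ == "__main__":
    test_inputs = [([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])]
    pool = {"builtin_sum": candidate_builtin_sum, "buggy": candidate_buggy}
    print(select_best(pool, reference_dot, test_inputs))
```

The design point is the ordering: correctness gates performance, so a fast but wrong candidate can never win. A real flow would time kernels on the target GPU with proper warm-up and synchronization rather than with time.perf_counter on the host.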

Inference Serving Engines

One layer up the stack, inference servers such as vLLM, SGLang, TensorRT-LLM, ATOM, and others present many opportunities for performance optimization. In fact, poorly chosen server parameters can cost as much as 10x in performance! A recent article shows how enterprises are struggling to adapt inference servers to AI workloads [2]. The plot above highlights the large performance design space possible for a single AI model. Navigating that space is non-trivial, yet pivotal for AI performance engineering.

At Makora, we have developed MakoraOptimize to autonomously navigate that large design space, including (1) investigating different model sharding strategies, (2) implementing different quantization and speculative decoding algorithms whilst co-optimizing other serving parameters, and (3) together with MakoraGenerate, swapping in shape-optimized (for batch and sequence length) kernels for maximum performance.
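As a rough illustration of what navigating a serving design space means, the sketch below enumerates a small configuration grid, filters out infeasible points, and keeps the best-scoring one. Everything here is assumed for illustration: the parameter names, the feasibility rule, and especially toy_throughput, which is a made-up formula standing in for a real benchmark run. It is not how MakoraOptimize works internally.

```python
from itertools import product

# Hypothetical search space; real servers expose many more knobs.
SEARCH_SPACE = {
    "tensor_parallel": [1, 2, 4, 8],
    "max_batch_size": [8, 32, 128],
    "quantization": ["fp16", "fp8"],
}


def toy_throughput(config):
    """Synthetic stand-in for a measured tokens/s figure (made-up formula)."""
    tp = config["tensor_parallel"]
    batch = config["max_batch_size"]
    speedup = 1.5 if config["quantization"] == "fp8" else 1.0
    # Diminishing returns from sharding, saturating gains from batching.
    return 100.0 * speedup * (tp ** 0.8) * batch / (batch + 16)


def is_feasible(config, gpus_available=4):
    """Assumed constraint: cannot shard across more GPUs than we have."""
    return config["tensor_parallel"] <= gpus_available


def grid_search(space, score, feasible):
    """Exhaustively score every feasible configuration and return the best."""
    keys = list(space)
    best_config, best_score = None, float("-inf")
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        if not feasible(config):
            continue
        current = score(config)
        if current > best_score:
            best_config, best_score = config, current
    return best_config, best_score


if __name__ == "__main__":
    config, score = grid_search(SEARCH_SPACE, toy_throughput, is_feasible)
    print(config, round(score, 1))
```

Even this toy grid has 24 points, and each real evaluation means launching a server and running a benchmark, which is why exhaustive search stops being practical quickly and a guided, automated search becomes attractive.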

Algorithm Innovation

The AI research community is on fire! There are new papers every day on improving efficiency through algorithm innovation. This includes new quantization schemes, datatype innovations, speculative decoding algorithms, inference algorithms, and more. However, the main blocker to implementing these algorithms is the complexity of the AI performance stack. Fully realizing the performance improvement from a new research idea inevitably requires low-level kernel support and inference server support, including scheduling, memory management, and other optimizations.

A perfect example is our own SMC-SD inference algorithm, developed within Makora with collaborators [3]. To implement this new algorithm, we needed to significantly modify KV-cache management, add GPU-side resampling kernels, and make many other inference server optimizations to overlap CPU and GPU execution (more blog posts on that later!). Without Makora’s automated software stack, it would have been impossible to deploy SMC-SD in production so quickly. The power of MakoraGenerate and MakoraOptimize shines in deploying new algorithms, and we are hard at work bringing many cool new research ideas into production. The plot below shows that we were able to achieve faster performance using 4x MI325X compared to 576 Groq chips on Llama3.3-70B using SMC-SD and our automated software stack.
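For readers unfamiliar with the resampling step in Sequential Monte Carlo, the sketch below shows one textbook variant, systematic resampling, in plain Python on the CPU. It is a generic illustration only: we are not claiming this is the resampling scheme used in SMC-SD [3] or what the GPU-side kernels implement, and the function name systematic_resample is our own.

```python
import random


def systematic_resample(weights, rng=None):
    """Return particle indices drawn in proportion to non-negative weights.

    Uses a single random offset and evenly spaced targets across the
    cumulative weight, which is the standard systematic resampling scheme.
    """
    rng = rng or random.Random()
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")

    step = total / n
    # Offset in (0, step], so a target is never exactly zero and
    # zero-weight particles at the front are skipped.
    offset = (1.0 - rng.random()) * step

    indices = []
    i = 0
    cumulative = weights[0]
    for k in range(n):
        target = offset + k * step
        # Advance until the cumulative weight covers the target.
        while cumulative < target and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


if __name__ == "__main__":
    # All mass on particle 2: every survivor is a copy of particle 2.
    print(systematic_resample([0.0, 0.0, 1.0, 0.0], random.Random(0)))
```

The loop is sequential here, but each target can in principle be resolved independently against a prefix sum of the weights, which is one reason this kind of step is a natural candidate for a GPU kernel.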

Conclusion

Our automated software stack culminates in MakoraInference, our token-serving product for popular AI models, which runs on multiple hardware backends. It is the ultimate test of the efficacy of our performance engineering approach, and it leverages MakoraGenerate and MakoraOptimize for maximum performance. In addition, orchestration and scaling optimizations at the web level, including prefix-cache management, model-level and request-level disaggregation, and optimized load balancing, ensure a fast and scalable inference product. It is online and accessible now at https://app.makora.com/.

References

[1] Tehrani, Ali, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, and Mohamed S. Abdelfattah. "Fine-Tuning GPT-5 for GPU Kernel Generation." arXiv preprint arXiv:2602.11000 (2026).

[2] Business Insider. "Emails show Bank of America's struggles with Nvidia AI: 'You have to help us as local car mechanics drive the race car!'" https://www.businessinsider.com/bank-of-america-nvidia-ai-internal-emails-2026-1

[3] Emara, Yahya, et al. "Faster LLM Inference via Sequential Monte Carlo." arXiv preprint arXiv:2604.15672 (2026). https://arxiv.org/pdf/2604.15672.
