Open-sourcing 600,000 Triton kernels via Hugging Face

triton-gpu-latency is a dataset of 600,000 Triton kernels with full evaluation results

Written by

Blazej Tez

Mohamed Abdelfattah

TL;DR: We're releasing triton-gpu-latency, a dataset of 600k Triton kernels with full evaluation results. It is designed to provide a representative distribution of LLM-generated GPU kernels for an extended KernelBench dataset, and it is available now on Hugging Face at makora-ai/triton-gpu-latency.

What's in it

Each row is a complete, self-contained Python file. It contains a reference PyTorch Model defining a problem (matmul, normalization, attention variants, and many more), and a candidate ModelNew that re-implements the same forward pass — almost always with an AI-written Triton kernel. The label y is the measured wall-clock runtime of executing that candidate, or None when the kernel failed to compile, didn't pass the correctness check, or otherwise couldn't be benchmarked.
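Because every row is an ordinary Python file, its structure can be checked without executing it. The sketch below parses a source string with the standard-library ast module and confirms that both Model and ModelNew are defined at the top level. The helper names and the toy source string are illustrative, not part of the dataset; which column holds the file text is documented on the dataset card and is not assumed here.

```python
import ast


def defined_classes(source: str) -> set[str]:
    """Return the names of top-level classes in a Python source string.

    The source is only parsed, never executed, so torch and triton
    do not need to be installed. Unparseable text yields an empty set.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return set()
    return {node.name for node in tree.body if isinstance(node, ast.ClassDef)}


def has_reference_and_candidate(source: str) -> bool:
    """True when the file defines both the reference Model and the candidate ModelNew."""
    return {"Model", "ModelNew"} <= defined_classes(source)


# Toy stand-in for a row's file text; real rows are full PyTorch/Triton programs.
example_source = '''
class Model:
    def forward(self, x):
        return x

class ModelNew:
    def forward(self, x):
        return x
'''
print(has_reference_and_candidate(example_source))
```

A check like this can serve as a cheap sanity filter before any heavier processing.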

The reference problems are drawn from the extended KernelBench suite introduced in METR's Measuring Automated Kernel Engineering report [1] — a broadened version of the original KernelBench problem set [2] used to evaluate how well AI systems can write optimized GPU kernels.

  • 544,028 train rows, 57,024 test rows. ~1.97 GB of Parquet.

  • ~35% failures, distributed across both splits. The failures are themselves signal — they teach a model what doesn't work.

  • 30 distinct test problems, partitioned across three holdout regimes (full_holdout, ninety_holdout, half_holdout) so you can measure generalization at different levels of overlap between train and test.
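The failure share quoted above can be recomputed from the label alone, since a failed candidate is marked by y being None. The following is a minimal sketch over any iterable of dict-like rows. The function name and the toy rows are made up for illustration; only the y field comes from the description above.

```python
def failure_share(rows) -> float:
    """Fraction of rows whose label y is None (candidate could not be benchmarked)."""
    total = 0
    failed = 0
    for row in rows:
        total += 1
        if row["y"] is None:
            failed += 1
    # Guard against an empty input rather than dividing by zero.
    return failed / total if total else 0.0


# Toy rows standing in for dataset records; the runtime values are invented.
toy_rows = [{"y": 0.81}, {"y": None}, {"y": 1.97}, {"y": None}]
print(f"{failure_share(toy_rows):.0%}")  # prints 50%
```

Applied to each split of the loaded dataset, the same function should give one failure share per split, assuming the rows iterate as dictionaries with a y key.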

The candidate kernels are the complete auto-generated output of our production agent during its first few months of operation. They reflect that early window and do not represent the current performance of MakoraGenerate.

We previously used an extended, proprietary version of this dataset to fine-tune GPT-5 for GPU kernel generation [3]. The public release here is a starting point for the same kind of work in the open. Please get in touch if you are interested in the proprietary dataset, which expands the number of problems and solutions by two orders of magnitude.

What it's good for

  • Supervised fine-tuning of a Triton-kernel generator on the ~363k successful candidates.

  • Training a latency reward model: regress on y to rank candidates without running them. Hint: more on this will be released soon!

  • Training a correctness verifier: the y is None label is a free correctness signal you can use as a filter inside synthesis loops.

  • Benchmarking generalization across the three holdout regimes — full_holdout is the hardest, holding out entire problems.
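The first three uses above all start from the same label. As a rough sketch, rows with a measured y form a fine-tuning pool, the y is None flag becomes a binary verifier target, and a latency predictor can be scored by how often it orders pairs of candidates the same way the measurements do. All function names, the toy rows, and the predicted scores below are hypothetical; pairwise ranking accuracy is one possible metric, not necessarily the one used by the authors.

```python
from itertools import combinations


def successful(rows):
    """Keep rows with a measured runtime, e.g. as a supervised fine-tuning pool."""
    return [row for row in rows if row["y"] is not None]


def correctness_labels(rows):
    """Binary verifier targets: 1 when the candidate was benchmarked, 0 when y is None."""
    return [0 if row["y"] is None else 1 for row in rows]


def pairwise_ranking_accuracy(measured, predicted):
    """Share of candidate pairs ordered the same way by predictions and measurements.

    Pairs with tied measurements are skipped; tied predictions count as misses.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(measured)), 2):
        if measured[i] == measured[j]:
            continue
        total += 1
        # Same sign of both differences means the pair is ordered consistently.
        if (measured[i] - measured[j]) * (predicted[i] - predicted[j]) > 0:
            agree += 1
    return agree / total if total else 0.0


# Toy rows and predictor scores, invented for illustration.
toy_rows = [{"y": 2.0}, {"y": None}, {"y": 0.5}, {"y": 1.0}]
pool = successful(toy_rows)
labels = correctness_labels(toy_rows)
measured = [row["y"] for row in pool]
predicted = [1.8, 0.7, 0.6]  # hypothetical reward-model outputs for the pool

print(len(pool), labels, round(pairwise_ranking_accuracy(measured, predicted), 3))
```

Only the relative order of the predictions matters for this metric, which matches the goal of ranking candidates without running them.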

Get started

from datasets import load_dataset

ds = load_dataset("makora-ai/triton-gpu-latency")

The dataset card has the full schema, distribution statistics, and code snippets for the patterns above.

We're excited to see what people build on top of it. If you train a model on this dataset or use it for research, we'd love to hear about it.

References

[1] METR. "Measuring Automated Kernel Engineering." February 14, 2025. https://metr.org/blog/2025-02-14-measuring-automated-kernel-engineering

[2] Ouyang, Anne, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. "KernelBench: Can LLMs Write Efficient GPU Kernels?" arXiv preprint arXiv:2502.10517 (2025).

[3] Tehrani, Ali, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, and Mohamed S. Abdelfattah. "Fine-Tuning GPT-5 for GPU Kernel Generation." arXiv preprint arXiv:2602.11000 (2026).