
TLDR: We developed hierarchical SMC speculative decoding, which chains two speculative decoding algorithms (Eagle3 and SMC-SD), to speed up Qwen3-32B by 2×.
In our previous blog post [1] and paper [2], we introduced Sequential Monte Carlo Speculative Decoding, or SMC-SD, a new algorithmic optimization that unlocks previously unreachable token speeds on GPUs.
Using Makora’s automated inference optimization stack, we were able to take SMC-SD from research to production quickly. Today, we’ve deployed an SMC-SD-powered Makora Inference endpoint on app.makora.com, achieving the fastest Llama-3.3-70B-Instruct inference in the world on just 4x AMD MI325X GPUs. To put this in context, our deployment is even faster than custom-chip inference provider Groq, which is reportedly using 576 Groq chips [3] to power its Llama-3.3-70B inference endpoint.

We've been hard at work adding support for SMC-SD on newer models such as the Qwen3 model family. Along the way, we found another opportunity to push performance even further: accelerating the drafting step itself with an additional layer of speculative decoding.
Quick Recap: SMC-SD
Classic speculative decoding uses a small draft model to guess the next few tokens, then a big target model verifies them in one batched pass, accepting the matching prefix and rejecting the rest. Generation then rolls back to the last matching token and continues from there. This method speeds up LLM decoding primarily because it trades the slow, memory-bound autoregressive decoding of the large target model for a parallel verification pass with high computational intensity, which is a much better fit for GPU compute. However, the performance improvement depends on strong alignment between the draft and target models and on a high acceptance rate for the drafted tokens [4].
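The accept-the-matching-prefix step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the verification rule of any particular library: it uses greedy token matching rather than the probabilistic acceptance rule used when sampling, and the function name `verify_greedy` and the convention that the target contributes one extra token per round are choices made for this example.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens committed by one draft-verify round (greedy variant).

    target_tokens[i] is the target model's own choice at position i, given the
    context plus draft_tokens[:i]. It has one more entry than draft_tokens
    because the batched pass also scores the position after the last draft.
    """
    committed: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            # First mismatch: keep the target's token, discard the remaining drafts.
            committed.append(target_tokens[i])
            return committed
        committed.append(tok)
    # Every draft matched, so the extra target token is committed as well.
    committed.append(target_tokens[len(draft_tokens)])
    return committed


# Two of three drafts match, so the round commits three tokens.
print(verify_greedy([1, 2, 3], [1, 2, 9, 4]))
```

The number of tokens committed per round varies with how well the draft agrees with the target, which is the source of the fluctuating throughput that SMC-SD avoids.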
Sequential Monte Carlo (SMC) speculative decoding reframes the draft-verify process. Instead of accepting or rejecting at the token level, we keep a population of N candidate continuations (particles) and always accept all K drafted tokens, so the number of tokens committed per step is fixed and performance does not fluctuate. The target model then scores each particle; in a resampling process, the highest-quality particles are duplicated and the lowest-quality particles are discarded. In a nutshell, SMC-SD provides guaranteed high performance at the cost of approximating target-model inference with a population of draft particles. Empirically, accuracy degradation is minimal, and the tunable speed-accuracy Pareto frontier dominates prior speculative decoding solutions, as outlined in our paper [2]. Here’s a summary of the algorithm:
1. A cheap model q drafts a block of tokens for each of the N particles.
2. Commit every drafted token. Nothing is rejected.
3. Reweight each particle by an importance weight, which signifies how much the expensive target p likes that block versus how likely the draft q was to write it (p/q).
4. When the weights get lopsided, where a few particles dominate the distribution, we resample: kill the losers, clone the winners, continue.
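The reweight-and-resample steps above can be sketched as follows. This is an illustrative toy, not the implementation from the paper: the effective-sample-size test for "lopsided" weights, the 0.5 threshold, multinomial resampling, and the names `smc_step`, `block_logp_target`, and `block_logp_draft` are all assumptions made for this example.

```python
import math
import random
from typing import List, Optional, Tuple


def normalize(log_weights: List[float]) -> List[float]:
    """Turn log-weights into probabilities, subtracting the max for stability."""
    m = max(log_weights)
    unnorm = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(weights: List[float]) -> float:
    """N for uniform weights, close to 1 when a single particle dominates."""
    return 1.0 / sum(w * w for w in weights)


def smc_step(
    particles: List[List[int]],
    log_weights: List[float],
    block_logp_target: List[float],  # log p(block) per particle, from the target
    block_logp_draft: List[float],   # log q(block) per particle, from the draft
    ess_fraction: float = 0.5,       # assumed resampling threshold
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[int]], List[float]]:
    rng = rng or random.Random(0)
    n = len(particles)
    # Reweight: multiply each weight by p/q, i.e. add log p - log q.
    log_weights = [
        lw + lp - lq
        for lw, lp, lq in zip(log_weights, block_logp_target, block_logp_draft)
    ]
    weights = normalize(log_weights)
    if effective_sample_size(weights) < ess_fraction * n:
        # Resample: clone particles in proportion to weight, then reset weights.
        idx = rng.choices(range(n), weights=weights, k=n)
        particles = [list(particles[i]) for i in idx]
        log_weights = [0.0] * n
    return particles, log_weights
```

Working in log space keeps the p/q ratio stable over long blocks, where the raw probabilities would underflow.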
Hierarchical SMC-SD
In SMC-SD, drafting is computationally intensive, typically taking as much time as scoring. This is because SMC-SD draft models are usually chosen to be larger than conventional SD ones to be able to produce high-quality drafts, and because we maintain N draft particles at once. This led us to investigate ways to speed up the drafting process.
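To get a rough sense of how much a faster drafter can help, the back-of-the-envelope estimate below applies Amdahl's law to one SMC-SD step. It assumes, for illustration only, that a step consists of just drafting and scoring and that drafting takes a fixed fraction of the step time; `overall_speedup` and its parameters are hypothetical names, and the numbers are not measurements.

```python
def overall_speedup(draft_fraction: float, draft_speedup: float) -> float:
    """Amdahl's law: end-to-end gain when only the drafting part gets faster.

    draft_fraction: share of step time spent drafting (0.5 if drafting takes
                    as long as scoring).
    draft_speedup:  factor by which drafting alone is accelerated.
    """
    if not 0.0 <= draft_fraction <= 1.0:
        raise ValueError("draft_fraction must be in [0, 1]")
    if draft_speedup <= 0.0:
        raise ValueError("draft_speedup must be positive")
    return 1.0 / ((1.0 - draft_fraction) + draft_fraction / draft_speedup)


# With drafting at half of the step time, the end-to-end gain stays below 2x.
for s in (1.5, 2.0, 4.0):
    print(f"drafting {s:.1f}x faster -> step {overall_speedup(0.5, s):.2f}x faster")
```

Under this simplified model, the larger the share of time spent drafting, the more there is to gain from accelerating it.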

Hierarchical SMC-SD performs Eagle3 speculative decoding [5] to speed up draft generation for SMC-SD [2].
The method that we explore here was right in front of us all along. We can use Eagle3-style speculative decoding [5] to speed up the generation of draft particles, which are then scored by a target model. This hierarchical speculative decoding is inspired by Triforce [6], prior work that performs hierarchical verification in the conventional speculative decoding setting. In our case, the first level of the hierarchy uses conventional speculative decoding to speed up draft generation, and the second level steers the drafted particles with SMC-SD as described in our prior work [2].
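The two levels can be sketched as a toy control flow. This is only an illustration of how the levels compose, not Makora's implementation: the callables `head_next`, `draft_next`, `block_logp_target`, and `block_logp_draft` are hypothetical stand-ins for the Eagle3 head, the draft model, and the two scoring passes, and the sketch uses deterministic greedy matching and per-token calls where a real system would sample and verify in one batched pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]
BlockLogProb = Callable[[List[int], List[int]], float]


def draft_block(context: List[int], head_next: NextToken, draft_next: NextToken,
                block_len: int, head_len: int = 3) -> List[int]:
    """Level 1: build a block of draft-model tokens, with a cheap head proposing
    short runs that the draft model verifies by greedy matching."""
    out: List[int] = []
    while len(out) < block_len:
        proposal: List[int] = []
        for _ in range(head_len):
            proposal.append(head_next(context + out + proposal))
        for tok in proposal:
            expected = draft_next(context + out)
            out.append(expected)  # equals tok on a match, else the correction
            if tok != expected or len(out) == block_len:
                break
    return out


def hierarchical_round(particles: List[List[int]], head_next: NextToken,
                       draft_next: NextToken, block_logp_target: BlockLogProb,
                       block_logp_draft: BlockLogProb,
                       block_len: int) -> Tuple[List[List[int]], List[float]]:
    """Level 2: commit every drafted block and return log(p/q) per particle."""
    extended: List[List[int]] = []
    log_ratios: List[float] = []
    for ctx in particles:
        block = draft_block(ctx, head_next, draft_next, block_len)
        extended.append(ctx + block)  # nothing is rejected at this level
        log_ratios.append(block_logp_target(ctx, block) - block_logp_draft(ctx, block))
    return extended, log_ratios
```

In this toy, the block returned by `draft_block` is identical to what the draft model would produce by plain greedy decoding, so the head only changes how quickly the block is produced, not its content.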
Results
To test out this idea, we set up a Qwen3-32B-FP8 target with a Qwen3-8B-FP8 draft, and we tested the variations in the table below on an NVIDIA H100 GPU. This preliminary experiment shows that we can indeed push beyond the high performance of SMC-SD by speeding up the drafting process with Eagle3. Note that {N,K} for SMC-SD below is {6,16}.
| Config | tok/s | Speedup |
|---|---|---|
| | 96.3 | 1.00× |
| EAGLE3-SD [5] (EAGLE3 head → | 121.3 | 1.26× |
| SMC-SD [2] ( | 173.1 | 1.80× |
| Hierarchical SMC-SD (EAGLE3 → 8B → 32B) | 200.2 | 2.08× |
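The speedup column can be recomputed from the tok/s column, which also gives the incremental gain of the hierarchical scheme over SMC-SD alone. The snippet below is a small arithmetic check; the row labels are shortened, and the label "baseline" for the 1.00× row is an assumption of this example.

```python
# tok/s values copied from the table above; "baseline" is an assumed label
# for the 1.00x row.
TOK_PER_S = {
    "baseline": 96.3,
    "EAGLE3-SD": 121.3,
    "SMC-SD": 173.1,
    "Hierarchical SMC-SD": 200.2,
}


def speedup(tok_per_s: float, reference: float) -> float:
    """Throughput ratio relative to a reference configuration."""
    return tok_per_s / reference


for name, rate in TOK_PER_S.items():
    print(f"{name}: {speedup(rate, TOK_PER_S['baseline']):.2f}x")

# Gain from adding Eagle3 drafting on top of SMC-SD.
extra = speedup(TOK_PER_S["Hierarchical SMC-SD"], TOK_PER_S["SMC-SD"])
print(f"Hierarchical over SMC-SD: {extra:.2f}x")
```

On these numbers, accelerating the drafter with Eagle3 adds roughly 16% throughput on top of SMC-SD.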
Our hierarchical scheme leverages the strengths of each speculative decoding method. Eagle3 is a small co-trained sub-model that is tuned to draft short continuations (3-4 tokens) very well. Conversely, SMC-SD draft models are generally larger and capable of drafting much longer continuations (16-48 tokens). Composing them in our hierarchical scheme plays to both of their strengths and boosts performance beyond what either method achieves by itself. All the schemes above perfectly preserve target model accuracy on both the GSM8k [7] and HumanEval [8] benchmarks.
References
[1] SMC-SD: The Fastest GPU-based LLM Inference in the World. https://www.makora.com/blog/smc-sd.
[2] Emara, Yahya, et al. "Faster LLM Inference via Sequential Monte Carlo." arXiv preprint arXiv:2604.15672 (2026). https://arxiv.org/pdf/2604.15672.
[3] SambaNova vs. Groq: The AI Inference Face-Off. https://sambanova.ai/blog/sambanova-vs-groq.
[4] Wang, Junxiong, et al. "When RL Meets Adaptive Speculative Training: A Unified Training-Serving System." arXiv preprint arXiv:2602.06932 (2026).
[5] Li, Yuhui, et al. "Eagle-3: Scaling up inference acceleration of large language models via training-time test." Advances in Neural Information Processing Systems 38 (2026): 136737-136756.
[6] Sun, Hanshi, et al. "Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding." Conference on Language Modeling (CoLM). 2024.
[7] Cobbe, Karl, et al. "Training Verifiers to Solve Math Word Problems." arXiv preprint arXiv:2110.14168 (2021).
[8] Chen, Mark, et al. "Evaluating Large Language Models Trained on Code." arXiv preprint arXiv:2107.03374 (2021).




