One Data Type is Not All You Need for 4-bit Quantization

MixFP4 is an extension to NVFP4 that improves accuracy with no additional memory cost

Written by

Tripp Lyons

Published on

TLDR: We are releasing makora-ai/Qwen3.6-35B-A3B-MixFP4, a quantized checkpoint of Qwen3.6-35B-A3B that uses the experimental MixFP4 format. It requires no calibration data to produce and scores higher on accuracy benchmarks than NVFP4 quantized checkpoints from Nvidia and Unsloth. The model is available for free on Hugging Face: https://huggingface.co/makora-ai/Qwen3.6-35B-A3B-MixFP4

Introducing MixFP4, an accuracy-improving extension to NVFP4

MixFP4 (Zou et al., 2026) is an adaptive extension to NVFP4 that enables block-wise selection between NVFP4 (E2M1 floating point) and INT4 (4-bit signed integer) representations to better match local tensor statistics.

Modern LLM tensors contain blocks with dramatically different value distributions. Blocks with large outliers benefit from exponent-heavy NVFP4 representations, while flatter blocks are better represented by an INT4 codebook. Rather than forcing every block to use the same numerical format, MixFP4 adaptively selects between the two formats for each block without introducing additional parameters.

What MixFP4 changes

Like NVFP4, MixFP4 quantizes values in 16-element blocks. Each block is accompanied by an 8-bit value: an unsigned FP7 E4M3 scaling factor, plus the bit that would otherwise be the sign bit, which instead indicates the data type of the block (0 for INT4 and 1 for NVFP4). This gives the loader enough information to recover the selected representation without adding a separate metadata tensor for the block type.
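To make the layout concrete, here is a minimal Python sketch of how a loader might decode one block. It is an illustration under stated assumptions, not our kernel code: the function names are hypothetical, the scale is assumed to use the usual E4M3 bias of 7, the 4-bit codes are assumed to be standard E2M1 (sign, two exponent bits, one mantissa bit) or two's-complement INT4, and special encodings such as NaN are ignored.

```python
# Hypothetical helper names; bit layouts below are assumptions for illustration.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e4m3_magnitude(bits7):
    """Decode 4 exponent bits + 3 mantissa bits (bias 7, no sign bit)."""
    exp = (bits7 >> 3) & 0xF
    man = bits7 & 0x7
    if exp == 0:
        # Subnormal range: no implicit leading 1.
        return (man / 8.0) * 2.0 ** -6
    # Normal range. The E4M3 NaN pattern is not special-cased in this sketch.
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_scale_byte(byte):
    """Split the per-block byte into (format, scale)."""
    # Top bit is the type flag: 0 for INT4, 1 for NVFP4 (as described above).
    fmt = "NVFP4" if (byte >> 7) & 1 else "INT4"
    return fmt, decode_e4m3_magnitude(byte & 0x7F)


def decode_code(code, fmt):
    """Map one 4-bit code to its unscaled value."""
    if fmt == "NVFP4":
        sign = -1.0 if code & 0x8 else 1.0
        return sign * E2M1_MAGNITUDES[code & 0x7]
    # INT4: assumed two's complement, range -8..7.
    return float(code - 16 if code & 0x8 else code)


def dequantize_block(scale_byte, codes):
    """Recover real values for one block of 4-bit codes."""
    fmt, scale = decode_scale_byte(scale_byte)
    return [scale * decode_code(c, fmt) for c in codes]
```

Under this layout, a block costs 16 × 4 + 8 = 72 bits, or 4.5 bits per weight, which is the same per-block cost as NVFP4.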

No calibration dataset is required. The block format is selected from quantization error on the weights themselves, so the conversion does not need a representative prompt dataset, a calibration pass, or application-specific tuning before it can run.
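The selection step can be sketched in a few lines of plain Python. This is a simplified illustration, not our conversion pipeline: it assumes an absmax-derived scale, mean squared error as the error metric, a symmetric INT4 grid (-7 to 7), and it skips rounding the scale to E4M3. All names are hypothetical.

```python
# Representable values of each 4-bit format before scaling.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_LEVELS = sorted({s * v for v in E2M1_MAGNITUDES for s in (-1.0, 1.0)})
INT4_LEVELS = [float(i) for i in range(-7, 8)]  # assumption: symmetric grid


def quantize_block(block, levels):
    """Round a block to the nearest scaled level; return (dequantized, mse)."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return list(block), 0.0
    # Assumption: map the largest magnitude onto the largest level.
    scale = absmax / max(levels)
    deq = [scale * min(levels, key=lambda q: abs(x / scale - q)) for x in block]
    mse = sum((x - y) ** 2 for x, y in zip(block, deq)) / len(block)
    return deq, mse


def choose_format(block):
    """Pick the format with the lower reconstruction error for this block."""
    _, err_fp4 = quantize_block(block, E2M1_LEVELS)
    _, err_int4 = quantize_block(block, INT4_LEVELS)
    if err_fp4 < err_int4:
        return "NVFP4", err_fp4
    return "INT4", err_int4


# A block with one outlier and many small values suits the E2M1 grid,
# while evenly spread values suit the uniform INT4 grid.
outlier_block = [6.0] + [0.5] * 15
flat_block = [float(i - 7) for i in range(15)] + [0.0]
print(choose_format(outlier_block)[0], choose_format(flat_block)[0])
```

Because the decision depends only on the weights in the block, the loop needs no activations or prompts, which is why no calibration pass is involved.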

Implementation

Our model is available through the Hugging Face transformers library for easy use (see our example later). This implementation is designed primarily to let others verify our checkpoint's quality, not to deliver fast inference. Fast inference requires custom MixFP4 kernels. At Makora, we have built these using MakoraGenerate, our automated kernel generation software, and we serve the fast implementation on MakoraInference. With the custom MixFP4 kernels, we get an accuracy improvement with minimal loss in model speed.

Results

The MixFP4 checkpoint is about one third the size of the BF16 target model while matching it on MMLU-Pro instruct-mode evaluation. Among the Qwen3.6 quantized checkpoints we compared against, MixFP4 has the highest MMLU-Pro score, lowest KL divergence, and lowest WikiText-2 perplexity, as seen in the table below.

| Model | Checkpoint size (lower is better) | MMLU-Pro, instruct (higher is better) | KL divergence (lower is better) | WikiText-2 perplexity (lower is better) |
| --- | --- | --- | --- | --- |
| Qwen/Qwen3.6-35B-A3B (target model) | 66.99 GiB | 62.43% | N/A | 6.4574 |
| makora-ai/Qwen3.6-35B-A3B-MixFP4 (ours) | 21.29 GiB | 62.62% | 0.026935 | 6.5022 |
| nvidia/Qwen3.6-35B-A3B-NVFP4 | 21.85 GiB | 61.80% | 0.038476 | 6.6129 |
| unsloth/Qwen3.6-35B-A3B-NVFP4 | 23.01 GiB | 60.80% | 0.061846 | 6.6984 |

We measured MMLU-Pro on all 12,032 problems in instruct mode with reasoning disabled for all checkpoints. For KL divergence, we used 100 conversations of 256 tokens from Aeala/ShareGPT_Vicuna_unfiltered to compare each model’s token probabilities to the base model. For WikiText-2 perplexity, we used the full Salesforce/wikitext dataset’s wikitext-2-raw-v1 test split with Qwen tokenization, non-empty rows joined by blank-line separators, sequence length 2048, and stride 2048.
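For readers who want to reproduce similar numbers, the two distribution metrics can be sketched as follows. This is a minimal illustration rather than our evaluation harness: the function names are hypothetical, the KL direction (base model relative to quantized model) and the plain averaging over token positions are assumptions, and the logits and log-probabilities are expected to come from your own model runs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(base_logits, quant_logits):
    """KL(base || quant) in nats for a single token position."""
    lp = log_softmax(base_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def mean_kl(base_seq, quant_seq):
    """Average the per-position KL over a sequence of logit vectors."""
    kls = [token_kl(b, q) for b, q in zip(base_seq, quant_seq)]
    return sum(kls) / len(kls)


def windows(token_ids, seq_len=2048, stride=2048):
    """Split a token stream into evaluation windows.

    With stride == seq_len the windows do not overlap, so every token
    is scored exactly once.
    """
    return [token_ids[i:i + seq_len] for i in range(0, len(token_ids), stride)]


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over all scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A KL divergence of zero means the quantized model assigns exactly the same next-token probabilities as the base model, so lower values indicate a more faithful checkpoint.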

Get started

Our checkpoint can be used in transformers with its Hugging Face model ID: makora-ai/Qwen3.6-35B-A3B-MixFP4.

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "makora-ai/Qwen3.6-35B-A3B-MixFP4"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "What does int4 mean in the context of quantization? Respond in under 10 words."
    }
]

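# return_dict=True guarantees a dict-like result, so .input_ids is available
# regardless of the transformers version's default return type.
inputs = tokenizer.apply_chat_template(
    messages,
    enable_thinking=False,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).input_ids.to(model.device)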

outputs = model.generate(inputs, max_new_tokens=16, do_sample=False)

print(tokenizer.decode(outputs[0][inputs[0].shape[0]:], skip_special_tokens=True))
# Output:
# It represents 4-bit integer quantization.

Closing

Our checkpoint is available on Hugging Face now: https://huggingface.co/makora-ai/Qwen3.6-35B-A3B-MixFP4

Try it, measure its accuracy, or run it at full speed through MakoraInference!

References