Project
Benchmarking Quantized Qwen3-0.6B
keywords: Qwen3, LLM, post-training quantization, SmoothQuant, GSM8K, HellaSwag, MMLU, IFEval, lm-eval-harness, int8 W8A8
date: March.2026
WikiText-2 perplexity is a coarse stress test: it scores the model on a single forward pass over text it never has to generate. To see whether the quantized model can actually decode correct answers, we ran four lm-eval-harness benchmarks via axmo.evalsuite with the actual decode loop and KV cache, not just prefill logits. This post is the companion to P25 (methods on WikiText-2) and P26 (size scaling): same model, real generation tasks.
Setup
- Model: `Qwen/Qwen3-0.6B-Base` (the Qwen3-Base 0.6B, not the Qwen3.5-0.8B).
- Tasks: 200-sample subset per benchmark, ±~3 pt stderr.
- Decoding: `batch_size=4`, `max_sequence_length=4096`, A10 GPU, greedy decoding for generation tasks.
- Calibration: 1000 WikiText-2 samples, 256-token blocks. INT8 path uses `torch._int_mm`.
- Recipes evaluated (carried over from P25):
  - `sq` α=0.8: per-tensor Q/K + per-channel V (the weak +179% PPL line on 0.6B).
  - `sq-ph` α=0.65: per-head Q/K/V (the +16% PPL line; static activations).
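For readers without P25 at hand, the α in the recipe names can be read as the SmoothQuant migration strength. The sketch below assumes the standard SmoothQuant per-channel smoothing formula, s_j = max|X_j|^α / max|W_j|^(1−α). The function name and the toy statistics are illustrative and are not taken from the P25 code, which may differ in detail.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha):
    """Per-input-channel SmoothQuant scales: s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


# Toy per-channel statistics (hypothetical): channel 0 has an activation outlier.
act_absmax = [16.0, 1.0]
weight_absmax = [1.0, 1.0]

for alpha in (0.65, 0.8):
    scales = smoothing_scales(act_absmax, weight_absmax, alpha)
    # Activations are divided by s and weights multiplied by s, so X @ W is unchanged.
    smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
    smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]
    print(alpha, [round(v, 3) for v in smoothed_act], [round(v, 3) for v in smoothed_weight])
```

Under this definition a larger α moves more of the outlier range out of the activations and into the weights, which is the knob the two recipes set differently.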
Results
| Recipe | GSM8K (strict) | HellaSwag (acc / acc_norm) | MMLU (acc) | IFEval (prompt-strict / inst-strict) |
|---|---|---|---|---|
| FP16 baseline | 0.460 | 0.416 / 0.432 | 0.367 | 0.225 / 0.349 |
| FQ sq α=0.8 | 0.055 | 0.346 / 0.364 | OOM¹ | 0.160 / 0.296 |
| Int8 sq α=0.8 | 0.050 | 0.348 / 0.366 | OOM¹ | 0.150 / 0.302 |
| FQ sq-ph α=0.65 | 0.330 | 0.378 / 0.426 | OOM¹ | 0.250 / 0.368 |
| Int8 sq-ph α=0.65 | 0.340 | running² | running² | running² |

¹ MMLU loglikelihood OOM'd in the quantized runs (a 1.6 GiB process on the GPU + batch_size=4 × seq=4096 left no headroom). Re-run with --batch_size 1 is queued.
² Int8 sq-ph still in its IFEval generation phase at time of writing.
1. The PPL recipe ranking holds across every task
Per-head Q/K/V (sq-ph) beats per-tensor Q/K (sq) on every benchmark — but the size of the gap varies wildly:
| Task | FP16 | sq-ph (FQ) | sq (FQ) | sq-ph − sq |
|---|---|---|---|---|
| GSM8K (strict) | 0.460 | 0.330 | 0.055 | +0.275 |
| HellaSwag (acc_norm) | 0.432 | 0.426 | 0.364 | +0.062 |
| IFEval (prompt-strict) | 0.225 | 0.250 | 0.160 | +0.090 |
| IFEval (inst-strict) | 0.349 | 0.368 | 0.296 | +0.072 |
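One way to see why scale granularity matters is a toy comparison of a single shared scale against one scale per head. This is a minimal sketch assuming symmetric absmax int8 fake-quantization. The helper names and the two-head toy values are hypothetical and do not reproduce the actual sq / sq-ph implementation.

```python
QMAX = 127  # symmetric int8 range


def fake_quant(values, scale):
    """Quantize to int8 with the given scale, then dequantize back to float."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]


def mean_abs_error(values, scale):
    quantized = fake_quant(values, scale)
    return sum(abs(v - q) for v, q in zip(values, quantized)) / len(values)


def absmax_scale(values):
    return max(abs(v) for v in values) / QMAX


# Hypothetical activations for two attention heads; head 1 carries large outliers.
heads = [
    [0.02, -0.05, 0.04, 0.01],
    [6.0, -8.0, 3.0, 7.5],
]

# Per-tensor: one scale shared by every head, dominated by the outlier head.
shared_scale = absmax_scale([v for head in heads for v in head])
per_tensor_err = [mean_abs_error(head, shared_scale) for head in heads]

# Per-head: each head gets its own scale.
per_head_err = [mean_abs_error(head, absmax_scale(head)) for head in heads]

print("per-tensor error by head:", per_tensor_err)
print("per-head error by head:  ", per_head_err)
```

In this toy case the small-magnitude head is rounded almost entirely to zero under the shared scale, while the outlier head is unaffected by the choice.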
2. GSM8K exposes the main quantization gap
The clearest separation between the quantization recipes appears on GSM8K. sq-ph outperforms sq by 27.5 points, whereas the gap is only 3–9 points on HellaSwag and IFEval. This suggests that multi-step, chain-of-thought generation is substantially more sensitive to per-tensor activation quantization than likelihood scoring or constrained generation.
On tasks that do not rely heavily on extended reasoning, sq-ph is approximately lossless relative to FP16. HellaSwag acc_norm is 0.426 versus 0.432 for FP16, while IFEval inst-strict is 0.368 versus 0.349. The latter is slightly above the FP16 result, but the difference is within the evaluation uncertainty and should not be interpreted as an actual quantization improvement. Overall, per-head Q/K/V quantization preserves the baseline accuracy well on these tasks.
The real-INT8 results also closely match fake quantization. For sq α=0.8, the completed comparisons are GSM8K 0.055→0.050, HellaSwag 0.346→0.348, and IFEval prompt-strict 0.160→0.150. All differences are within approximately 1σ, indicating that convert_fake_to_real introduces no measurable additional accuracy degradation. This is consistent with the conclusion from the WikiText-2 experiments across the evaluated Qwen3 model sizes.
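The "within 1σ" reading can be checked with a rough binomial standard error at N = 200. This is a sketch under a simplifying assumption: lm-eval-harness reports its own stderr, and because both runs score the same 200 samples a paired comparison would be tighter, so the numbers below are only a yardstick.

```python
import math

N = 200  # samples per benchmark subset


def binomial_stderr(p, n=N):
    """Standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# (fake-quant score, real-INT8 score) for sq alpha=0.8, taken from the results table.
pairs = {
    "GSM8K (strict)": (0.055, 0.050),
    "HellaSwag (acc)": (0.346, 0.348),
    "IFEval (prompt-strict)": (0.160, 0.150),
}

sigmas = {}
for task, (fq, int8) in pairs.items():
    sigmas[task] = abs(int8 - fq) / binomial_stderr(fq)
    print(f"{task}: diff={int8 - fq:+.3f}, stderr={binomial_stderr(fq):.3f}, {sigmas[task]:.2f} sigma")
```

The same helper gives about 0.035 for the FP16 GSM8K score of 0.460, in line with the ±~3 pt figure quoted in the setup.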
In practice, per-tensor Q/K quantization is unsuitable for math and reasoning workloads: it loses more than 40 points on GSM8K. Its degradation on HellaSwag and IFEval is much smaller, at roughly 5–10 points, so it may remain usable for cheaper likelihood-oriented deployment. However, because sq-ph has little additional practical cost while providing consistently better accuracy, it is the stronger default.
If only one behavioral benchmark can be run, GSM8K provides the most sensitive regression signal among the tasks evaluated here. HellaSwag remains useful for checking likelihood-based accuracy. MMLU should be treated only as a coarse sanity check for Qwen3-0.6B because its FP16 score is close to the random-choice floor, while IFEval is difficult to interpret for a base model whose instruction-following capability is weak to begin with.
Remaining coverage gaps:
- MMLU for the three quantized configurations: re-run with `--batch_size 1`. Because the FP16 baseline is only modestly above random chance, small quantization deltas should not be over-interpreted.
- `sq-ph` α=0.65 real-INT8: currently running and will be added to the comparison.
- `sq-ph-dyn` α=0.65: the best INT8 configuration by perplexity, at approximately +7%, but not yet evaluated on these benchmarks. It is expected to reduce the remaining GSM8K gap, although this still needs to be confirmed experimentally.
- Other model sizes: the current results cover only Qwen3-0.6B. The perplexity experiments suggest that the quantization gap decreases with model scale, but downstream evaluation is still needed before concluding that Qwen3-4B is near FP16.
Practical takeaways
- `sq` α=0.8: usable for likelihood-based tasks, although it incurs an absolute accuracy drop of roughly 5–10 points. However, it severely degrades chain-of-thought reasoning and should not be used for math-focused workloads.
- `sq-ph` α=0.65: effectively lossless on likelihood-based tasks, but still shows a roughly 12-point degradation on GSM8K. It is the most practical baseline for general INT8 deployment.
- `sq-ph-dyn` α=0.65 (pending): expected to better preserve reasoning performance and is the strongest candidate for math-aware INT8 deployment.
- MMLU: useful mainly as a coarse regression check for Qwen3-0.6B. The FP baseline is only about 12 points above the 0.25 random-choice floor, so small differences between FP and INT8 are difficult to distinguish from benchmark variance.
- IFEval: not a reliable quantization metric for Qwen3-0.6B-Base. IFEval measures instruction-following, which is not a stable capability of the base model. It is more meaningful for an instruction-tuned checkpoint.
Cross-links: P25 — full design-space ablation on Qwen3-0.6B; P26 — PPL scaling 0.6B → 14B.
Qwen VLM Quantization: Text-Decoder INT8 on Multimodal Benchmarks
keywords: Qwen3-VL, Qwen2.5-VL, VLM, vision-language model, post-training quantization, SmoothQuant, int8 W8A8, docvqa, chartqa, OCRBench-v2
date: May.2026
The companion to the Qwen3 text-only posts, on vision-language models. The question is whether the SmoothQuant int8 recipes that hold up on Qwen3 LLMs (sq vs sq-ph vs sq-ph-dyn) still hold up when the model also has a vision tower and a multimodal merger step in the pipeline. We evaluate on Qwen3-VL-2B-Instruct and Qwen2.5-VL-3B-Instruct using docvqa, chartqa, ocr_recognition, and ocr_e2e from the OCRBench-v2 family.
Scope note — only the text decoder is quantized. The vision encoder (ViT-style) and the inputs_merger step that mixes image embeddings into the text-token stream stay in bfloat16. Calibration uses multimodal samples (image + text) but applies fake-quant nodes only to LLM-side linears. This bounds the int8 speedup story but matches what's typically shipped in VLM deployment today.
Setup
- Models: `Qwen/Qwen3-VL-2B-Instruct` and `Qwen/Qwen2.5-VL-3B-Instruct`, both bf16.
- Quantization: W8A8 fake-quant; real-int8 deploy path only validated on Qwen3-VL-2B.
- Calibration: `calib_medium_2048`, 2048 multimodal samples from the OCRBench-v2 calibration split.
- Evaluation: Qwen3-VL-2B uses `eval_medium_2048` (N = 1536 aggregated + 512 per-sub-type ocr_e2e). Qwen2.5-VL-3B uses the smaller `eval_small_512` (N = 384 + 128).
- Hardware: A10, CUDA 12.8.
Qwen3-VL-2B-Instruct (Qwen3-VL series)
| Config | Overall | docvqa (ANLS@0.5) | chartqa (Relaxed Acc) | ocr_recognition (1−NED) | Δ vs float |
|---|---|---|---|---|---|
| Float baseline (bf16) | 0.4291 | 0.4856 | 0.5371 | 0.2645 | — |
| FQ W8A8 sq α=0.8 (per-tensor Q/K) | 0.3739 | 0.4463 | 0.4668 | 0.2085 | −5.5 pts |
| FQ W8A8 sq-ph α=0.8 (per-head Q/K/V) | 0.4140 | 0.4845 | 0.5273 | 0.2303 | −1.5 pts |
Cross-size float baseline only — Qwen/Qwen3-VL-4B-Instruct fake-quant OOM'd on the A10 during smoothquant statistics collection. Captured for the cross-size signal: 4B Overall = 0.4641 (+3.5 pts vs 2B at bf16); most of the lift is on chartqa (+8.4) and docvqa (+4.7). Fake-quant 4B is a re-run-when-GPU-frees-up TODO.
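The three bucket metrics named in the table header can be sketched as follows. These are the commonly used definitions: ANLS with a 0.5 threshold, relaxed accuracy with a 5% numeric tolerance, and 1−NED normalized by the longer string. The exact normalization and answer cleanup in the evaluation harness are assumptions here, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def one_minus_ned(pred, target):
    """1 - normalized edit distance, normalized by the longer string."""
    longest = max(len(pred), len(target))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, target) / longest


def anls(pred, target, threshold=0.5):
    """Similarity is kept only if the normalized distance is below the threshold."""
    score = one_minus_ned(pred.strip().lower(), target.strip().lower())
    return score if (1.0 - score) < threshold else 0.0


def relaxed_match(pred, target, tolerance=0.05):
    """Numeric answers may deviate by `tolerance`; everything else must match exactly."""
    try:
        p, t = float(pred), float(target)
    except ValueError:
        return pred.strip().lower() == target.strip().lower()
    if t == 0.0:
        return p == 0.0
    return abs(p - t) / abs(t) <= tolerance


print(anls("Kitten", "sitting"), one_minus_ned("kitten", "sitting"), relaxed_match("102", "100"))
```

Benchmarks with several reference answers per question typically take the best score over the references; that step is omitted here.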
Qwen2.5-VL-3B-Instruct (Qwen2.5-VL series)
| Config | Overall | docvqa | chartqa | ocr_recognition | Δ vs float |
|---|---|---|---|---|---|
| Float baseline (bf16) | 0.4564 | 0.6079 | 0.5078 | 0.2536 | — |
| FQ W8A8 sq α=0.65 (per-tensor Q/K) | 0.3656 | 0.5346 | 0.3672 | 0.1951 | −9.1 pts |
| FQ W8A8 sq-ph α=0.8† | 0.4871 | 0.6109 | 0.6016 | 0.2488 | (suspect) |
† The sq-ph row is from an older run with no matching float-baseline log and outscores bf16, which shouldn't happen for an honest quantization. Most likely a sampling / checkpoint-state mismatch between the two runs. Reported for completeness but needs to be re-measured in one invocation alongside the baseline before being treated as a real finding.
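The Δ vs float columns in the two tables can be recomputed directly from the Overall values. The sketch below also checks an observation that the post does not state explicitly: in every row, Overall appears to equal the unweighted mean of the three listed buckets. Treat that as an inference from the numbers, not a documented definition.

```python
# Rows copied from the two tables: (overall, docvqa, chartqa, ocr_recognition).
tables = {
    "Qwen3-VL-2B": {
        "float": (0.4291, 0.4856, 0.5371, 0.2645),
        "sq": (0.3739, 0.4463, 0.4668, 0.2085),
        "sq-ph": (0.4140, 0.4845, 0.5273, 0.2303),
    },
    "Qwen2.5-VL-3B": {
        "float": (0.4564, 0.6079, 0.5078, 0.2536),
        "sq": (0.3656, 0.5346, 0.3672, 0.1951),
        "sq-ph": (0.4871, 0.6109, 0.6016, 0.2488),  # the suspect row
    },
}


def delta_points(row, baseline):
    """Overall difference versus the float baseline, in accuracy points."""
    return round((row[0] - baseline[0]) * 100, 1)


deltas = {}
for model, rows in tables.items():
    for name, row in rows.items():
        overall, *buckets = row
        # Observation, not a documented definition: Overall matches the bucket mean.
        assert abs(sum(buckets) / len(buckets) - overall) < 5e-4
        if name != "float":
            deltas[(model, name)] = delta_points(row, rows["float"])

print(deltas)
```

On these numbers the suspect Qwen2.5-VL-3B sq-ph row comes out about 3.1 points above its bf16 baseline, which is the anomaly the footnote flags.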
Real-int8 deploy — speed vs accuracy tradeoff (Qwen3-VL-2B)
Converting fake-quant to real int8×int8 matmuls (via convert_fake_to_real) exposes a dtype-management choice in the dequantizer. The current state has two extremes:
| Run | Overall | Wall time | ms / item | Speed vs FQ | Δ vs FQ |
|---|---|---|---|---|---|
| Fake-quant W8A8 (sq-ph α=0.8) | 0.3942 | 58.5 min | 9 140 | 1.00× | — |
| Real-int8 + fp32 cast (full model upcast) | 0.3871 | 2h 56.7 min | 27 600 | 3.02× slower | −0.7 pts |
| Real-int8 + bf16 dequant | 0.3694 | 58.9 min | 9 210 | ~parity | −2.5 pts |
Either you keep the model upcast to fp32 (paying 3× wall-time) or you do the dequant in bf16 (parity speed but eating 2.5 accuracy points). The fix being explored is fp32-internal-math + bf16-output in RealDequantizer — promote the int32 accumulator to fp32 for the scale arithmetic, then cast to bf16 only on the final output. Preliminary expectation: lands near the fp32-cast accuracy without the speed tax.
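The numeric idea behind fp32-internal-math + bf16-output can be illustrated with a stdlib-only simulation. This is not the RealDequantizer code: the rounding helpers emulate fp32 and bf16 on Python floats, the accumulator and scale values are made up, and how the current bf16 path rounds its intermediates is an assumption.

```python
import struct


def to_fp32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest bfloat16 value (round-to-nearest-even on the fp32 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dequant_fp32(acc, x_scale, w_scale):
    """Reference: all scale arithmetic and the output stay in fp32."""
    scale = to_fp32(to_fp32(x_scale) * to_fp32(w_scale))
    return to_fp32(float(acc) * scale)


def dequant_bf16(acc, x_scale, w_scale):
    """Assumed all-bf16 path: accumulator, scales, and each product rounded to bf16."""
    scale = to_bf16(to_bf16(x_scale) * to_bf16(w_scale))
    return to_bf16(to_bf16(float(acc)) * scale)


def dequant_mixed(acc, x_scale, w_scale):
    """Proposed shape: fp32 internal math, a single cast to bf16 on the output."""
    return to_bf16(dequant_fp32(acc, x_scale, w_scale))


# Hypothetical int32 accumulators and per-tensor scales.
accumulators = [12345, -98765, 4321, 700001]
x_scale, w_scale = 0.0123, 0.00456

for acc in accumulators:
    ref = dequant_fp32(acc, x_scale, w_scale)
    err_bf16 = abs(dequant_bf16(acc, x_scale, w_scale) - ref)
    err_mixed = abs(dequant_mixed(acc, x_scale, w_scale) - ref)
    print(f"acc={acc}: ref={ref:.6f}, all-bf16 err={err_bf16:.6f}, mixed err={err_mixed:.6f}")
```

Because the mixed path rounds the fp32 result once, its output is the bf16 value closest to the fp32 reference, so its error can never exceed that of the all-bf16 path. Whether this recovers the measured 2.5 points is still the open question above.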
Findings
- Recipe ranking from text-only LLM transfers to VLM. Per-head Q/K/V (`sq-ph`) beats per-tensor Q/K (`sq`) on every bucket and on overall. Qwen3-VL-2B: `sq-ph` recovers 4 of the 5.5-pt drop. Qwen2.5-VL-3B: `sq` alone costs 9 pts, almost certainly fixable with `sq-ph` once that row is re-measured.
- OCR e2e sub-types are noisy at N ≈ 30 per sub-type. 6 of 17 sub-types sit at 0.0 even at float (APP-agent, VQA-with-position, KIE, key-info-mapping, math-QA, cognition-VQA on the 2B); these are noise floor, not quantization signal. The diagnostic sub-types are chart-parsing, document-parsing, table-parsing (1−NED metrics that move smoothly), and text-grounding, the most quant-fragile sub-task (IoU drops to ~0 under `sq`, recovered partly under `sq-ph`).
- Vision encoder being left float doesn't hurt accuracy but bounds the int8 speedup story: text-decoder linears are quantized, but per-image vision-tower work and the bf16 `inputs_merger` cap the achievable speedup. Quantizing the vision tower is straightforward (same `prepare_model_for_optimization` flow we use on DINOv2-Seg / DINOv2-Depth), just not wired up in the VLM scripts yet.
- Float-baseline scaling holds. Qwen3-VL going 2B → 4B at bf16 is +3.5 pts overall; chartqa and docvqa gain the most, ocr_recognition slightly regresses (likely a 4B checkpoint quirk on that specific bucket).
Coverage gaps
- Qwen3-VL-4B fake-quant: OOM'd on A10; needs a larger card or `--batch_size 1` re-run.
- Qwen2.5-VL-3B `sq-ph` α=0.8 with a matching baseline in one invocation.
- `sq-ph-dyn` (per-token dynamic activations, the best int8 line for text-only LLMs at +7% PPL on Qwen3-0.6B): not measured on any VLM yet.
- Vision-tower quantization: currently bf16 everywhere; capability exists in `quantlib` (validated on DINOv2-Seg / DINOv2-Depth), not wired in the VLM scripts.
For text-only Qwen3 LLM design-space ablation and cross-size scaling that motivate these VLM recipes, see the Qwen3 methods and size-scaling posts. The recipe ranking story is the same; what changes for VLMs is the deployment shape (vision tower + merger + LLM) and which buckets are quant-fragile.
INT8 Whisper Speech-to-Text
keywords: Whisper, speech-to-text, ASR, post-training quantization, int8
date: TBD
Update later