Welcome to Ruhui’s web!

Project

Benchmarking Quantized Qwen3-0.6B

keywords: Qwen3, LLM, post-training quantization, SmoothQuant, GSM8K, HellaSwag, MMLU, IFEval, lm-eval-harness, int8 W8A8
date: March 2026

WikiText-2 perplexity is a coarse stress test: it scores the model on a single forward pass over text it never has to generate. To see whether the quantized model can actually decode correct answers, we ran four lm-eval-harness benchmarks via axmo.evalsuite with the actual decode loop and KV cache, not just prefill logits. This post is the companion to P25 (methods on WikiText-2) and P26 (size scaling): same model, real generation tasks.

Setup

Model: Qwen/Qwen3-0.6B-Base (the Qwen3-Base 0.6B, not the Qwen3.5-0.8B).
Tasks: 200-sample subset per benchmark, which gives a standard error of roughly ±3 points per score.
Decoding: batch_size=4, max_sequence_length=4096, A10 GPU, greedy decoding for generation tasks.
Calibration: 1000 WikiText-2 samples, 256-token blocks. INT8 path uses torch._int_mm.
Recipes evaluated (carried over from P25):
- sq α=0.8 — per-tensor Q/K + per-channel V (the weak +179% PPL line on 0.6B).
- sq-ph α=0.65 — per-head Q/K/V (the +16% PPL line; static activations).
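As a rough illustration of what α controls in the two recipes above, the sketch below applies the standard SmoothQuant scale migration, s_j = max|X_j|^α / max|W_j|^(1−α), to a toy activation/weight pair. This is the textbook formulation and not necessarily how the quantization library used for these runs implements it; the helper names (`smooth_scales`, `matmul`) and all numbers are made up for the example.

```python
def smooth_scales(act_absmax, weight_absmax, alpha):
    """Per-input-channel SmoothQuant scales: s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def matmul(x, w):
    """Plain (rows x inner) @ (inner x cols) product on nested lists."""
    return [[sum(xi * wj for xi, wj in zip(row, col)) for col in zip(*w)] for row in x]


# Toy activations (2 tokens x 3 channels); channel 1 is an outlier channel.
X = [[0.5, 40.0, -1.0],
     [-0.8, -36.0, 0.6]]
# Toy weights (3 input channels x 2 output features).
W = [[0.2, -0.1],
     [0.05, 0.3],
     [-0.4, 0.25]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(3)]
weight_absmax = [max(abs(v) for v in W[j]) for j in range(3)]

for alpha in (0.65, 0.8):
    s = smooth_scales(act_absmax, weight_absmax, alpha)
    X_s = [[v / s[j] for j, v in enumerate(row)] for row in X]  # X / s
    W_s = [[v * s[j] for v in W[j]] for j in range(3)]          # diag(s) @ W
    smoothed_absmax = [max(abs(row[j]) for row in X_s) for j in range(3)]
    print(alpha, [round(v, 3) for v in smoothed_absmax])
```

After smoothing, X_s @ W_s equals X @ W up to floating-point error, while the outlier channel's activation range shrinks. A larger α moves more of that range into the weights.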

Results

Recipe	GSM8K (strict)	HellaSwag (acc / acc_norm)	MMLU (acc)	IFEval (prompt-strict / inst-strict)
FP16 baseline	0.460	0.416 / 0.432	0.367	0.225 / 0.349
FQ `sq` α=0.8	0.055	0.346 / 0.364	OOM¹	0.160 / 0.296
Int8 `sq` α=0.8	0.050	0.348 / 0.366	OOM¹	0.150 / 0.302
FQ `sq-ph` α=0.65	0.330	0.378 / 0.426	OOM¹	0.250 / 0.368
Int8 `sq-ph` α=0.65	0.340	running²	running²	running²

¹ MMLU loglikelihood OOM'd in the quantized runs (a 1.6 GiB process on the GPU + batch_size=4 × seq=4096 left no headroom). Re-run with --batch_size 1 is queued.
² Int8 sq-ph still in its IFEval generation phase at time of writing.

1. The PPL recipe ranking holds across every completed task

Per-head Q/K/V (sq-ph) beats per-tensor Q/K (sq) on every benchmark with completed quantized results (MMLU is excluded because the quantized MMLU runs OOM'd), but the size of the gap varies wildly:

Task	FP16	`sq-ph` (FQ)	`sq` (FQ)	`sq-ph` − `sq`
GSM8K (strict)	0.460	0.330	0.055	+0.275
HellaSwag (acc_norm)	0.432	0.426	0.364	+0.062
IFEval (prompt-strict)	0.225	0.250	0.160	+0.090
IFEval (inst-strict)	0.349	0.368	0.296	+0.072

2. GSM8K exposes the main quantization gap

The clearest separation between the quantization recipes appears on GSM8K. sq-ph outperforms sq by 27.5 points, whereas the gap is only 3–9 points on HellaSwag and IFEval. This suggests that multi-step, chain-of-thought generation is substantially more sensitive to per-tensor activation quantization than likelihood scoring or constrained generation.
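To make the per-tensor versus per-head distinction concrete, here is a minimal sketch of symmetric int8 fake-quantization with one shared scale versus one scale per head. The two "heads" and their values are hypothetical, chosen only so that one head has a much smaller range than the other; the sketch shows the mechanism, not the measured behaviour of Qwen3.

```python
def quantize_dequantize(values, scale):
    """Symmetric int8 fake-quant: round to the int8 grid, clip, and map back to float."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Two hypothetical attention heads with very different activation ranges.
heads = {
    "head_small": [0.02, -0.05, 0.08, -0.10],
    "head_large": [4.0, -7.5, 10.0, -2.5],
}

# Per-tensor: one scale shared by every head, set by the global absmax.
tensor_scale = max(abs(v) for vals in heads.values() for v in vals) / 127

per_tensor_err, per_head_err = {}, {}
for name, vals in heads.items():
    head_scale = max(abs(v) for v in vals) / 127  # per-head: each head gets its own scale
    per_tensor_err[name] = mean_abs_error(vals, quantize_dequantize(vals, tensor_scale))
    per_head_err[name] = mean_abs_error(vals, quantize_dequantize(vals, head_scale))
    print(name, per_tensor_err[name], per_head_err[name])
```

With a shared scale, the small-range head is squeezed into one or two int8 levels, so its values are mostly rounded away. With per-head scales, the large-range head is unchanged and the small-range head keeps its resolution.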

On tasks that do not rely heavily on extended reasoning, sq-ph is approximately lossless relative to FP16. HellaSwag acc_norm is 0.426 versus 0.432 for FP16, while IFEval inst-strict is 0.368 versus 0.349. The IFEval score is slightly above the FP16 result, but the 1.9-point difference is within the roughly ±3-point evaluation uncertainty and should not be interpreted as an actual quantization improvement. Overall, per-head Q/K/V quantization preserves the baseline accuracy well on these tasks.

The real-INT8 results also closely match fake quantization. For sq α=0.8, the completed comparisons are GSM8K 0.055→0.050, HellaSwag 0.346→0.348, and IFEval prompt-strict 0.160→0.150. All differences are within approximately 1σ, indicating that convert_fake_to_real introduces no measurable additional accuracy degradation. This is consistent with the conclusion from the WikiText-2 experiments across the evaluated Qwen3 model sizes.
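The 1σ statement can be checked with a simple binomial approximation of the standard error, sqrt(p(1−p)/n) with n = 200. This is a simplified check that treats the two runs as independent; the stderr reported by lm-eval-harness may differ slightly, and the helper name `binomial_stderr` is only for this sketch.

```python
import math


def binomial_stderr(p, n):
    """Standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


N = 200  # samples per benchmark, from the setup above

# (task, fake-quant score, real-INT8 score) for sq alpha=0.8, copied from the results table.
pairs = [
    ("GSM8K (strict)", 0.055, 0.050),
    ("HellaSwag (acc)", 0.346, 0.348),
    ("IFEval (prompt-strict)", 0.160, 0.150),
]

within_one_sigma = {}
for task, fq, int8 in pairs:
    sigma = binomial_stderr(fq, N)
    within_one_sigma[task] = abs(fq - int8) <= sigma
    print(f"{task}: diff={abs(fq - int8):.3f}, stderr={sigma:.3f}")
```

Note that the standard error depends on the score itself: it is about 3.5 points near the FP16 GSM8K score of 0.460 but only about 1.6 points near 0.055.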

In practice, per-tensor Q/K quantization is unsuitable for math and reasoning workloads: it loses more than 40 points on GSM8K (0.460 to 0.055). Its degradation on HellaSwag and IFEval is much smaller, at roughly 5–7 points, so it may remain usable for cheaper likelihood-oriented deployment. However, because sq-ph has little additional practical cost while providing consistently better accuracy, it is the stronger default.
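The point drops quoted here follow directly from the results tables. A short sketch of the arithmetic, using the fake-quant rows (the dictionary layout and the `drop_points` helper are just for illustration):

```python
# Scores copied from the results tables (fake-quant rows).
fp16 = {"GSM8K": 0.460, "HellaSwag acc_norm": 0.432,
        "IFEval prompt-strict": 0.225, "IFEval inst-strict": 0.349}
sq = {"GSM8K": 0.055, "HellaSwag acc_norm": 0.364,
      "IFEval prompt-strict": 0.160, "IFEval inst-strict": 0.296}
sq_ph = {"GSM8K": 0.330, "HellaSwag acc_norm": 0.426,
         "IFEval prompt-strict": 0.250, "IFEval inst-strict": 0.368}


def drop_points(baseline, quantized):
    """Absolute drop versus the baseline, in percentage points (positive = worse)."""
    return {task: round((baseline[task] - quantized[task]) * 100, 1) for task in baseline}


sq_drop = drop_points(fp16, sq)
sq_ph_drop = drop_points(fp16, sq_ph)
for task in fp16:
    print(f"{task}: sq drops {sq_drop[task]:.1f} pts, sq-ph drops {sq_ph_drop[task]:.1f} pts")
```

Negative values for sq-ph on IFEval mean the quantized score is nominally above FP16, which is within the evaluation noise discussed above.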

If only one behavioral benchmark can be run, GSM8K provides the most sensitive regression signal among the tasks evaluated here. HellaSwag remains useful for checking likelihood-based accuracy. MMLU should be treated only as a coarse sanity check for Qwen3-0.6B because its FP16 score is close to the random-choice floor, while IFEval is difficult to interpret for a base model whose instruction-following capability is weak to begin with.

Remaining coverage gaps:

MMLU for the three quantized configurations: re-run with --batch_size 1. Because the FP16 baseline is only modestly above random chance, small quantization deltas should not be over-interpreted.
sq-ph α=0.65 real-INT8: currently running and will be added to the comparison.
sq-ph-dyn α=0.65: the best INT8 configuration by perplexity, at approximately +7%, but not yet evaluated on these benchmarks. It is expected to reduce the remaining GSM8K gap, although this still needs to be confirmed experimentally.
Other model sizes: the current results cover only Qwen3-0.6B. The perplexity experiments suggest that the quantization gap decreases with model scale, but downstream evaluation is still needed before concluding that Qwen3-4B is near FP16.

Practical takeaways

sq α=0.8: usable for likelihood-based tasks, although it incurs an absolute accuracy drop of roughly 7 points on HellaSwag. However, it severely degrades chain-of-thought reasoning (about 40 points on GSM8K) and should not be used for math-focused workloads.
sq-ph α=0.65: effectively lossless on likelihood-based tasks, but still shows a 12–13-point degradation on GSM8K (0.460 to 0.330 fake-quant, 0.340 real-INT8). It is the most practical baseline for general INT8 deployment.
sq-ph-dyn α=0.65 (pending): expected to better preserve reasoning performance and is the strongest candidate for math-aware INT8 deployment.
MMLU: useful mainly as a coarse regression check for Qwen3-0.6B. The FP baseline is only about 12 points above the 0.25 random-choice floor, so small differences between FP and INT8 are difficult to distinguish from benchmark variance.
IFEval: not a reliable quantization metric for Qwen3-0.6B-Base. IFEval measures instruction-following, which is not a stable capability of the base model. It is more meaningful for an instruction-tuned checkpoint.

Cross-links: P25 — full design-space ablation on Qwen3-0.6B; P26 — PPL scaling 0.6B → 14B.

Qwen VLM Quantization: Text-Decoder INT8 on Multimodal Benchmarks

keywords: Qwen3-VL, Qwen2.5-VL, VLM, vision-language model, post-training quantization, SmoothQuant, int8 W8A8, docvqa, chartqa, OCRBench-v2
date: May 2026

This post is the companion to the Qwen3 text-only posts, extended to vision-language models. The question is whether the SmoothQuant int8 recipes that hold up on Qwen3 LLMs (sq vs sq-ph vs sq-ph-dyn) still hold up when the model also has a vision tower and a multimodal merger step in the pipeline. We evaluate on Qwen3-VL-2B-Instruct and Qwen2.5-VL-3B-Instruct using docvqa, chartqa, ocr_recognition, and ocr_e2e from the OCRBench-v2 family.

Scope note — only the text decoder is quantized. The vision encoder (ViT-style) and the inputs_merger step that mixes image embeddings into the text-token stream stay in bfloat16. Calibration uses multimodal samples (image + text) but applies fake-quant nodes only to LLM-side linears. This bounds the int8 speedup story but matches what's typically shipped in VLM deployment today.

Setup

Models: Qwen/Qwen3-VL-2B-Instruct and Qwen/Qwen2.5-VL-3B-Instruct, both bf16.
Quantization: W8A8 fake-quant; real-int8 deploy path only validated on Qwen3-VL-2B.
Calibration: calib_medium_2048 — 2048 multimodal samples from the OCRBench-v2 calibration split.
Evaluation: Qwen3-VL-2B uses eval_medium_2048 (N = 1536 aggregated + 512 per-sub-type ocr_e2e). Qwen2.5-VL-3B uses the smaller eval_small_512 (N = 384 + 128).
Hardware: A10, CUDA 12.8.

Qwen3-VL-2B-Instruct (Qwen3-VL series)

Config	Overall	docvqa (ANLS@0.5)	chartqa (Relaxed Acc)	ocr_recognition (1−NED)	Δ vs float
Float baseline (bf16)	0.4291	0.4856	0.5371	0.2645	—
FQ W8A8 `sq` α=0.8 (per-tensor Q/K)	0.3739	0.4463	0.4668	0.2085	−5.5 pts
FQ W8A8 `sq-ph` α=0.8 (per-head Q/K/V)	0.4140	0.4845	0.5273	0.2303	−1.5 pts
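The Overall and Δ columns in this table can be reproduced from the three bucket scores, assuming Overall is the unweighted mean of docvqa, chartqa, and ocr_recognition. That aggregation rule is inferred from the numbers rather than stated by the evaluation scripts, so treat it as an assumption:

```python
# Per-bucket scores for Qwen3-VL-2B-Instruct, copied from the table above:
# (docvqa, chartqa, ocr_recognition)
rows = {
    "float bf16": (0.4856, 0.5371, 0.2645),
    "FQ sq a=0.8": (0.4463, 0.4668, 0.2085),
    "FQ sq-ph a=0.8": (0.4845, 0.5273, 0.2303),
}

# Assumption: Overall is the unweighted mean of the three buckets.
overall = {name: sum(scores) / len(scores) for name, scores in rows.items()}
delta_pts = {name: (overall[name] - overall["float bf16"]) * 100 for name in rows}

for name in rows:
    print(f"{name}: overall={overall[name]:.4f}, delta={delta_pts[name]:+.1f} pts")
```

Under that assumption the computed values match the reported 0.4291, 0.3739, and 0.4140 to four decimals, and the deltas round to −5.5 and −1.5 points.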

Cross-size float baseline only — Qwen/Qwen3-VL-4B-Instruct fake-quant OOM'd on the A10 during smoothquant statistics collection. Captured for the cross-size signal: 4B Overall = 0.4641 (+3.5 pts vs 2B at bf16); most of the lift is on chartqa (+8.4) and docvqa (+4.7). Fake-quant 4B is a re-run-when-GPU-frees-up TODO.

Qwen2.5-VL-3B-Instruct (Qwen2.5-VL series)

Config	Overall	docvqa	chartqa	ocr_recognition	Δ vs float
Float baseline (bf16)	0.4564	0.6079	0.5078	0.2536	—
FQ W8A8 `sq` α=0.65 (per-tensor Q/K)	0.3656	0.5346	0.3672	0.1951	−9.1 pts
FQ W8A8 `sq-ph` α=0.8^†	0.4871	0.6109	0.6016	0.2488	(suspect)

^† The sq-ph row is from an older run with no matching float-baseline log and outscores bf16, which shouldn't happen for an honest quantization. Most likely a sampling / checkpoint-state mismatch between the two runs. Reported for completeness but needs to be re-measured in one invocation alongside the baseline before being treated as a real finding.
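A check like the one described in this footnote is easy to automate. The sketch below flags any quantized configuration whose Overall score exceeds the float baseline by more than a tolerance; the tolerance value and variable names are illustrative and were not part of the original runs.

```python
# Overall scores for Qwen2.5-VL-3B-Instruct, copied from the table above.
float_overall = 0.4564
quantized_overall = {
    "FQ sq a=0.65": 0.3656,
    "FQ sq-ph a=0.8": 0.4871,
}

# Hypothetical rule of thumb: a quantized run that beats the float baseline by more
# than a small tolerance is flagged for re-measurement rather than reported as a win.
TOLERANCE = 0.01  # illustrative value, not a threshold used in the original runs

suspect = [name for name, score in quantized_overall.items()
           if score > float_overall + TOLERANCE]
print(suspect)
```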

Real-int8 deploy — speed vs accuracy tradeoff (Qwen3-VL-2B)

Converting fake-quant to real int8×int8 matmuls (via convert_fake_to_real) exposes a dtype-management choice in the dequantizer. The current state has two extremes:

Run	Overall	Wall time	ms / item	Speed vs FQ	Δ vs FQ
Fake-quant W8A8 (`sq-ph` α=0.8)	0.3942	58.5 min	9 140	1.00×	—
Real-int8 + fp32 cast (full model upcast)	0.3871	2h 56.7 min	27 600	3.02× slower	−0.7 pts
Real-int8 + bf16 dequant	0.3694	58.9 min	9 210	~parity	−2.5 pts
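The timing columns can be cross-checked against each other. A small sketch of the arithmetic, using only the wall times and per-item latencies from the table (the dictionary keys are shorthand for the three rows):

```python
# Wall time and per-item latency copied from the table above.
runs = {
    "fake-quant": {"wall_min": 58.5, "ms_per_item": 9140},
    "real-int8 fp32 cast": {"wall_min": 2 * 60 + 56.7, "ms_per_item": 27600},
    "real-int8 bf16 dequant": {"wall_min": 58.9, "ms_per_item": 9210},
}

base = runs["fake-quant"]
slowdown = {name: r["wall_min"] / base["wall_min"] for name, r in runs.items()}
# Items implied by dividing total wall time by per-item latency.
implied_items = {name: r["wall_min"] * 60_000 / r["ms_per_item"] for name, r in runs.items()}

for name in runs:
    print(f"{name}: {slowdown[name]:.2f}x, ~{implied_items[name]:.0f} items")
```

The fp32-cast run comes out at about 3.02× the fake-quant wall time, and all three rows imply roughly the same number of evaluated items, so the latency comparison is like for like.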

Either you keep the model upcast to fp32 (paying 3× wall-time) or you do the dequant in bf16 (parity speed but eating 2.5 accuracy points). The fix being explored is fp32-internal-math + bf16-output in RealDequantizer — promote the int32 accumulator to fp32 for the scale arithmetic, then cast to bf16 only on the final output. Preliminary expectation: lands near the fp32-cast accuracy without the speed tax.
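The proposed fix can be sketched with a scalar emulation of the two dequantization paths. This is not the RealDequantizer code: it assumes the bf16 path rounds the accumulator, both scales, and each product to bf16, and it emulates bf16 by rounding a float32 bit pattern to its top 16 bits. All function names and sample values are hypothetical.

```python
import struct


def to_fp32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]


def to_bf16(x):
    """Round to the nearest bfloat16 value (round-to-nearest-even on the top 16 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", float(x)))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def dequant_bf16(acc, s_x, s_w):
    """All-bf16 path: accumulator, scales, and every product are rounded to bf16."""
    scale = to_bf16(to_bf16(s_x) * to_bf16(s_w))
    return to_bf16(to_bf16(acc) * scale)


def dequant_fp32_internal(acc, s_x, s_w):
    """fp32 scale arithmetic on the int32 accumulator, one bf16 cast on the output."""
    scale = to_fp32(to_fp32(s_x) * to_fp32(s_w))
    return to_bf16(to_fp32(to_fp32(acc) * scale))


# Hypothetical int32 accumulators and activation/weight scales.
accumulators = [123457, -98765, 40321, 7, -250001]
s_x, s_w = 0.0173, 0.00219

err_bf16 = err_fp32_internal = ref_total = 0.0
for acc in accumulators:
    ref = acc * s_x * s_w  # float64 reference
    err_bf16 += abs(dequant_bf16(acc, s_x, s_w) - ref)
    err_fp32_internal += abs(dequant_fp32_internal(acc, s_x, s_w) - ref)
    ref_total += abs(ref)

print(err_bf16 / ref_total, err_fp32_internal / ref_total)
```

The fp32-internal path pays a single bf16 rounding at the end, while the all-bf16 path stacks several roundings, including one on an accumulator that needs far more than 8 bits of mantissa. Whether this accounts for the full 2.5-point gap still has to be measured.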

Findings

Recipe ranking from text-only LLM transfers to VLM. Per-head Q/K/V (sq-ph) beats per-tensor Q/K (sq) on every bucket and on overall. Qwen3-VL-2B: sq-ph recovers 4 of the 5.5-pt drop. Qwen2.5-VL-3B: sq alone costs 9 pts; sq-ph is expected to recover most of that, but the current sq-ph row is suspect and has to be re-measured before this can be claimed.
OCR e2e sub-types are noisy at N ≈ 30 per sub-type. 6 of 17 sub-types sit at 0.0 even at float (APP-agent, VQA-with-position, KIE, key-info-mapping, math-QA, cognition-VQA on the 2B) — these are noise floor, not quantization signal. The diagnostic sub-types are chart-parsing, document-parsing, table-parsing (1−NED metrics that move smoothly), and text-grounding — the most quant-fragile sub-task (IoU drops to ~0 under sq, recovered partly under sq-ph).
Vision encoder being left float doesn't hurt accuracy but bounds the int8 speedup story — text-decoder linears are quantized, but per-image vision-tower work and the bf16 inputs_merger cap the achievable speedup. Quantizing the vision tower is straightforward (same prepare_model_for_optimization flow we use on DINOv2-Seg / DINOv2-Depth), just not wired up in the VLM scripts yet.
Float-baseline scaling holds. Qwen3-VL going 2B → 4B at bf16 is +3.5 pts overall — chartqa and docvqa gain the most, ocr_recognition slightly regresses (likely a 4B checkpoint quirk on that specific bucket).
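For the sub-type noise point above, a back-of-the-envelope bound shows why N ≈ 30 is too small to read small deltas. The sketch assumes the 512 ocr_e2e items are split roughly evenly across the 17 sub-types and uses the fact that any per-item score bounded in [0, 1] has variance of at most 0.25:

```python
import math

E2E_SAMPLES = 512   # ocr_e2e items in eval_medium_2048, from the setup above
SUB_TYPES = 17      # number of ocr_e2e sub-types

# Assumption: items are spread roughly evenly across sub-types.
per_sub_type = E2E_SAMPLES / SUB_TYPES

# Worst case for a per-item score in [0, 1]: variance 0.25 (a fair pass/fail draw).
worst_case_stderr = math.sqrt(0.25 / per_sub_type)

print(f"~{per_sub_type:.0f} items per sub-type, "
      f"worst-case stderr ~{worst_case_stderr * 100:.0f} pts")
```

This is an upper bound of about 9 points per sub-type; smoothly varying metrics such as 1−NED will usually sit well below it, which is consistent with those sub-types being the more useful diagnostics.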

Coverage gaps

Qwen3-VL-4B fake-quant — OOM'd on A10; needs a larger card or --batch_size 1 re-run.
Qwen2.5-VL-3B sq-ph α=0.8 with a matching baseline in one invocation.
sq-ph-dyn (per-token dynamic activations — the best int8 line for text-only LLMs at +7% PPL on Qwen3-0.6B) — not measured on any VLM yet.
Vision-tower quantization — currently bf16 everywhere; capability exists in quantlib (validated on DINOv2-Seg / DINOv2-Depth), not wired in the VLM scripts.

For text-only Qwen3 LLM design-space ablation and cross-size scaling that motivate these VLM recipes, see the Qwen3 methods and size-scaling posts. The recipe ranking story is the same; what changes for VLMs is the deployment shape (vision tower + merger + LLM) and which buckets are quant-fragile.

INT8 Whisper Speech-to-Text

keywords: Whisper, speech-to-text, ASR, post-training quantization, int8
date: TBD

Update later
