Hardware selection for DCPF: serving Qwen3-VL-8B-Instruct (GGUF Q4_K_M)
Prepared Jun 2026
| GPU | VRAM (GB) | Price / card (USD) | Cards to fit | Deploy cost (USD) | Prefill (tok/s) | Decode c=1 (tok/s) | Decode c=4 (tok/s) | Decode np32 (tok/s) | $ / prefill tok/s | $ / decode tok/s (c=4) |
|---|---|---|---|---|---|---|---|---|---|---|
Blackwell 4500 figures are estimated, not measured. Prefill is scaled from the Blackwell 6000 by the CUDA-core ratio (10,496 / 24,064 ≈ 0.44) and decode by the memory-bandwidth ratio (800 / 1,792 GB/s ≈ 0.45), since prefill is primarily compute-bound and decode primarily bandwidth-bound. Treat these as planning estimates: the card's lower 165 W TDP means real prefill may land somewhat below the compute-scaled figure. The 4500 was not run in the np scale-out test.
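The scaling described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming purely linear scaling and using the Blackwell 6000 baselines quoted elsewhere in this document (2,613 tok/s prefill, 466 tok/s decode at c=4). The function and constant names are illustrative and do not come from the original data block.

```python
# Spec ratios quoted in the text (Blackwell 4500 / Blackwell 6000).
CUDA_CORE_RATIO = 10_496 / 24_064   # used to scale prefill (compute-bound)
BANDWIDTH_RATIO = 800 / 1_792       # GB/s, used to scale decode (bandwidth-bound)

# Measured Blackwell 6000 baselines quoted in this document.
B6000_PREFILL_TOK_S = 2_613
B6000_DECODE_C4_TOK_S = 466


def scale_estimate(measured_tok_s: float, ratio: float) -> float:
    """Linear scaling; assumes the phase is bound purely by the scaled resource."""
    return measured_tok_s * ratio


est_prefill = scale_estimate(B6000_PREFILL_TOK_S, CUDA_CORE_RATIO)
est_decode_c4 = scale_estimate(B6000_DECODE_C4_TOK_S, BANDWIDTH_RATIO)

print(f"prefill ~{est_prefill:.0f} tok/s, decode c=4 ~{est_decode_c4:.0f} tok/s")
# prints: prefill ~1140 tok/s, decode c=4 ~208 tok/s
```

Because the model is linear, it ignores power limits and clock differences, which is why the measured prefill may come in below the roughly 1,140 tok/s it predicts.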
At Q4_K_M the model fits on one 16 GB T4, which gives the T4 the lowest deploy cost ($2,000) and a strong $/decode tok/s. Caveat: its prefill is the weakest (205 tok/s), which matters for this image-heavy VL workload. It is fine for text-light, latency-tolerant serving.
The L4 ($2,500) more than doubles the T4's image prefill (471 vs. 205 tok/s) for ~$500 more, fits on one card, and scales to ~490 tok/s decode at np32. It is the safe default for mixed text+image workloads at small scale.
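To make the cost-efficiency comparison concrete, the sketch below computes dollars per prefill tok/s for the two single-card options. It assumes the metric is simply deploy cost divided by measured throughput, which is how the column headers read but is not confirmed by the source. The helper name `dollars_per_tok_s` is hypothetical.

```python
def dollars_per_tok_s(deploy_cost_usd: float, throughput_tok_s: float) -> float:
    """Assumed definition: deploy cost divided by throughput (lower is better)."""
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    return deploy_cost_usd / throughput_tok_s


# (deploy cost in USD, prefill tok/s), both taken from the text for one-card setups.
cards = {
    "T4": (2_000, 205),
    "L4": (2_500, 471),
}

prefill_cost = {
    name: dollars_per_tok_s(cost, prefill) for name, (cost, prefill) in cards.items()
}

for name, value in prefill_cost.items():
    print(f"{name}: ${value:.2f} per prefill tok/s")
# prints:
# T4: $9.76 per prefill tok/s
# L4: $5.31 per prefill tok/s
```

Under this definition the L4 costs more up front but buys prefill throughput at roughly half the T4's price per tok/s.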
The DGX Spark scales furthest in the parallelization test (611 tok/s at np32) and its 128 GB unified memory leaves huge KV-cache headroom for high concurrency or larger models later.
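A rough KV-cache estimate shows why memory headroom matters at high concurrency. The sketch below is illustrative only. The layer count, KV-head count, and head dimension are assumed values for the model's text backbone and should be checked against the actual model config. The fp16 cache, the 8,192-token context per slot, and the reading of np32 as 32 parallel slots are likewise assumptions rather than figures from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int  # 2 for an fp16 cache

    def bytes_per_token(self) -> int:
        # One key vector and one value vector per layer, per KV head.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem


# ASSUMED shape for the text backbone; verify against the model's config file.
ASSUMED_SHAPE = KVShape(layers=36, kv_heads=8, head_dim=128, bytes_per_elem=2)


def kv_cache_gib(shape: KVShape, slots: int, ctx_per_slot: int) -> float:
    """KV-cache size in GiB when every slot is filled to ctx_per_slot tokens."""
    return shape.bytes_per_token() * slots * ctx_per_slot / 2**30


# Hypothetical serving config: 32 parallel slots, 8,192 tokens of context each.
full_cache = kv_cache_gib(ASSUMED_SHAPE, slots=32, ctx_per_slot=8_192)
print(f"{ASSUMED_SHAPE.bytes_per_token()} bytes/token, {full_cache:.1f} GiB at np32")
# prints: 147456 bytes/token, 36.0 GiB at np32
```

Under these assumptions a fully populated cache would exceed a 16 GB card before weights are counted, while fitting comfortably in 128 GB. The estimate excludes model weights, the vision encoder, and compute buffers, and a quantized KV cache would shrink it.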
The Blackwell 6000 wins every raw metric it was measured on (2,613 tok/s prefill, 466 tok/s decode at c=4) and, despite the highest sticker price, posts the best $/tok/s on both prefill and decode. It is the pick where image-prefill latency and rack density dominate, though it was not included in the np scale-out test. The single-slot, 165 W Blackwell 4500 (32 GB, ~$4,000, estimated) is the cheaper Blackwell-class alternative: roughly half the 6000's prefill and decode for under half the price, with similar $/tok/s. That makes it attractive for dense, power-constrained server racks.
GPUS data block to re-run all figures. Sources: VideoCardz, Tom's Hardware, Thunder Compute, NVIDIA, APXML / localai.computer, Unsloth.