GPU Performance & Cost Matrix

Hardware selection for DCPF: serving Qwen3-VL-8B-Instruct (GGUF Q4_K_M)

Prepared Jun 2026

Key finding — decode is memory-bandwidth bound. Single-request (c=1) decode throughput is near-identical across T4 (47), DGX Spark (45) and L4 (40 tok/s) because all three share roughly the same effective memory bandwidth. Raw compute (prefill) differs hugely — and matters here, since this is a vision-language model whose image prefill is compute-bound — but it doesn't help token generation. The parallelization test below confirms the fix: scaling out concurrent slots (np) recovers decode throughput on the bandwidth-bound cards — up to ~10× on the T4/L4 and ~14× on the DGX Spark.

Best $/prefill tok/s: $3.25 (Blackwell 6000)

Best $/decode tok/s (c=4): $18.2 (Blackwell 6000)

Lowest cost to deploy (Q4_K_M, 1 card): $2,000 (T4)

Best scaled decode at np32: 611 tok/s (DGX Spark)

The Matrix — performance vs. cards-to-fit vs. price

Model build: Q4_K_M (actual, ~7 GB) or FP16 unquantized (~18 GB).

GPU	VRAM (GB)	Price / card (USD)	Cards to fit	Deploy cost (USD)	Prefill (tok/s)	Decode c=1 (tok/s)	Decode c=4 (tok/s)	Decode np32 (tok/s)	$/prefill tok/s	$/decode tok/s (c=4)

Blackwell 4500 figures are estimated, not measured: prefill scaled from the Blackwell 6000 by CUDA-core ratio (10,496 / 24,064) and decode by memory-bandwidth ratio (800 / 1,792 GB/s), the two factors each phase is bound by. Treat as a planning estimate — its lower 165 W TDP means real prefill may land somewhat below the compute-scaled figure. It was not run in the np scale-out test.
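The extrapolation described above can be reproduced in a few lines. This is a minimal sketch that uses only the Blackwell 6000 figures and spec ratios quoted in this report; applying the bandwidth ratio to the c=4 decode figure is an assumption about how the estimate was formed, and the variable names are illustrative.

```python
# Measured Blackwell 6000 anchors quoted in this report.
B6000_PREFILL_TOK_S = 2613    # prefill, tok/s
B6000_DECODE_C4_TOK_S = 466   # decode at c=4, tok/s

# Spec ratios of the 4500 relative to the 6000 (from the note above).
CUDA_CORE_RATIO = 10_496 / 24_064   # prefill is treated as compute-bound
BANDWIDTH_RATIO = 800 / 1_792       # decode is treated as bandwidth-bound


def estimate_blackwell_4500():
    """Scale each phase by the resource it is assumed to be bound by."""
    prefill = B6000_PREFILL_TOK_S * CUDA_CORE_RATIO
    decode_c4 = B6000_DECODE_C4_TOK_S * BANDWIDTH_RATIO
    return prefill, decode_c4


prefill_est, decode_c4_est = estimate_blackwell_4500()
print(f"prefill ~{prefill_est:.0f} tok/s, decode (c=4) ~{decode_c4_est:.0f} tok/s")
# prints: prefill ~1140 tok/s, decode (c=4) ~208 tok/s
```

Both ratios come out near 0.44, which is why the 4500 lands a little under half of the 6000 on each phase. The scaling ignores clocks and the 165 W power limit, so the prefill number is best read as an upper estimate.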

Cost vs. performance

Deploy cost vs. decode throughput (c=4)

Lower-right is better: cheaper to deploy, faster decode. Bubble size = prefill speed.

Cost efficiency — $ per tok/s (lower is better)

Per-card price ÷ throughput. Prefill vs. decode (c=4).
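As a worked example of the metric, the sketch below recomputes $ per tok/s from the prices and throughputs that are quoted elsewhere in this report. Only figures stated in the text are included; the dictionary layout and function name are hypothetical, not the page's actual data block.

```python
def dollars_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Per-card price divided by throughput; lower is better."""
    if tok_s <= 0:
        raise ValueError("throughput must be positive")
    return price_usd / tok_s


# Figures quoted in this report (price in USD, throughput in tok/s).
cards = {
    "T4": {"price": 2000, "prefill": 205},
    "L4": {"price": 2500, "prefill": 471},
    "Blackwell 6000": {"price": 8500, "prefill": 2613, "decode_c4": 466},
}

for name, card in cards.items():
    line = f"{name}: ${dollars_per_tok_s(card['price'], card['prefill']):.2f} per prefill tok/s"
    if "decode_c4" in card:
        line += f", ${dollars_per_tok_s(card['price'], card['decode_c4']):.1f} per decode tok/s (c=4)"
    print(line)
```

For the Blackwell 6000 this gives 8,500 / 2,613 ≈ $3.25 and 8,500 / 466 ≈ $18.2, matching the headline figures. The T4 and L4 come out at roughly $9.76 and $5.31 per prefill tok/s.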

Raw throughput

Prefill — compute-bound (tok/s)

Driven by raw compute. Matters most for long prompts / image inputs.

Decode — c=1 vs c=4 (tok/s)

c=1 nearly flat across T4/L4/DGX → bandwidth wall. Concurrency starts to pay off at c=4.
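A back-of-envelope check makes the bandwidth wall concrete. At c=1 every generated token has to stream the full weight set from memory, so bandwidth divided by weight size gives a rough ceiling on decode speed. The sketch below assumes the ~5 GB weight figure from this report and nominal vendor bandwidth specs (T4 320 GB/s, L4 300 GB/s, DGX Spark 273 GB/s); those bandwidth numbers are not from the DCPF test files and are included as assumptions.

```python
WEIGHTS_GB = 5.0  # ~5 GB of 4-bit weights, per the report's footprint estimate

# Nominal vendor memory bandwidth in GB/s (assumption, not measured here).
NOMINAL_BW_GB_S = {"T4": 320, "L4": 300, "DGX Spark": 273}

# Measured single-request decode throughput from this report, tok/s.
MEASURED_C1_TOK_S = {"T4": 47, "L4": 40, "DGX Spark": 45}


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound: one full pass over the weights per generated token."""
    return bandwidth_gb_s / weights_gb


for gpu, bw in NOMINAL_BW_GB_S.items():
    ceiling = decode_ceiling_tok_s(bw, WEIGHTS_GB)
    measured = MEASURED_C1_TOK_S[gpu]
    print(f"{gpu:10s} ceiling ~{ceiling:.0f} tok/s, "
          f"measured {measured} tok/s ({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions the three ceilings sit between about 55 and 64 tok/s, and the measured 40 to 47 tok/s fall at roughly two thirds to four fifths of them. The estimate ignores KV-cache reads and kernel overhead, so it is a sanity check rather than a prediction.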

Parallelization — scaling out concurrent slots (np)

Decode throughput vs. number of parallel slots

Blackwell 6000 was not run in this test. The bandwidth-bound cards recover throughput by serving many requests at once — the DGX Spark scales best, overtaking both T4 and L4 by np8.
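To put the scale-out numbers in per-request terms, the sketch below derives the aggregate speedup and the average per-slot rate at np32 from figures quoted in this report. It uses the c=1 throughput as the baseline, which is an assumption: the parallelization test may have used its own single-slot baseline, so the ratios need not match the rounded multipliers in the key finding. The L4 value is the approximate ~490 tok/s given in the recommendation guide.

```python
SLOTS = 32

# tok/s figures quoted in this report.
DECODE_C1 = {"L4": 40, "DGX Spark": 45}
DECODE_NP32 = {"L4": 490, "DGX Spark": 611}  # L4 value is approximate (~490)


def scale_out_summary(c1: float, np32: float, slots: int = SLOTS):
    """Return (aggregate speedup vs c=1, average tok/s seen by each slot)."""
    return np32 / c1, np32 / slots


for gpu in DECODE_C1:
    speedup, per_slot = scale_out_summary(DECODE_C1[gpu], DECODE_NP32[gpu])
    print(f"{gpu}: ~{speedup:.1f}x aggregate, ~{per_slot:.0f} tok/s per slot at np32")
```

The aggregate figure rises by more than an order of magnitude, but each of the 32 requests decodes at well under half the c=1 rate. Scale-out therefore buys total throughput, not per-request latency.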

Series plotted: T4, L4, DGX Spark.

Recommendation guide for DCPF

Cheapest entry — mind the prefill

At Q4_K_M the model fits on one 16 GB T4, giving it the lowest deploy cost ($2,000) and a strong $/decode tok/s. Caveat: its prefill is the weakest measured (205 tok/s), which matters for this image-heavy VL workload. It remains fine for text-light, latency-tolerant serving.

Best all-round value

The L4 ($2,500) more than doubles the T4's prefill (471 vs. 205 tok/s) for ~$500 more, fits on one card, and scales to ~490 tok/s decode at np32. It is the safe default for mixed text+image workloads at small scale.

Best concurrent throughput / headroom

The DGX Spark scales furthest in the parallelization test (611 tok/s at np32) and its 128 GB unified memory leaves huge KV-cache headroom for high concurrency or larger models later.

Top performance & best $-efficiency at the high end — Blackwell Pro 6000

Leads every metric it was measured on (2,613 tok/s prefill, 466 tok/s decode at c=4) and, despite the highest sticker price, posts the best $/tok/s on both prefill and decode. It is the pick where image-prefill latency and rack density dominate. Note that it was not included in the np scale-out test.

The single-slot, 165 W Blackwell 4500 (32 GB, ~$4,000, performance estimated) is the cheaper Blackwell-class alternative: a little under half the 6000's prefill and decode for under half the price, with similar $/tok/s. That makes it attractive for dense, power-constrained server racks.

Data & assumptions.

Model: hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:Q4_K_M (4-bit vision-language).

Performance: figures from DCPF test files (Initial Performance Analysis.txt, Parallelization Tests.txt).

Cards to fit: assumes a Q4_K_M runtime footprint of ~7 GB (~5 GB 4-bit weights + vision projector + KV cache), which fits on one card for every option; the FP16 toggle (~18 GB) is shown only for comparison. The DGX Spark is a complete desktop AI system rather than an add-in card; treat its "cards to fit" as "units".

VRAM: T4 16 GB, L4 24 GB, DGX Spark 128 GB unified, Blackwell Pro 6000 96 GB, Blackwell 4500 Server 32 GB.

Prices: indicative purchase prices (USD, Jun 2026): T4 ~$2,000, L4 ~$2,500 (MSRP), DGX Spark $4,699 (full system, not a PCIe card), Blackwell Pro 6000 ~$8,500 street (NVIDIA list $13,250), Blackwell 4500 Server ~$4,000 (converted from €3,670+ channel pricing; no US MSRP announced). $/tok/s uses per-card price. Prices move; overwrite the GPUS data block to re-run all figures.

Blackwell 4500: performance is extrapolated, not measured. Prefill is scaled by CUDA-core ratio and decode by memory-bandwidth ratio, anchored on the Blackwell 6000 (24,064 cores / 1,792 GB/s vs. the 4500's 10,496 cores / 800 GB/s).

Sources: VideoCardz, Tom's Hardware, Thunder Compute, NVIDIA, APXML / localai.computer, Unsloth.
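The cards-to-fit and deploy-cost columns follow from the footprint and VRAM figures above. The sketch below assumes the simplest rule, footprint divided by per-card VRAM and rounded up, with no allowance for multi-GPU overhead; the report does not state the exact formula it uses, so the function names and the rule itself are illustrative.

```python
import math

# Footprints, VRAM, and prices as listed in the data notes above.
FOOTPRINT_GB = {"Q4_K_M": 7, "FP16": 18}
VRAM_GB = {"T4": 16, "L4": 24, "DGX Spark": 128,
           "Blackwell Pro 6000": 96, "Blackwell 4500 Server": 32}
PRICE_USD = {"T4": 2000, "L4": 2500, "DGX Spark": 4699,
             "Blackwell Pro 6000": 8500, "Blackwell 4500 Server": 4000}


def cards_to_fit(footprint_gb: float, vram_gb: float) -> int:
    """Assumed rule: smallest card count whose combined VRAM covers the footprint."""
    return math.ceil(footprint_gb / vram_gb)


def deploy_cost(gpu: str, build: str) -> int:
    """Per-card price times the number of cards needed for the chosen build."""
    return cards_to_fit(FOOTPRINT_GB[build], VRAM_GB[gpu]) * PRICE_USD[gpu]


for build in FOOTPRINT_GB:
    for gpu in VRAM_GB:
        n = cards_to_fit(FOOTPRINT_GB[build], VRAM_GB[gpu])
        print(f"{build:7s} {gpu:22s} cards={n} deploy=${deploy_cost(gpu, build):,}")
```

Under this rule every option needs one card at Q4_K_M. Only the 16 GB T4 changes under the FP16 toggle, where it needs two cards and its deploy cost doubles to $4,000.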