sparkinfer · SN74 — Blackwell inference frontier

Target GPUs

NVIDIA Blackwell, on-device — consumer sm_120 + edge sm_121 (not datacenter sm_100)

RTX Spark GB10 · sm_121

Personal AI PC · flagship target

"A new class of processor for the age of personal agents." Run agents locally and privately on a Windows PC.

128 GB unified · 1 PFLOP FP4 · 120B-param LLMs

NVIDIA × Microsoft ↗

Jetson Thor · sm_121

Physical AI & robotics

"The ultimate platform for physical AI and robotics" — agentic reasoning at the edge, in humanoid robots.

2,070 FP4 TFLOPS · 128 GB LPDDR5X

Jetson Thor ↗

RTX 5090 · sm_120

Consumer Blackwell · current dev GPU

Desktop flagship — big-model local inference on a 32 GB card. Where the frontier is measured today.

32 GB GDDR7 · ~1.79 TB/s

RTX 5090 ↗
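The bandwidth figure above bounds decode speed, since single-stream decode is typically memory-bound. A minimal sketch of that ceiling follows. It assumes every active weight is read once per token and ignores KV-cache reads, attention, and routing overhead. The 3e9 active-parameter count is read off the "A3B" model name, and the 4.5 bits per weight is an assumed average for a 4-bit K-quant, not a measured value. The function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_bytes_s: float,
                         active_params: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tokens/s when weight reads dominate.

    Assumes each active weight is streamed from memory exactly once
    per generated token; real kernels land below this ceiling.
    """
    if bandwidth_bytes_s <= 0 or active_params <= 0 or bits_per_weight <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_token = active_params * bits_per_weight / 8.0
    return bandwidth_bytes_s / bytes_per_token


# ~1.79 TB/s card, ~3e9 active params, assumed 4.5 bits per weight.
ceiling = decode_ceiling_tok_s(1.79e12, 3e9, 4.5)
print(f"bandwidth-bound decode ceiling: {ceiling:.0f} tok/s")
```

Comparing a measured decode rate against this number shows how much headroom is left before memory bandwidth becomes the hard limit.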

RTX PRO 6000 · sm_120

Workstation flagship

96 GB Blackwell: a full MoE with every expert, plus a large KV cache, stays resident with no paging.

96 GB GDDR7 · ~1.79 TB/s

RTX PRO ↗

Target models

two architecturally distinct MoEs — a win must hold on both, or it's overfitting

Qwen3-MoE · proven

model #1 · today's frontier

Qwen3-30B-A3B / 35B-A3B — 128–256 experts, top-8, GQA head_dim 128, uniform full attention. Runs end-to-end at the current frontier with 100% token-match vs llama.cpp.

Q4_K_M · experts kept quantized · verified on RTX 5090 + PRO 6000
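One way a token-match figure like the one above could be computed is sketched here. It assumes both engines decode the same prompt greedily and emit token-ID sequences. The function name and the rule that a length mismatch counts as a miss are assumptions, not the project's confirmed method.

```python
from typing import Sequence


def token_match_rate(candidate: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of positions where candidate and reference token IDs agree.

    Positions present in only one sequence count as mismatches, so an
    early stop or an extra token lowers the score.
    """
    if not reference:
        raise ValueError("reference sequence must not be empty")
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / max(len(candidate), len(reference))


# Identical sequences score 1.0, i.e. a 100% token match.
print(token_match_rate([11, 42, 7], [11, 42, 7]))
```

Requiring an exact 1.0 is the strict reading of "100% token-match": a single diverging token fails the check.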

Gemma 4 · wiring

model #2 · the generality test

Gemma 4 26B-A4B — 128 experts top-8 + shared, interleaved local-SWA / global attention, head_dim 512, dual RoPE. Deliberately unlike Qwen, so an optimization can't overfit one architecture.

distinct attention + routing → the anti-overfit guard for the basket
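The "must hold on both" rule can be sketched as a worst-case gate over the basket: a change counts as a win only if its slowest per-model speedup still beats the frontier. The model keys, function names, and throughput numbers below are illustrative placeholders, not measured results.

```python
def basket_speedup(candidate_tok_s: dict[str, float],
                   frontier_tok_s: dict[str, float]) -> float:
    """Worst-case decode speedup across every model in the basket."""
    if not frontier_tok_s:
        raise ValueError("basket must contain at least one model")
    if candidate_tok_s.keys() != frontier_tok_s.keys():
        raise ValueError("candidate and frontier must cover the same models")
    return min(candidate_tok_s[m] / frontier_tok_s[m] for m in frontier_tok_s)


def holds_on_basket(candidate_tok_s: dict[str, float],
                    frontier_tok_s: dict[str, float]) -> bool:
    """True only if the candidate is faster on every model."""
    return basket_speedup(candidate_tok_s, frontier_tok_s) > 1.0


# Made-up numbers: faster on one model, slower on the other.
frontier = {"qwen3-moe": 200.0, "gemma-4": 100.0}
overfit = {"qwen3-moe": 240.0, "gemma-4": 90.0}
print(holds_on_basket(overfit, frontier))
```

Taking the minimum rather than the mean is the design choice that matters: a large gain on one architecture cannot mask a regression on the other.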

Optimization journey

vs llama.cpp

same GPU · same GGUF

Emission weights

path-based · sums to 1.0

Auto-eval labels

deterministic — from verified speedup over the frontier
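A deterministic labeler of this kind can be a pure function of the verified measurements, so the same inputs always yield the same label. The sketch below is hypothetical throughout: the label names, the 2% noise band, and the correctness-first ordering are all assumptions made for illustration.

```python
def auto_label(speedup: float, tokens_match: bool, noise_band: float = 0.02) -> str:
    """Map a verified speedup over the frontier to a label.

    speedup is candidate tok/s divided by frontier tok/s on the same
    GPU and the same GGUF. All label names here are hypothetical.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not tokens_match:
        return "incorrect"          # output diverged; speed is irrelevant
    if speedup >= 1.0 + noise_band:
        return "new-frontier"       # clearly faster than the frontier
    if speedup >= 1.0 - noise_band:
        return "neutral"            # within the assumed noise band
    return "regression"             # clearly slower


print(auto_label(1.05, tokens_match=True))
```

Because the function has no hidden state or randomness, re-running the evaluation on the same verified numbers cannot change a label.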

Evaluated PRs

bot labels + comments · never auto-merges

PR	area	label	decode	vs frontier