Agentic AI · on-device · all Blackwell

Agentic AI is moving on-device.
sparkinfer makes every Blackwell GPU fast for it.

The inference runtime for the age of personal agents — reproducible, source-verified MoE/LLM decode across consumer & edge Blackwell: from RTX Spark on your desk to Jetson Thor in the robot.

NVIDIA RTX Spark — personal AI computer for the age of agents

Target GPUs

NVIDIA Blackwell, on-device — consumer sm_120 + edge sm_121 (not datacenter sm_100)
NVIDIA RTX Spark (GB10) · sm_121

Personal AI PC · flagship target

"A new class of processor for the age of personal agents." Run agents locally and privately on a Windows PC.

128 GB unified · 1 PFLOP FP4 · 120B-param LLMs
NVIDIA × Microsoft ↗

NVIDIA Jetson Thor · sm_121

Physical AI & robotics

"The ultimate platform for physical AI and robotics" — agentic reasoning at the edge, in humanoid robots.

2,070 FP4 TFLOPS · 128 GB LPDDR5X
Jetson Thor ↗

NVIDIA GeForce RTX 5090 · sm_120

Consumer Blackwell · current dev GPU

Desktop flagship — big-model local inference on a 32 GB card. Where the frontier is measured today.

32 GB GDDR7 · ~1.79 TB/s
RTX 5090 ↗

NVIDIA RTX PRO 6000 Blackwell · sm_120

Workstation flagship

96 GB Blackwell: a full MoE with all of its experts, plus a large KV cache, stays resident on the card with no paging.

96 GB GDDR7 · ~1.79 TB/s
RTX PRO ↗

Target models

two architecturally distinct MoEs — a win must hold on both, or it's overfitting
Qwen3-MoE · proven

model #1 · today's frontier

Qwen3-30B-A3B / 35B-A3B — 128–256 experts, top-8, GQA head_dim 128, uniform full attention. Runs end-to-end at the current frontier with 100% token-match vs llama.cpp.

Q4_K_M · experts kept quantized · verified on RTX 5090 + PRO 6000
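
The "100% token-match vs llama.cpp" claim can be read as a position-by-position comparison of greedy-decoded token ids. The sketch below shows one way to compute such a rate; the function name and the choice to count length differences as mismatches are assumptions for illustration, not sparkinfer's actual harness.

```python
from typing import Sequence


def token_match_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of positions where `candidate` emits the same token id as `reference`.

    Hypothetical helper: the real verification harness may compare differently.
    """
    if not reference:
        raise ValueError("reference sequence is empty")
    # Count positions that agree over the overlapping prefix.
    matches = sum(r == c for r, c in zip(reference, candidate))
    # Dividing by the longer length makes truncated or over-long output count as mismatches.
    return matches / max(len(reference), len(candidate))


reference_ids = [101, 7, 42, 9]  # e.g. ids from the reference runtime (made-up values)
print(token_match_rate(reference_ids, [101, 7, 42, 9]))  # 1.0
print(token_match_rate(reference_ids, [101, 7, 13, 9]))  # 0.75
```

A rate of exactly 1.0 corresponds to a 100% token match; anything lower means at least one position diverged.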
Gemma 4 · wiring

model #2 · the generality test

Gemma 4 26B-A4B — 128 experts top-8 + shared, interleaved local-SWA / global attention, head_dim 512, dual RoPE. Deliberately unlike Qwen, so an optimization can't overfit one architecture.

distinct attention + routing → the anti-overfit guard for the basket
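
The rule that a win must hold on both models can be sketched as a simple gate over per-model speedups. The model keys, the `min_gain` parameter, and the function name here are hypothetical; the page only states the principle, not how it is computed.

```python
from typing import Mapping


def holds_on_basket(speedups: Mapping[str, float], min_gain: float = 1.0) -> bool:
    """Return True only if every model in the basket beats `min_gain`.

    `speedups` maps a model name to (new decode speed / baseline decode speed).
    A change that helps one model but not the other is treated as overfitting.
    """
    if not speedups:
        raise ValueError("basket is empty")
    return all(s > min_gain for s in speedups.values())


# Made-up numbers for illustration only.
print(holds_on_basket({"qwen3-moe": 1.12, "gemma-4": 1.04}))  # True
print(holds_on_basket({"qwen3-moe": 1.30, "gemma-4": 0.97}))  # False
```

Using `all` rather than an average is the point of the guard: a large gain on one architecture cannot mask a regression on the other.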

Optimization journey

vs llama.cpp

same GPU · same GGUF

Emission weights

path-based · sums to 1.0
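
The only property stated for the emission weights is that they are keyed by path and sum to 1.0. A minimal sketch of that invariant follows, assuming raw per-path scores are normalized; the path names and raw values are invented for the example and are not taken from the project.

```python
from typing import Dict, Mapping


def normalize_weights(raw: Mapping[str, float]) -> Dict[str, float]:
    """Scale non-negative per-path scores so that they sum to 1.0."""
    if any(w < 0 for w in raw.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {path: w / total for path, w in raw.items()}


# Hypothetical path prefixes and scores.
weights = normalize_weights({"kernels/": 3.0, "runtime/": 1.0})
assert abs(sum(weights.values()) - 1.0) < 1e-12
print(weights)
```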

Auto-eval labels

deterministic — from verified speedup over the frontier
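
A deterministic label can be produced by a pure function of the verified speedup, so the same measurement always yields the same label. The thresholds and label names below are placeholders chosen for illustration; the page does not state the actual cut-offs.

```python
def label_for(speedup: float, band: float = 0.02) -> str:
    """Map a verified speedup over the frontier to a label.

    `speedup` is (candidate decode speed / frontier decode speed).
    `band` is a hypothetical noise margin around 1.0.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if speedup >= 1.0 + band:
        return "improves-frontier"  # hypothetical label name
    if speedup <= 1.0 - band:
        return "regression"         # hypothetical label name
    return "neutral"                # hypothetical label name


for s in (1.10, 1.00, 0.90):
    print(s, label_for(s))
```

Because the function has no hidden state or randomness, re-running the evaluation on the same verified number cannot change the label.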

Evaluated PRs

bot labels + comments · never auto-merges
PR · area · label · decode · vs frontier