Lucebox Hub: Hand-Tuned LLM Inference Optimization for Consumer Hardware
By
GreenGames
Summary
Lucebox Hub is an open-source project that hand-tunes LLM inference for specific consumer hardware. Rather than waiting for better chips, it rewrites the inference software from scratch for one chip at a time. It currently includes two main projects. The first is a megakernel that runs all 24 layers of Qwen3.5 0.8B in a single CUDA dispatch on an RTX 3090, reaching 1.87 tokens per joule (tok/J) on a GPU released in 2020, close to the 1.9 tok/J cited for Apple's latest silicon. The second is a speculative decoding implementation for Llama 3.1 8B on an RTX 4090 that reaches 2.5 tok/J and runs 2.5x faster than baseline. The approach combines custom kernels, speculative decoding, and quantization, each tailored to the target hardware, on the premise that software optimization can extract far more performance from existing consumer hardware.
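The efficiency figures above are stated in tokens per joule. As a rough sketch of how such a figure is usually derived (decode throughput divided by average power draw, since one watt is one joule per second), the helper below may be useful. The function names and the 400 tok/s at 200 W input are hypothetical illustrations, not measurements from the project; only the 1.87 and 1.9 tok/J values come from the summary.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: tokens generated per joule consumed.

    One watt is one joule per second, so tok/s divided by W gives tok/J.
    """
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return tokens_per_second / watts


def relative_gap(value: float, reference: float) -> float:
    """Fraction by which `value` falls short of `reference`."""
    return (reference - value) / reference


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the arithmetic.
    print(tokens_per_joule(tokens_per_second=400.0, watts=200.0))

    # Figures quoted in the summary: 1.87 tok/J versus 1.9 tok/J.
    print(f"{relative_gap(1.87, 1.9):.1%}")
```

How the project measures power (whole board, GPU only, or wall socket) is not stated here, so any comparison across devices should be read with that in mind.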
Key quotes
Open LLM inference, rewritten by hand for one specific chip at a time.
We don't wait for better silicon. We rewrite the software.
All 24 layers of Qwen 3.5-0.8B in a single CUDA dispatch, 1.87 tok/J on a 2020 GPU, matching Apple's latest silicon at 1.9 tok/J.
Speculative decoding for Llama 3.1 8B on RTX 4090: 2.5 tok/J, 2.5x faster than baseline.
Kernels, speculative decoding, and quantization, tailored per target.
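For readers unfamiliar with speculative decoding, the toy sketch below shows the general idea in its simplest greedy form: a cheap draft model proposes a few tokens, and the target model keeps the longest prefix it agrees with, plus its own correction at the first mismatch. This is not Lucebox's implementation. The names `speculative_decode`, `target`, and `draft` are hypothetical, and the "models" are plain Python functions standing in for real networks.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding with up to k drafted tokens per round."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target model checks each proposed position. A real system
        #    would score all positions in one batched forward pass; this
        #    loop calls the target once per position for clarity.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # always keep the target's choice
            if expected != tok:
                break  # first disagreement: discard the rest of the draft

        out.extend(accepted)
    return out


if __name__ == "__main__":
    target = lambda ctx: (ctx[-1] + 1) % 10           # toy target: count up
    draft = lambda ctx: (ctx[-1] + 1) % 10 if len(ctx) % 3 else 0  # sometimes wrong
    print(speculative_decode(target, draft, prompt=[1], max_new=8))
```

Because every emitted token is the target's own greedy choice, the output matches plain greedy decoding with the target model. Any speedup comes from verifying several drafted tokens per target pass, which depends on how often the draft agrees.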
