Lucebox Hub: Hand-Tuned LLM Inference Optimization for Consumer Hardware

By GreenGames

1mo ago · 4 min read · en · Code

Summary

Lucebox Hub is an open-source project that hand-tunes LLM inference for specific consumer hardware. Rather than waiting for better chips, it rewrites the inference software by hand for one target chip at a time. Two projects are highlighted. The first is a megakernel that runs all 24 layers of Qwen 3.5-0.8B in a single CUDA dispatch on an RTX 3090, a GPU from 2020; it reaches 1.87 tokens per joule (tok/J), close to the 1.9 tok/J cited for Apple's latest silicon. The second is a speculative decoding implementation for Llama 3.1 8B on an RTX 4090 that reaches 2.5 tok/J and runs 2.5x faster than baseline. Both combine custom kernels, speculative decoding, and quantization tailored to each target, on the premise that software optimization can extract maximum performance from existing consumer hardware.
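The tokens-per-joule figures above are an energy-efficiency metric: tokens generated divided by energy consumed, where energy is average power multiplied by elapsed time. A minimal sketch of that arithmetic follows. The function name and the sample numbers are illustrative assumptions, not measurements from the project, and the summary does not say how Lucebox measures power.

```python
def tokens_per_joule(tokens_generated: int, elapsed_s: float, avg_power_w: float) -> float:
    """Return tokens produced per joule of energy consumed.

    Hypothetical helper for illustration; not part of Lucebox Hub.
    """
    if elapsed_s <= 0 or avg_power_w <= 0:
        raise ValueError("elapsed_s and avg_power_w must be positive")
    # 1 watt is 1 joule per second, so energy = power * time.
    energy_j = avg_power_w * elapsed_s
    return tokens_generated / energy_j


# Made-up run: 1000 tokens in 10 s at an average draw of 50 W.
# Energy is 50 W * 10 s = 500 J, so efficiency is 1000 / 500 tok/J.
example = tokens_per_joule(1000, 10.0, 50.0)
```

Equivalently, tok/J is throughput in tokens per second divided by power in watts, so a chip can improve the metric either by generating faster or by drawing less power.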

Key quotes

Open LLM inference, rewritten by hand for one specific chip at a time.
We don't wait for better silicon. We rewrite the software.
All 24 layers of Qwen 3.5-0.8B in a single CUDA dispatch, 1.87 tok/J on a 2020 GPU, matching Apple's latest silicon at 1.9 tok/J.
Speculative decoding for Llama 3.1 8B on RTX 4090: 2.5 tok/J, 2.5x faster than baseline.
Kernels, speculative decoding, and quantization, tailored per target.
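For readers unfamiliar with speculative decoding, the general idea can be sketched as follows: a cheap draft model proposes several tokens, and the larger target model checks them, keeping the longest prefix it agrees with. This is a generic greedy variant using toy stand-in models, not Lucebox's implementation; the `NextToken` interface, both toy models, and the parameter `k` are assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches greedy_decode on the target."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the proposal. A real system scores all k
        #    positions in one batched forward pass; this sketch calls the
        #    target once per position for clarity.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
        else:
            # Every draft token was accepted, so the target adds one more.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal]


# Toy models: the target counts upward modulo 10; the draft agrees except
# when the last token is 2, 5, or 8, where it guesses 0.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10


baseline = greedy_decode(toy_target, [0], 8)
speculative = speculative_decode(toy_target, toy_draft, [0], 8)
```

Because every emitted token is either confirmed or supplied by the target, the output is identical to plain greedy decoding. Any speedup comes from verifying several draft tokens per target pass, so it depends on how often the draft agrees with the target.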

Snippet from the RSS feed

Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware. - Luce-Org/lucebox-hub