Reverse-Engineering the RK3588 NPU to Run Vision Transformers 15x Faster
By rcarmo
Properly proved. Has structure, has flavour, has a point.
Summary
The article describes reverse-engineering the Rockchip RK3588 NPU to work around hardware limitations that prevent it from running modern Vision Transformers such as the SigLIP encoder used by SmolVLM. Although the chip is advertised with 6 TOPS of NPU performance, the standard computer vision SDK (rknn-toolkit2) fails on Vision Transformers because their large attention matrices run into the NPU's memory constraints. The author identified the hardware limits, defeated compiler optimizations, and built a custom sharding runtime that runs SmolVLM 15x faster.
Key quotes
The standard Computer Vision SDK (rknn-toolkit2) is optimized for older, predictable CNNs (like ResNet). When I fed it the SigLIP Vision Transformer used by SmolVLM, the driver choked.
Even though the model is 'smol,' the massive Attention matrices it generates triggered cry
Reverse-engineering the Rockchip RK3588 NPU to run SmolVLM 15x faster by discovering hardware limits, defeating compiler optimizations, and building a custom sharding runtime
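To make the sharding idea concrete, here is a minimal pure-Python sketch of attention computed in query-row shards, so that only a small block of the score matrix exists at any one time. It is not the author's runtime and does not use the RKNN API: the function names, the `max_rows` parameter, and the choice to shard along query rows are all assumptions made for illustration.

```python
import math


def attention_rows(q_rows, k, v):
    """Scaled dot-product attention for one block of query rows.

    The score block built here has len(q_rows) x len(k) entries, which is
    the part that grows large for Vision Transformers.
    """
    scale = 1.0 / math.sqrt(len(k[0]))
    # Score block: one row of scores per query in this shard.
    scores = [
        [scale * sum(a * b for a, b in zip(q, key)) for key in k]
        for q in q_rows
    ]
    out = []
    for row in scores:
        # Numerically stable softmax over the keys.
        peak = max(row)
        weights = [math.exp(s - peak) for s in row]
        total = sum(weights)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * val[j] for w, val in zip(weights, v)) / total
            for j in range(len(v[0]))
        ])
    return out


def sharded_attention(q, k, v, max_rows):
    """Hypothetical sharding: process at most max_rows query rows per call.

    Each call only materialises a max_rows x len(k) score block, and the
    per-shard outputs are concatenated in order.
    """
    out = []
    for start in range(0, len(q), max_rows):
        out.extend(attention_rows(q[start:start + max_rows], k, v))
    return out
```

Because softmax is applied independently to each query row, splitting along query rows leaves the result unchanged; a real runtime would choose the shard size from the measured hardware limit rather than from an arbitrary `max_rows`.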
