Taalas Develops ASIC Chip Running Llama 3.1 at 17,000 Tokens Per Second
By
beAroundHere
Summary
Taalas, a startup, has developed an ASIC chip that runs the Llama 3.1 8B model (3/6-bit quantized) at 17,000 tokens per second, which is roughly equivalent to generating 30 A4 pages of text per second. The company claims the chip offers 10x lower cost of ownership, 10x lower power consumption, and 10x faster inference than state-of-the-art GPU-based inference systems. The key innovation is that the model's weights are "hardwired" or "printed" directly onto the chip, creating specialized hardware built for this one model.
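The pages-per-second figure can be sanity-checked with a rough conversion. The sketch below is only an illustration: the 0.75 words-per-token ratio is a common rule of thumb for English text, and the 425 words-per-page value is an assumption (real A4 pages vary with font and spacing). Neither number comes from Taalas.

```python
def pages_per_second(tokens_per_second, words_per_token=0.75, words_per_page=425):
    """Convert a token throughput into an approximate A4-page throughput.

    words_per_token and words_per_page are assumed values for illustration,
    not figures published by Taalas.
    """
    words_per_second = tokens_per_second * words_per_token
    return words_per_second / words_per_page


# 17,000 tokens/s is the throughput quoted in the article.
rate = pages_per_second(17_000)
print(f"{rate:.1f} A4 pages per second")
```

Under these assumptions the result lands at about 30 pages per second, in line with the article's claim. A denser page (say 500 words) would lower the estimate to roughly 25.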
Key quotes
A startup called Taalas, recently released an ASIC chip running Llama 3.1 8B (3/6 bit quant) at an inference rate of 17,000 tokens per seconds.
That's like writing around 30 A4 sized pages in one second.
They claim it's 10x cheaper in ownership cost than GPU based inference systems and is 10x less electricity hog.
And yeah, about 10x faster than state of art inference.
I tried to read through their blog and they've literally 'hardwired' the model's weights on chip.