
Taalas Develops ASIC Chip Running Llama 3.1 at 17,000 Tokens Per Second

By beAroundHere

3mo ago · 4 min read · en · News

Summary

Taalas, a startup, has developed an ASIC that runs the Llama 3.1 8B model at 17,000 tokens per second, roughly 30 A4 pages of text per second. The company claims that, compared with state-of-the-art GPU-based inference systems, its chip offers 10x lower cost of ownership, 10x lower power consumption, and 10x faster inference. The key design choice is that the model's weights are "hardwired" (or "printed") directly onto the chip, producing hardware specialized for this one model.
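The "30 A4 pages per second" figure can be sanity-checked with a little arithmetic. The sketch below is only a rough estimate: the words-per-token ratio (0.75) and the words-per-page values are assumptions chosen for illustration and do not come from Taalas or the article. Only the 17,000 tokens per second figure is taken from the source.

```python
# Rough check of the "about 30 A4 pages per second" claim.
# Only TOKENS_PER_SECOND comes from the article; the conversion
# ratios below are illustrative assumptions.

TOKENS_PER_SECOND = 17_000  # throughput reported in the article


def pages_per_second(tokens_per_second, words_per_token=0.75, words_per_page=425):
    """Convert token throughput into pages of text per second.

    words_per_token and words_per_page are assumed averages for
    English prose, not measured values.
    """
    words_per_second = tokens_per_second * words_per_token
    return words_per_second / words_per_page


def seconds_per_token(tokens_per_second):
    """Average time budget per generated token at the given throughput."""
    return 1.0 / tokens_per_second


if __name__ == "__main__":
    # Try a few page densities to see how sensitive the estimate is.
    for words_per_page in (400, 425, 500):
        pages = pages_per_second(TOKENS_PER_SECOND, words_per_page=words_per_page)
        print(f"{words_per_page} words/page: {pages:.1f} pages/s")
    micros = seconds_per_token(TOKENS_PER_SECOND) * 1e6
    print(f"average time per token: {micros:.1f} microseconds")
```

Under these assumptions the estimate lands in the range of roughly 25 to 32 pages per second depending on page density, which is consistent with the article's "around 30".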

Key quotes

A startup called Taalas recently released an ASIC chip running Llama 3.1 8B (3/6 bit quant) at an inference rate of 17,000 tokens per second.
That's like writing around 30 A4 sized pages in one second.
They claim it's 10x cheaper in ownership cost than GPU-based inference systems and uses 10x less electricity.
And yeah, about 10x faster than state-of-the-art inference.
I tried to read through their blog and they've literally 'hardwired' the model's weights on chip.
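To give a sense of how much data "hardwiring" the weights involves, here is a back-of-the-envelope estimate of the weight footprint. It is a sketch under stated assumptions: the parameter count is taken as a nominal 8 billion, and because the quotes above do not say how the 3-bit and 6-bit quantization is split across layers, the two widths are treated only as lower and upper bounds. Quantization scales and other metadata are ignored.

```python
# Back-of-the-envelope weight footprint for a nominal 8B-parameter model.
# N_PARAMS is an assumed round figure; the 3-bit and 6-bit rows are
# bounds only, since the actual mix is not stated in the source.

N_PARAMS = 8e9  # nominal "8B" parameter count (assumption)


def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at a uniform bit width."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    # 16-bit is included purely as a familiar reference point.
    for bits in (3, 6, 16):
        gigabytes = weight_bytes(N_PARAMS, bits) / 1e9
        print(f"{bits}-bit weights: about {gigabytes:.1f} GB")
```

With these assumptions the quantized weights fall somewhere between about 3 GB and 6 GB, compared with about 16 GB at 16 bits per weight.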
Snippet from the RSS feed
or how to generate 17000 tokens per second?
