Kog AI Launches Inference Engine Tech Preview: 3,000 Tokens/s on AMD MI300X GPUs

By Kog Team · 2d ago · 18 min read

Summary

Kog AI launches a tech preview of the Kog Inference Engine (KIE), which runs a 2B model at 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 GPUs (FP16, no speculative decoding). Support for large third-party MoE models is planned next, at similar speeds. The article argues that AI inference on GPUs can reach the speed regime of dedicated hardware, and presents benchmarks and technical details of the engine's architecture and performance.
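
To put the reported throughput in per-token terms, here is a minimal Python sketch that converts output tokens/s per request into average time per output token. The function name `per_token_latency_ms` is hypothetical, and the calculation assumes a steady decode rate, which the source does not state; only the two throughput figures come from the announcement.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time per output token in milliseconds, assuming a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 / tokens_per_second


# Per-request throughput figures quoted in the announcement (FP16, 2B model).
REPORTED_TOKENS_PER_SECOND = {
    "8x AMD MI300X": 3000.0,
    "8x NVIDIA H200": 2100.0,
}

for system, rate in REPORTED_TOKENS_PER_SECOND.items():
    print(f"{system}: {per_token_latency_ms(rate):.3f} ms per output token")

# Ratio of the two reported rates (MI300X relative to H200).
ratio = (
    REPORTED_TOKENS_PER_SECOND["8x AMD MI300X"]
    / REPORTED_TOKENS_PER_SECOND["8x NVIDIA H200"]
)
print(f"MI300X / H200 throughput ratio: {ratio:.2f}")
```

Under that steady-rate assumption, the reported figures correspond to roughly a third of a millisecond per token on the MI300X setup and a little under half a millisecond on the H200 setup.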

Key quotes

we show that AI inference on GPUs can be super-fast, reaching the speed regime of dedicated hardware
3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding)
This preview runs a 2B model, with support for large third-party MoE models coming next at similar speeds
Snippet from the RSS feed
Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). This preview runs a 2B model, with support for large third-party
