Kog AI Launches Inference Engine Tech Preview: 3,000 Tokens/s on AMD MI300X GPUs
By Kog Team
Summary
Kog AI has launched a tech preview of the Kog Inference Engine (KIE). Running a 2B model in FP16 without speculative decoding, the engine reaches 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 output tokens/s per request on 8× NVIDIA H200 GPUs. Kog says support for large third-party MoE models is coming next at similar speeds. The article argues that AI inference on GPUs can reach the speed regime of dedicated hardware, and it presents benchmarks and technical details about the engine's architecture and performance.
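To put the reported throughput in perspective, the short sketch below converts tokens/s per request into an average time per output token and compares the two setups. It uses only the two figures quoted in the summary. The function name and the dictionary are illustrative, and the calculation assumes steady-state decoding, so it ignores time to first token and any other overhead the article may measure separately.

```python
# Rough arithmetic on the reported per-request throughput figures.
# Assumption: steady-state decoding, so average time per output token
# is simply the inverse of tokens per second.

def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average milliseconds per output token at a given throughput."""
    return 1000.0 / tokens_per_second

# Figures as quoted in the summary (2B model, FP16, no speculative decoding).
REPORTED_TOKENS_PER_SECOND = {
    "8x AMD MI300X": 3000.0,
    "8x NVIDIA H200": 2100.0,
}

for setup, tps in REPORTED_TOKENS_PER_SECOND.items():
    print(f"{setup}: {tps:.0f} tok/s -> {per_token_latency_ms(tps):.3f} ms/token")

# Ratio of the two reported throughputs.
speedup = (
    REPORTED_TOKENS_PER_SECOND["8x AMD MI300X"]
    / REPORTED_TOKENS_PER_SECOND["8x NVIDIA H200"]
)
print(f"MI300X / H200 throughput ratio: {speedup:.2f}x")
```

Under that assumption, 3,000 tokens/s works out to roughly 0.33 ms per output token and 2,100 tokens/s to roughly 0.48 ms, a throughput ratio of about 1.43× between the two reported configurations.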
Key quotes
we show that AI inference on GPUs can be super-fast, reaching the speed regime of dedicated hardware
3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding)
This preview runs a 2B model, with support for large third-party MoE models coming next at similar speeds