
Zebra-Llama: Efficient Hybrid Language Models Combining SSMs and Attention Layers

By mirrir

5mo ago · 2 min read · en · Insight

Summary

Researchers propose Zebra-Llama, a family of hybrid language models (1B, 3B, and 8B) that combine State Space Model (SSM) layers with Multi-head Latent Attention (MLA) layers to reach Transformer-level accuracy at near-SSM efficiency. Rather than pre-training from scratch, a refined initialization and post-training pipeline transfers knowledge from pre-trained Transformers, which takes only 7-11B training tokens instead of the trillions needed for pre-training. The KV cache shrinks to 2-3.9% of the original while 97-100% of average zero-shot performance on LM Harness tasks is preserved. Against comparable models such as MambaInLlama and Minitron, Zebra-Llama delivers competitive or superior accuracy with fewer training tokens and smaller teachers, and it reaches 2.6x-3.8x higher throughput than MambaInLlama at context lengths up to 32k.
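
To give a rough sense of what a KV cache at 2.73% of the original means in memory terms, here is a minimal back-of-the-envelope sketch. The baseline configuration (32 layers, 8 KV heads, head dimension 128, fp16, one 32k-token sequence) and the helper name `kv_cache_bytes` are illustrative assumptions, not values taken from the source. Only the 2.73% fraction comes from the reported 8B result, and the sketch simply scales the baseline by it rather than modeling how the hybrid layers store state.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of a standard attention KV cache in bytes.

    Each layer stores one key tensor and one value tensor (the factor of 2),
    each of shape [batch_size, num_kv_heads, seq_len, head_dim].
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# ASSUMPTION: a hypothetical 8B-class Transformer baseline, not from the source.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=32_768)

# Reported in the source: the 8B variant keeps 2.73% of the original KV cache.
reported_fraction_8b = 0.0273
hybrid = baseline * reported_fraction_8b

print(f"baseline KV cache: {baseline / 2**30:.2f} GiB")
print(f"at 2.73% of that:  {hybrid / 2**20:.1f} MiB")
```

Because the cache grows linearly with sequence length and batch size, the same fraction applies at any context length under these assumptions, which is why the saving matters most for long contexts and large batches.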

Key quotes

· 5 pulled
Zebra-Llama dramatically reduces KV cache size - down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively - while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks.
Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory.
Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B).
It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length.
Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher.
Snippet from the RSS feed
With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is pro
