Zebra-Llama: Efficient Hybrid Language Models Combining SSMs and Attention Layers
By
mirrir
Summary
Researchers propose Zebra-Llama, a family of hybrid language models (1B, 3B, and 8B parameters) that combine State Space Model (SSM) layers with Multi-head Latent Attention (MLA) layers to reach Transformer-level accuracy at near-SSM efficiency. Rather than pre-training from scratch, the approach uses a refined initialization and post-training pipeline to transfer knowledge from pre-trained Transformers, which requires only 7-11B training tokens instead of the trillions used in pre-training. Zebra-Llama shrinks the KV cache to 2-3.9% of the original size while preserving 97-100% of average zero-shot performance on LM Harness tasks. It matches or exceeds the accuracy of comparable models such as MambaInLLaMA and Minitron while using fewer training tokens and smaller teachers, and it achieves 2.6x-3.8x higher throughput than MambaInLLaMA at context lengths up to 32k.
Key quotes
Zebra-Llama dramatically reduces KV cache size - down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively - while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks.
Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory.
Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B).
It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length.
Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher.
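To give a feel for what a KV cache at 2.73% of the original means in absolute terms, here is a rough back-of-the-envelope sketch. The baseline configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption modeled on a Llama-3-8B-style Transformer and is not stated in the summary; the function name `kv_cache_bytes` is illustrative only. Only the 2.73% fraction and the 32k context length come from the quotes above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of a standard attention KV cache in bytes.

    Each layer stores one key vector and one value vector (hence the factor 2)
    per token and per KV head.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed baseline (NOT from the summary): a Llama-3-8B-style configuration
# with grouped-query attention and 16-bit cache entries.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=32_768)

# Fraction of the original KV cache quoted for the 8B variant.
fraction_8b = 0.0273
hybrid = baseline * fraction_8b

GIB = 2 ** 30
MIB = 2 ** 20
print(f"baseline KV cache at 32k tokens: {baseline / GIB:.2f} GiB")
print(f"at 2.73% of baseline:            {hybrid / MIB:.1f} MiB")
print(f"reduction factor:                {1 / fraction_8b:.1f}x")
```

Under these assumptions, a single 32k-token sequence needs about 4 GiB of cache in the baseline and roughly 112 MiB at the quoted fraction. The real numbers depend on the actual teacher configuration and cache precision.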
