Zebra-Llama: Efficient Hybrid Language Models Combining SSMs and Attention Layers

mirrir

5mo ago· 2 min readenInsight

75/100

Toasty

Bagelometer↗

Not artisan, but a perfectly fine bagel. Hits the spot.

Score75TypeanalysisSentimentpositive

Summary

Researchers propose Zebra-Llama, a family of hybrid language models (1B, 3B, 8B) that combine State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers to achieve Transformer-level accuracy with near-SSM efficiency. The approach uses a refined initialization and post-training pipeline to transfer knowledge from pre-trained Transformers, requiring only 7-11B training tokens instead of trillions. Zebra-Llama dramatically reduces KV cache size to 2-3.9% of original while preserving 97-100% of performance on LM Harness tasks, outperforms comparable models like MambaInLLaMA and Minitron in accuracy with fewer tokens and smaller teachers, and achieves 2.6x-3.8x higher throughput.

Key quotes

· 5 pulled

Zebra-Llama dramatically reduces KV cache size -down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively-while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks.

Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory.

Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B).

It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length.

Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher.

Snippet from the RSS feed

With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is pro

You might also wanna read

Monostate: All-in-One AI Training Platform for Fine-Tuning LLMs

Monostate is an all-in-one AI training platform that enables users to fine-tune large language models (LLMs) with their own data using vario

Product Hunt·2mo ago

RTP-LLM: Alibaba's High-Performance Inference Engine for Large Language Model Deployment

This paper presents RTP-LLM, a high-performance inference engine developed by Alibaba for industrial-scale deployment of Large Language Mode

arxiv.org·1d ago

Parametric Memory Law: A Quantitative Framework for Understanding LoRA Memory Capacity in LLMs

This research paper introduces the Parametric Memory Law, a quantitative framework for understanding how Low-Rank Adaptation (LoRA) enables

arxiv.org·1d ago

PromptEmbedder: A Dual-LLM Framework for Efficient, Architecture-Agnostic Text Embedding

The article presents PromptEmbedder, a novel dual-LLM framework for efficient and transferable text embedding. It addresses the bottleneck o

arxiv.org·3d ago