Recover-LoRA: A Data-Free Method for Recovering Accuracy in 2-Bit Quantized Language Models

[Submitted on 2 Jun 2026]

Summary

This paper presents Recover-LoRA, a data-free method for recovering accuracy in large language models (LLMs) whose weights have been aggressively quantized to 2-bit precision. The authors propose a selective mixed-precision strategy, the GateUp configuration, in which only the gate and up projection layers of the MLP are quantized to 2-bit (W2) while all other linear layers remain at higher precision. Using roofline analysis across model families (4B-20B) and hardware platforms, they show that a W4/W2-GateUp deployment delivers a 7.5-23.3% throughput improvement over uniform W4. Recover-LoRA then trains low-rank adapters via logit distillation on synthetic data to recover the accuracy lost to 2-bit quantization. In a case study on Qwen3-4B, the method achieves 80-95% accuracy recovery on 9 of 12 benchmarks using only 10k synthetic training samples and no labeled data.
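To make the GateUp configuration concrete, here is a minimal sketch of the bit-width assignment and the weight footprint it implies. The layer names, the toy dimensions (`HIDDEN`, `INTERMEDIATE`), and the helper functions are all hypothetical and are not taken from the paper. The calculation counts raw weight bits only, ignoring quantization scales, embeddings, activations, and the KV cache, so it is not a substitute for the paper's roofline analysis.

```python
# Hypothetical decoder-block shapes (out_features, in_features).
# These toy dimensions are illustrative, not the paper's models.
HIDDEN, INTERMEDIATE = 1024, 4096

LAYER_SHAPES = {
    "self_attn.q_proj": (HIDDEN, HIDDEN),
    "self_attn.k_proj": (HIDDEN, HIDDEN),
    "self_attn.v_proj": (HIDDEN, HIDDEN),
    "self_attn.o_proj": (HIDDEN, HIDDEN),
    "mlp.gate_proj": (INTERMEDIATE, HIDDEN),
    "mlp.up_proj": (INTERMEDIATE, HIDDEN),
    "mlp.down_proj": (HIDDEN, INTERMEDIATE),
}


def gate_up_bits(name, base_bits=4, low_bits=2):
    """GateUp rule: only the MLP gate and up projections get low_bits."""
    return low_bits if name.endswith(("gate_proj", "up_proj")) else base_bits


def weight_bytes(shapes, bit_fn):
    """Total weight storage in bytes, counting raw weight bits only."""
    total_bits = sum(out * inp * bit_fn(name) for name, (out, inp) in shapes.items())
    return total_bits / 8


uniform_w4 = weight_bytes(LAYER_SHAPES, lambda name: 4)
gate_up = weight_bytes(LAYER_SHAPES, gate_up_bits)
saving = 1 - gate_up / uniform_w4

print(f"weight bytes saved vs uniform W4: {saving:.1%}")  # 25.0% for these toy shapes
```

The saving depends entirely on how much of the model's weight mass sits in the gate and up projections, which is one reason the throughput gain reported in the paper varies across model families and hardware.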

Key quotes

Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation.

We propose a selective mixed-precision strategy in which only gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision.

Recover-LoRA achieves 80-95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data.

Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.

We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks.
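The recovery idea in these quotes, low-rank adapters trained by logit distillation, can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: the KL-divergence objective, the temperature parameter, the 2x2 weights, and every function name are hypothetical choices made for illustration. The adapter values are set by hand rather than learned by gradient descent.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student): an assumed form of the logit distillation loss."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w_quant, lora_a, lora_b, x, scale=1.0):
    """y = W_q x + scale * B (A x); only A and B would be trainable."""
    base = matvec(w_quant, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]


# Toy teacher weights and a crudely "quantized" copy (hypothetical values).
w_teacher = [[1.0, 0.5], [-0.5, 2.0]]
w_quant = [[1.0, 0.0], [-0.5, 2.0]]
x = [1.0, 2.0]  # stands in for one synthetic input

teacher_logits = matvec(w_teacher, x)

# Rank-1 adapter with B = 0: the student is just the quantized model.
lora_a = [[0.0, 1.0]]
loss_before = kl_divergence(
    teacher_logits, lora_forward(w_quant, lora_a, [[0.0], [0.0]], x)
)

# Hand-picked B so that B @ A equals the quantization error W - W_q.
loss_after = kl_divergence(
    teacher_logits, lora_forward(w_quant, lora_a, [[0.5], [0.0]], x)
)

print(loss_before > loss_after)
```

No labels appear anywhere in this loss: the only supervision is the teacher's output distribution, which is why unlabeled synthetic inputs are sufficient for this kind of training signal.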

Snippet from the RSS feed

Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deplo
