Recover-LoRA: A Data-Free Method for Recovering Accuracy in 2-Bit Quantized Language Models
[Submitted on 2 Jun 2026]
Summary
This paper presents Recover-LoRA, a method for recovering accuracy in large language models (LLMs) that have been aggressively quantized to 2-bit precision. The authors propose a selective mixed-precision strategy (GateUp configuration) where only gate and up projection layers of the MLP are quantized to 2-bit while other layers remain at higher precision. They demonstrate via roofline analysis across model families (4B-20B) and hardware platforms that a W4/W2-GateUp deployment delivers 7.5-23.3% throughput improvement over uniform W4. Recover-LoRA uses low-rank adapters trained via logit distillation with synthetic data to recover accuracy lost from 2-bit quantization. In a case study on Qwen3-4B, the method achieves 80-95% accuracy recovery on 9 of 12 benchmarks using only 10k synthetic training samples with no labeled data.
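To make the GateUp configuration concrete, here is a minimal sketch of how the per-layer bit widths could be assigned and what average bit width results. The layer names follow a common Hugging Face naming convention and the layer sizes are made-up toy values; both are assumptions for illustration and do not come from the paper.

```python
# Suffixes for the MLP gate and up projections. These names are an
# assumption (common Hugging Face convention), not taken from the paper.
GATE_UP_SUFFIXES = ("mlp.gate_proj", "mlp.up_proj")


def assign_bit_widths(layer_names, low_bits=2, default_bits=4):
    """Map each linear layer name to a weight bit width (GateUp-style)."""
    return {
        name: low_bits if name.endswith(GATE_UP_SUFFIXES) else default_bits
        for name in layer_names
    }


def average_bits(bit_widths, param_counts):
    """Parameter-weighted mean bit width over the listed layers."""
    total_params = sum(param_counts[name] for name in bit_widths)
    total_bits = sum(bit_widths[name] * param_counts[name] for name in bit_widths)
    return total_bits / total_params


# Toy transformer block with made-up sizes, for illustration only.
hidden, inter = 8, 32
param_counts = {
    "layers.0.self_attn.q_proj": hidden * hidden,
    "layers.0.self_attn.k_proj": hidden * hidden,
    "layers.0.self_attn.v_proj": hidden * hidden,
    "layers.0.self_attn.o_proj": hidden * hidden,
    "layers.0.mlp.gate_proj": hidden * inter,
    "layers.0.mlp.up_proj": hidden * inter,
    "layers.0.mlp.down_proj": inter * hidden,
}

bit_widths = assign_bit_widths(param_counts)
mean_bits = average_bits(bit_widths, param_counts)
```

In this toy block the gate and up projections hold half of the weights, so `mean_bits` comes out to 3.0 against 4.0 for uniform W4. The actual savings depend on the real layer shapes of each model.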
Key quotes
Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation.
We propose a selective mixed-precision strategy in which only gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision.
Recover-LoRA achieves 80-95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data.
Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks.
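The distillation-based recovery described in these quotes can be illustrated with a small sketch: a frozen quantized weight plus a low-rank update, scored against the full-precision model's logits. The summary does not state the exact loss, so the KL divergence used here is an assumption (it is a common choice for logit distillation). All matrices are hand-picked toy values, and the adapter is set by hand rather than trained.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position (assumed loss form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_linear(w_quant, lora_a, lora_b, x, scale=1.0):
    """y = W_q x + scale * B (A x): frozen quantized weight plus low-rank update."""
    base = matvec(w_quant, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy setup: 3 output logits, hidden size 2, adapter rank 1.
x = [1.0, -1.0]
w_full = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # full-precision teacher weight
w_quant = [[1.0, 0.0], [0.0, 0.5], [1.0, 1.0]]   # weight with a quantization error
lora_a = [[0.0, 1.0]]                            # rank x hidden
lora_b_init = [[0.0], [0.0], [0.0]]              # B = 0, the usual LoRA starting point
lora_b_fit = [[0.0], [0.5], [0.0]]               # hand-picked to cancel the error

teacher = matvec(w_full, x)
student_init = lora_linear(w_quant, lora_a, lora_b_init, x)
student_fit = lora_linear(w_quant, lora_a, lora_b_fit, x)

loss_before = kl_distillation_loss(teacher, student_init)
loss_after = kl_distillation_loss(teacher, student_fit)
```

Here `loss_before` is positive because the quantized weight shifts one logit, while `loss_after` drops to zero once the low-rank term cancels that shift. In an actual training run the adapter matrices would be fitted by gradient descent over many synthetic samples instead of being set by hand, and only the teacher's logits are needed as targets, which is why no labels are required.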
