Intel's AutoRound: Open-Source Quantization Toolkit for Low-Bit LLM and VLM Inference
By lastdong
Summary
AutoRound is a quantization toolkit from Intel for Large Language Models (LLMs) and Vision-Language Models (VLMs). It uses sign-gradient descent to reach high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning, and it runs on CPU, XPU, and CUDA hardware. The toolkit supports multiple data types and integrates with popular frameworks such as vLLM, SGLang, and Transformers. Recent updates add block-wise FP8 quantization and support for quantizing MTP layers.
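To make the sign-gradient idea concrete, here is a toy sketch in plain Python. It is not AutoRound's implementation: the function names, the symmetric 4-bit scheme, the fixed step size, and the straight-through gradient approximation are all assumptions chosen for illustration. Each weight gets a learnable rounding offset in [-0.5, 0.5], and every update moves the offset by a fixed step in the direction given by the sign of the gradient of the layer's output error.

```python
import random


def quantize(weights, offsets, scale, qmin, qmax):
    """Round w/scale + offset to an integer level, clamp it, and dequantize."""
    out = []
    for w, v in zip(weights, offsets):
        q = max(qmin, min(qmax, round(w / scale + v)))
        out.append(q * scale)
    return out


def layer_loss(weights, deq, inputs):
    """Mean squared error between full-precision and quantized dot products."""
    total = 0.0
    for x in inputs:
        err = sum(xi * (d - w) for xi, d, w in zip(x, deq, weights))
        total += err * err
    return total / len(inputs)


def sign(g):
    """Return -1, 0, or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)


def tune_rounding(weights, inputs, bits=4, iters=200, lr=0.005):
    """Tune per-weight rounding offsets with sign-gradient descent.

    Hypothetical helper: bits, iters, and lr are illustrative defaults.
    Returns (best_offsets, rtn_loss, best_loss), where rtn_loss is the loss
    of plain round-to-nearest (all offsets equal to zero).
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    scale = max(abs(w) for w in weights) / qmax  # assumes a nonzero weight

    offsets = [0.0] * len(weights)
    deq = quantize(weights, offsets, scale, qmin, qmax)
    rtn_loss = best_loss = layer_loss(weights, deq, inputs)
    best_offsets = list(offsets)

    for _ in range(iters):
        # Straight-through estimate: treat d(dequantized w_i)/d(offset_i) as scale.
        grads = [0.0] * len(weights)
        for x in inputs:
            err = sum(xi * (d - w) for xi, d, w in zip(x, deq, weights))
            for i, xi in enumerate(x):
                grads[i] += 2.0 * err * xi * scale

        # Sign-gradient step: only the gradient's sign is used, then clip.
        offsets = [
            max(-0.5, min(0.5, v - lr * sign(g)))
            for v, g in zip(offsets, grads)
        ]

        deq = quantize(weights, offsets, scale, qmin, qmax)
        loss = layer_loss(weights, deq, inputs)
        if loss < best_loss:  # keep the best offsets seen so far
            best_loss, best_offsets = loss, list(offsets)

    return best_offsets, rtn_loss, best_loss


if __name__ == "__main__":
    random.seed(0)
    w = [random.uniform(-1.0, 1.0) for _ in range(16)]
    xs = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
    _, rtn, tuned = tune_rounding(w, xs)
    print(f"round-to-nearest loss: {rtn:.6f}, tuned loss: {tuned:.6f}")
```

Because the sketch keeps the best offsets seen during tuning, the returned loss is never worse than plain round-to-nearest. How much it improves depends on the weights and calibration inputs.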
Key quotes
AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).
It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and providing broad hardware compatibility.
Block-wise FP8 quantization is available via --scheme FP8_BLOCK --iters 0 --disable_opt_rtn.
MTP layer quantization has been supported…
