Intel's AutoRound: Open-Source Quantization Toolkit for Low-Bit LLM and VLM Inference

lastdong

1mo ago · 8 min read


Summary

AutoRound is an advanced quantization toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs) developed by Intel. It achieves high accuracy at ultra-low bit widths (2–4 bits) using sign-gradient descent, with broad hardware compatibility across CPU, XPU, and CUDA. The toolkit supports multi-datatype formats and integrates with popular frameworks like vLLM, SGLang, and Transformers. Recent updates include block-wise FP8 quantization and MTP layer quantization support.
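To make the sign-gradient idea concrete, here is a minimal, pure-Python sketch of tuning a per-weight rounding offset with signed gradient steps. It is an illustration under stated assumptions, not AutoRound's actual implementation: the function names (`fake_quant`, `sign_round`), the single-row toy layer, the symmetric 4-bit range, the step size of `1 / iters`, and the straight-through gradient are all choices made for this example.

```python
import random


def fake_quant(w, scale, v, qmin=-8, qmax=7):
    """Quantize then dequantize: round(w/scale + v), clamp to the int range, rescale."""
    return [max(qmin, min(qmax, round(wi / scale + vi))) * scale
            for wi, vi in zip(w, v)]


def output_error(w, wq, xs):
    """Squared error between full-precision and quantized layer outputs."""
    return sum(
        sum(x_j * (q_j - w_j) for x_j, q_j, w_j in zip(x, wq, w)) ** 2
        for x in xs
    )


def sign(g):
    return (g > 0) - (g < 0)


def sign_round(w, xs, iters=200, qmin=-8, qmax=7):
    """Tune a rounding offset v in [-0.5, 0.5] per weight with signed gradient steps."""
    scale = max(abs(wi) for wi in w) / qmax  # assumes w is not all zeros
    lr = 1.0 / iters                         # step size: an assumption of this sketch
    v = [0.0] * len(w)                       # v = 0 is plain round-to-nearest (RTN)
    rtn_err = output_error(w, fake_quant(w, scale, v, qmin, qmax), xs)
    best_v, best_err = list(v), rtn_err
    for _ in range(iters):
        wq = fake_quant(w, scale, v, qmin, qmax)
        grad = [0.0] * len(w)
        for x in xs:
            residual = sum(x_j * (q_j - w_j) for x_j, q_j, w_j in zip(x, wq, w))
            for j, x_j in enumerate(x):
                # Straight-through estimator: treat round() and the clamp as identity,
                # so d(wq_j)/d(v_j) is approximated by scale.
                grad[j] += 2.0 * residual * x_j * scale
        # Only the sign of the gradient is used, never its magnitude.
        v = [max(-0.5, min(0.5, v_j - lr * sign(g_j))) for v_j, g_j in zip(v, grad)]
        err = output_error(w, fake_quant(w, scale, v, qmin, qmax), xs)
        if err < best_err:                   # keep the best offsets seen so far
            best_v, best_err = list(v), err
    return best_v, rtn_err, best_err


random.seed(0)
w = [random.uniform(-1, 1) for _ in range(16)]                      # toy weight row
xs = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]   # toy calibration inputs
v, rtn_err, tuned_err = sign_round(w, xs)
print(f"RTN error: {rtn_err:.4f}  tuned error: {tuned_err:.4f}")
```

Because the loop starts from round-to-nearest and keeps the best offsets seen, the tuned error can never exceed the RTN baseline in this sketch. How much it improves depends on the weights and the calibration data.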

Key quotes


AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).

It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and providing broad hardware compatibility.

Block-wise FP8 quantization is available via --scheme FP8_BLOCK --iters 0 --disable_opt_rtn.

MTP layer quantization has been supported.
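For readers unfamiliar with block-wise FP8, the sketch below shows the general idea behind the `FP8_BLOCK` scheme quoted above: each tile of the weight matrix gets its own scale, and values are rounded to nearest without tuning, which is what `--iters 0` implies. This is a pure-Python emulation written for illustration only. The tile size, the E4M3 number format, and the helper names `round_to_e4m3` and `quantize_blockwise` are assumptions, not details confirmed by the source.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format (assumed format)


def round_to_e4m3(x):
    """Round a float to the nearest value on an emulated E4M3 grid."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    _, e = math.frexp(abs(x))   # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # -6 is the smallest normal exponent; below it is subnormal
    step = 2.0 ** (exp - 3)     # 3 mantissa bits give 8 steps per power of two
    return max(-E4M3_MAX, min(E4M3_MAX, round(x / step) * step))


def quantize_blockwise(weight, block=2):
    """Quantize a 2-D list tile by tile; return dequantized weights and per-tile scales."""
    rows, cols = len(weight), len(weight[0])
    deq = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            tile = [(r, c)
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            amax = max(abs(weight[r][c]) for r, c in tile)
            # Map the tile's largest magnitude onto the top of the FP8 range.
            scale = amax / E4M3_MAX if amax > 0 else 1.0
            scales[(r0 // block, c0 // block)] = scale
            for r, c in tile:
                deq[r][c] = round_to_e4m3(weight[r][c] / scale) * scale
    return deq, scales


# Toy 4x4 matrix with one large outlier in the top-left tile.
W = [
    [0.01, 50.0, 0.30, -0.20],
    [0.02, -0.03, 0.10, 0.25],
    [0.50, -0.40, 0.001, 0.002],
    [0.45, 0.35, -0.003, 0.004],
]
W_deq, tile_scales = quantize_blockwise(W, block=2)
for key in sorted(tile_scales):
    print(key, tile_scales[key])
```

The point of per-tile scales is that the outlier only coarsens the grid for its own tile. The other three tiles keep scales matched to their own, much smaller, value ranges.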

Snippet from the RSS feed

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers....
