Intel's AutoRound: Open-Source Quantization Toolkit for Low-Bit LLM and VLM Inference

By lastdong · 1mo ago · 8 min read · en · Code

Summary

AutoRound is an advanced quantization toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs) developed by Intel. It achieves high accuracy at ultra-low bit widths (2–4 bits) using sign-gradient descent, with broad hardware compatibility across CPU, XPU, and CUDA. The toolkit supports multi-datatype formats and integrates with popular frameworks like vLLM, SGLang, and Transformers. Recent updates include block-wise FP8 quantization and MTP layer quantization support.
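
To make "sign-gradient descent" concrete, here is a minimal NumPy sketch of the general idea: learn a small per-weight rounding offset by stepping along the sign of the gradient of the layer's output error, instead of always rounding to nearest. It is a toy illustration under stated assumptions, not AutoRound's implementation. The function names (`fake_quant`, `output_error`, `tune_rounding`), the symmetric per-row scale, the straight-through gradient, the linearly decaying step size, and the random data are all invented for this example.

```python
import numpy as np

def fake_quant(w, scale, v, bits=4):
    """Quantize then dequantize w, shifting the rounding point by offset v."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale + v), qmin, qmax) * scale

def output_error(w, w_q, x):
    """Mean squared difference between original and quantized layer outputs."""
    return float(np.mean((x @ w.T - x @ w_q.T) ** 2))

def tune_rounding(w, x, bits=4, iters=200):
    """Toy sign-gradient tuning of rounding offsets (assumed setup, not the library's)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # one scale per output row
    v = np.zeros_like(w)                                  # v = 0 is round-to-nearest
    best_v = v.copy()
    best_err = output_error(w, fake_quant(w, scale, v, bits), x)
    base_lr = 1.0 / iters
    for step in range(iters):
        w_q = fake_quant(w, scale, v, bits)
        err = x @ w_q.T - x @ w.T                # output error, shape (samples, out)
        # Straight-through estimate: treat round() as identity, so d(w_q)/d(v) ~ scale.
        grad = (err.T @ x) * scale
        lr = base_lr * (1.0 - step / iters)      # linearly decaying step size
        # Only the sign of the gradient is used; offsets stay within half a level.
        v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)
        cur = output_error(w, fake_quant(w, scale, v, bits), x)
        if cur < best_err:                       # keep the best offsets seen so far
            best_err, best_v = cur, v.copy()
    return fake_quant(w, scale, best_v, bits), best_err

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))    # hypothetical layer weights (out_features, in_features)
x = rng.normal(size=(64, 16))   # hypothetical calibration activations
scale = np.abs(w).max(axis=1, keepdims=True) / 7
rtn_err = output_error(w, fake_quant(w, scale, np.zeros_like(w)), x)
w_q, tuned_err = tune_rounding(w, x)
print(f"round-to-nearest error: {rtn_err:.5f}, tuned error: {tuned_err:.5f}")
```

Because the loop starts from round-to-nearest and keeps the best offsets it has seen, the tuned error in this sketch can never be worse than the round-to-nearest baseline; how much it improves depends on the data.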

Key quotes

AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).
It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and providing broad hardware compatibility.
Block-wise FP8 quantization is available via --scheme FP8_BLOCK --iters 0 --disable_opt_rtn.
MTP layer quantization has been support
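
The block-wise FP8 quote above can be illustrated with a small NumPy sketch of what block-wise scaling with plain rounding looks like: each tile of the weight matrix gets its own scale, and values are rounded to the nearest point on an FP8 grid without any tuning iterations. This is a simulation in float64 under assumptions that the source does not state: the E4M3 format (maximum 448), the 128 x 128 tile size, and the helper names `round_to_e4m3` and `quantize_fp8_blockwise` are chosen for illustration and may not match what the `FP8_BLOCK` scheme actually does.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format (assumed format)

def round_to_e4m3(x):
    """Round values to the nearest point on a simulated FP8 E4M3 grid."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.minimum(np.abs(x), FP8_E4M3_MAX)   # saturate instead of overflowing
    _, e = np.frexp(mag)                        # mag = m * 2**e with m in [0.5, 1)
    exp = np.maximum(e - 1, -6)                 # -6 is the smallest normal exponent
    step = np.ldexp(1.0, exp - 3)               # 3 mantissa bits: 8 steps per binade
    return np.sign(x) * np.minimum(np.round(mag / step) * step, FP8_E4M3_MAX)

def quantize_fp8_blockwise(w, block=128):
    """Quantize and dequantize w with one scale per (block x block) tile."""
    rows, cols = w.shape
    n_r, n_c = -(-rows // block), -(-cols // block)   # ceiling division
    scales = np.empty((n_r, n_c))
    w_dq = np.empty((rows, cols), dtype=np.float64)
    for bi, r in enumerate(range(0, rows, block)):
        for bj, c in enumerate(range(0, cols, block)):
            tile = w[r:r + block, c:c + block]
            amax = np.abs(tile).max()
            scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0  # map tile max to 448
            scales[bi, bj] = scale
            w_dq[r:r + block, c:c + block] = round_to_e4m3(tile / scale) * scale
    return w_dq, scales

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 384))          # hypothetical weight matrix
w_dq, scales = quantize_fp8_blockwise(w)
print("scale grid shape:", scales.shape)
print("max abs error:", np.abs(w - w_dq).max())
```

Keeping one scale per tile means an outlier only coarsens the grid for its own tile rather than for the whole tensor, which is the usual motivation for block-wise schemes.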
Snippet from the RSS feed
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers....
