All Topics

Technology

Art

Memory-Efficient Two-Pass Lanczos Algorithm Implementation in Rust

lukefleed

6mo ago· 31 min readenInsight

100/100

Golden Brown

Bagelometer↗

A five-star bake. Worth schmearing, sharing, saving.

Score100TypeanalysisSentimentneutral

Summary

This technical article presents a memory-efficient implementation of the Lanczos algorithm in Rust, addressing the significant memory requirements of the standard method. The author explores a two-pass variant that reduces memory usage from O(n×k) to O(n) at the cost of doubling matrix-vector products, demonstrating how careful implementation can make this trade-off more favorable than theoretical analysis suggests. The article focuses on algorithm engineering and low-level optimization details for high-performance computing applications.

Key quotes

· 4 pulled

The standard Lanczos method for computing matrix functions has a brutal memory requirement: storing an n×k basis matrix that grows with every iteration.

In this post, we will explore one of the most straightforward solutions to this problem: a two-pass variant of the Lanczos algorithm that only requires O(n) memory at the cost of doubling the number of matrix-vector products.

The surprising part is that when implemented carefully, the two-pass version isn't just memory-efficient—it can be surprisingly performant despite the theoretical trade-off.

How a bit of algorithm engineering and low-level details can alter what seems like a straightforward trade-off on a blackboard.

Snippet from the RSS feed

How a bit of algorithm engineering and low-level details can alter what seems like a straightforward trade-off on a blackboard.

You might also wanna read

NumKong: A Comprehensive Collection of 2,000 SIMD Kernels for Mixed-Precision Numerical Computing

The article announces the rebranding of the SimSIMD project to NumKong, which is described as a comprehensive collection of approximately 2,

ashvardanian.com·2mo ago

Ironkernel: Python DSL That Compiles to Parallel Rust for High-Performance Computing

Ironkernel is a Python DSL (Domain Specific Language) that allows developers to write NumPy-like element-wise expressions in Python, which t

github.com·2mo ago

Cimba: High-Performance Discrete Event Simulation Library in C with Multithreading and Coroutines

Cimba is a high-performance discrete event simulation library written in C and assembly that uses POSIX pthreads for parallelized replicatio

github.com·3mo ago

MpGEMM: Optimizing General Matrix Multiplication for ARM's Scalable Matrix Extension Architecture

This research paper presents MpGEMM, an open-source library that optimizes General Matrix Multiplication (GEMM) for ARM's Scalable Matrix Ex

arxiv.org·4mo ago

Breakthrough: 1.3-Second Cross-Machine Weight Transfer for Trillion-Parameter AI Models

Researchers have achieved ultra-fast 1.3-second cross-machine parameter updates for trillion-parameter AI models (Kimi-K2 with 1T parameters

research.perplexity.ai·4mo ago

Re-evaluating Warp Specialization for Modern Tensor Core GPUs

This technical blog post examines the necessity of warp specialization for high-performance kernels on modern Tensor Core GPUs like NVIDIA's

rohany.github.io·8mo ago