NumKong: A Comprehensive Collection of 2,000 SIMD Kernels for Mixed-Precision Numerical Computing
By ashvardanian
Summary
The article announces that the SimSIMD project is being retired and relaunched as NumKong, a collection of roughly 2,000 SIMD kernels for mixed-precision numerical computing. The project spans about 200,000 lines of code and documentation across 7 programming languages and covers numeric types from Float6 to Float118. It targets multiple hardware architectures and instruction-set extensions, including RISC-V, Intel AMX, Arm SME, and WebAssembly Relaxed SIMD. The library provides BLAS-like functionality for dot products, batched GEMMs, distance calculations, geospatial computations, ColBERT MaxSim, and mesh alignment. The kernels are tested against in-house 118-bit floating-point numbers and profiled for both numerical stability and speed.
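To make the idea of mixed-precision kernels checked against a higher-precision reference concrete, here is a minimal sketch in plain Python. It is not NumKong's API: the helper names (`round_to`, `dot_narrow`, `dot_widened`, `dot_exact`) are hypothetical, and exact rational arithmetic from the standard library stands in for the project's 118-bit reference type. It compares a dot product over float16 inputs accumulated in float16 with the same product accumulated in float32.

```python
import struct
from fractions import Fraction


def round_to(fmt, x):
    """Round a Python float to the nearest value in `fmt` ('e' = float16, 'f' = float32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dot_narrow(a, b):
    """Hypothetical kernel: every product and partial sum is rounded back to float16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to("e", acc + round_to("e", x * y))
    return acc


def dot_widened(a, b):
    """Hypothetical kernel: float16 inputs, partial sums kept in float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to("f", acc + x * y)
    return acc


def dot_exact(a, b):
    """Reference result in exact rational arithmetic (assumed stand-in for a wide float)."""
    return sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))


# Inputs are first rounded to float16, so all three functions see identical values.
a = [round_to("e", 0.1)] * 4096
b = [round_to("e", 0.1)] * 4096

exact = dot_exact(a, b)
err_narrow = abs(Fraction(dot_narrow(a, b)) - exact)
err_widened = abs(Fraction(dot_widened(a, b)) - exact)

print("float16 accumulation error:", float(err_narrow))
print("float32 accumulation error:", float(err_widened))
```

Under these assumptions the widened accumulator lands far closer to the exact result, which is the kind of stability difference a high-precision reference is meant to expose.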
Key quotes
I'm killing my SimSIMD project and re-launching under a new name — NumKong — StringZilla's big brother.
Around 2'000 SIMD kernels for mixed precision numerics, spread across 200'000 lines of code & docstrings, for 7 programming languages.
One of the larger collections online — comparable to OpenBLAS, the default NumPy BLAS (Basic Linear Algebra Subprograms) backend.
All of that tested against in-house 118-bit floating point numbers and heavily profiled for both numerical stability and speed.
Around 2'000 SIMD kernels for mixed-precision BLAS-like numerics — dot products, batched GEMMs, distances, geospatial, ColBERT MaxSim, and mesh alignment — from Float6 to Float118.
