All Topics

Technology

Art

Optimizing Matrix Multiplication in Swift for LLM Training on Apple Silicon

Matt Gallagher

21d ago· 26 min readen

100/100

Golden Brown

Bagelometer↗

Crisp on the outside, thoughtful on the inside. A keeper.

Score100Typehow-toSentimentpositive

Summary

This article explores optimizing handwritten matrix multiplication code in Swift for training Large Language Models on Apple Silicon. It covers 10 different implementations ranging from plain C and Swift to Metal, focusing on performance improvements from Gflop/s to Tflop/s. The author provides insight into key optimization steps for mathematical code in Swift and explains the capabilities of different Apple Silicon units including CPU, SIMD, AMX, and GPU. This is the first part in a series about training neural networks in Swift on Apple Silicon.

Key quotes

· 3 pulled

The aim is to give some insight into the key steps for optimizing mathematics code in Swift.

I also hope that these examples will offer a sense of scale about the capabilities of the different units on Apple Silicon – CPU, SIMD, AMX and GPU.

10 implementations of handwritten matrix multiplication: from plain C and Swift through to Metal

Snippet from the RSS feed

10 implementations of handwritten matrix multiplication: from plain C and Swift through to Metal

You might also wanna read

DeepSeek-V4: Hybrid Sparse-Attention Architecture Enables Efficient Million-Token Context Inference

DeepSeek-V4 introduces a hybrid sparse-attention architecture combined with on-policy distillation across domain specialists, enabling 1M-to

artgor.medium.com·17h ago

Rotary GPU: Enabling Large Mixture-of-Experts Models on Consumer Laptop GPUs with Limited Memory

This paper presents Rotary GPU, an exploratory approach to running large Mixture-of-Experts (MoE) language models on consumer-grade hardware

arxiv.org·1d ago

LinkedIn cuts GPU training hours by 65% with Generative Recommender system optimizations

LinkedIn has developed a Generative Recommender (GR) system that models user activity as token sequences, offering richer long-context perso

startuphub.ai·3d ago

Rank-Aware Decomposition Technique Reduces Computation in Recommender Systems by 87.5%

This paper presents a rank-aware decomposition technique for deep ranking models in industrial recommender systems. The key insight is that

arxiv.org·4d ago

Hands-on evaluation of MiniMax M2.7 via API on ML and coding workflows

The author evaluates MiniMax M2.7 by using it through Claude Code on three real-world ML and coding workflows: scaffolding a Kaggle competit

andlukyane.com·12d ago

Orthrus: A Dual-Architecture Framework for Fast, Lossless LLM Inference via Diffusion Decoding

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to enable fast, lossless parallel token gen

github.com·16d ago