MpGEMM: Optimizing General Matrix Multiplication for ARM's Scalable Matrix Extension Architecture

General Matrix Multiplication (GEMM) is a critical kernel in high-performance computing and deep learning. While modern architectures like ARM's Scalable Matrix Extension (SME) introduce dedicated…


matt_d·5mo ago·1 min read

technology, programming, computer architecture, high-performance computing

