All Topics

Technology

Art

Hypura: Storage-Tier-Aware LLM Inference Scheduler for Apple Silicon Enables Running Large Models Beyond Physical Memory Limits

tatef

2mo ago· 5 min readenCode

100/100

Golden Brown

Bagelometer↗

The kind of bagel that ruins lesser bagels for you.

Score100TypenewsSentimentpositive

Summary

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that enables running large language models that exceed physical memory capacity. It intelligently places model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and hardware capabilities, allowing models like 31GB Mixtral 8x7B to run on 32GB Mac hardware at usable speeds where vanilla implementations would crash.

Key quotes

· 4 pulled

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon.

It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.

Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity.

Snippet from the RSS feed

Run models too big for your Mac's memory. Contribute to t8/hypura development by creating an account on GitHub.

You might also wanna read

DeepSeek-V4: Hybrid Sparse-Attention Architecture Enables Efficient Million-Token Context Inference

DeepSeek-V4 introduces a hybrid sparse-attention architecture combined with on-policy distillation across domain specialists, enabling 1M-to

artgor.medium.com·3h ago

Rotary GPU: Enabling Large Mixture-of-Experts Models on Consumer Laptop GPUs with Limited Memory

This paper presents Rotary GPU, an exploratory approach to running large Mixture-of-Experts (MoE) language models on consumer-grade hardware

arxiv.org·1d ago

LinkedIn cuts GPU training hours by 65% with Generative Recommender system optimizations

LinkedIn has developed a Generative Recommender (GR) system that models user activity as token sequences, offering richer long-context perso

startuphub.ai·3d ago

Rank-Aware Decomposition Technique Reduces Computation in Recommender Systems by 87.5%

This paper presents a rank-aware decomposition technique for deep ranking models in industrial recommender systems. The key insight is that

arxiv.org·3d ago

Hands-on evaluation of MiniMax M2.7 via API on ML and coding workflows

The author evaluates MiniMax M2.7 by using it through Claude Code on three real-world ML and coding workflows: scaffolding a Kaggle competit

andlukyane.com·11d ago

Orthrus: A Dual-Architecture Framework for Fast, Lossless LLM Inference via Diffusion Decoding

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to enable fast, lossless parallel token gen

github.com·16d ago