
Hypura: Storage-Tier-Aware LLM Inference Scheduler for Apple Silicon Enables Running Large Models Beyond Physical Memory Limits

By tatef

2mo ago · 5 min read · en · Code

Summary

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that runs large language models exceeding physical memory capacity. It places model tensors across GPU, RAM, and NVMe storage tiers based on access patterns, bandwidth costs, and hardware capabilities. For example, a 31 GB Mixtral 8x7B runs on a 32 GB Mac Mini at 2.2 tok/s, where vanilla llama.cpp crashes.
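
To make the tier-placement idea concrete, here is a minimal sketch in Python. It is not Hypura's actual algorithm or API: the `Tensor` type, the `accesses_per_token` estimate, the `place_tensors` function, and the tier capacities are all hypothetical, chosen only to illustrate a greedy "hottest tensors go to the fastest tier that still has room" policy.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_gb: float
    # Hypothetical estimate of how often the tensor is read per generated token.
    accesses_per_token: float


# Tiers ordered from fastest to slowest (illustrative names only).
TIERS = ["gpu", "ram", "nvme"]


def place_tensors(tensors, capacities_gb):
    """Greedily assign each tensor to the fastest tier with enough free space.

    Tensors are visited hottest first, so frequently read weights claim the
    fast tiers and rarely read weights spill to NVMe.
    """
    free = dict(capacities_gb)  # copy so the caller's dict is not mutated
    placement = {}
    for t in sorted(tensors, key=lambda t: t.accesses_per_token, reverse=True):
        for tier in TIERS:
            if t.size_gb <= free[tier]:
                placement[t.name] = tier
                free[tier] -= t.size_gb
                break
        else:
            raise MemoryError(f"no tier has room for {t.name}")
    return placement


if __name__ == "__main__":
    # Made-up sizes: attention weights are read every token, while each
    # mixture-of-experts block is read only when the router selects it.
    model = [
        Tensor("attn", 4.0, 1.0),
        Tensor("embed", 2.0, 1.0),
        Tensor("expert_0", 10.0, 0.25),
        Tensor("expert_1", 10.0, 0.25),
    ]
    print(place_tensors(model, {"gpu": 8.0, "ram": 10.0, "nvme": 100.0}))
```

A real scheduler would also weigh the bandwidth cost of each tier, as the summary notes; this sketch ranks by access frequency alone to keep the example short.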

Key quotes

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon.
It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.
Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.
Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity.
Snippet from the RSS feed
Run models too big for your Mac's memory. Contribute to t8/hypura development by creating an account on GitHub.
