Guide to Running Google Gemma 4 AI Model Locally with LM Studio CLI on macOS
By vbtechguy
Summary
This article is a technical guide to running Google's 26B-parameter Gemma 4 model locally using LM Studio's new headless CLI tools. It explains the advantages of local AI models over cloud APIs: cost savings, privacy, and freedom from rate limits. It then details how to set up Gemma 4 on macOS hardware, highlighting the mixture-of-experts architecture that lets the 26B model run efficiently by activating only 4B parameters per forward pass. The guide closes with practical setup instructions for using the locally served model with Claude Code.
Key quotes
Cloud AI APIs are great until they are not. Rate limits, usage costs, privacy concerns, and network latency all add up.
For quick tasks like code review, drafting, or testing prompts, a local model that runs entirely on your hardware has real advantages: zero API costs, no data leaving your machine, and consistent availability.
Google's Gemma 4 is interesting for local use because of its mixture-of-experts architecture. The 26B parameter model only activates 4B parameters per forward pass, which means it runs well on hardware that could never handle a dense 26B model.
LM Studio 0.4.0 introduced llmster and the lms CLI. Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.
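To make the mixture-of-experts point concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the original guide. The 26B total and 4B active figures come from the article; the bit widths are illustrative assumptions, and the helper name `weight_memory_gib` is hypothetical. The estimate covers weights only and ignores the KV cache, activations, and runtime overhead, and real quantized formats usually spend slightly more than their nominal bits per weight.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (hypothetical helper)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


TOTAL_PARAMS_B = 26.0   # total parameters, per the article
ACTIVE_PARAMS_B = 4.0   # parameters activated per forward pass, per the article

# Bit widths below are assumptions chosen for illustration only.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    resident = weight_memory_gib(TOTAL_PARAMS_B, bits)
    active = weight_memory_gib(ACTIVE_PARAMS_B, bits)
    print(f"{label}: ~{resident:.1f} GiB of weights in memory, "
          f"~{active:.1f} GiB of weights used per forward pass")
```

The takeaway, under these assumptions, is that the full 26B weights still need to fit in memory, while the per-token compute and weight reads scale with the 4B active parameters. That is why the model can feel much faster than a dense model of the same total size, even though its memory footprint is not correspondingly smaller.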
