LiteLLM Seeks First Reliability Engineer for AI Gateway Infrastructure
By ij23
A good honest bake. Not flashy, but you'll finish the whole bagel.
Summary
LiteLLM, an open-source AI gateway with $7M ARR and 36K+ GitHub stars, is hiring its first dedicated reliability engineer. The role is split 60% operational reliability and 40% performance engineering, on a system that routes hundreds of millions of LLM API calls daily for organizations such as NASA, Adobe, Netflix, Stripe, and Nvidia. The engineer will own production stability, performance optimization, and observability for one of the most widely deployed AI infrastructure projects. The challenges include memory management in long-running Python async services, database scaling, and support for 100+ AI provider APIs.
Key quotes
When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.
You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.
We route traffic for some of the largest AI deployments on the planet. One customer is scaling from 20M to 200M daily AI calls through our gateway.
The problems here are genuinely hard: Memory management in long-running Python async services — our proxy handles thousands of concurrent streaming connections.
Scale & impact: Your work is in the critical path for hundreds of millions of AI API calls daily. NASA, Netflix, Adobe, Stripe depend on this.