LiteLLM Seeks First Reliability Engineer for AI Gateway Infrastructure

ij23

3mo ago · 3 min read · en

Score: 75 · Type: press release · Sentiment: positive

Summary

LiteLLM, an open-source AI gateway with $7M ARR and 36K+ GitHub stars, is hiring its first dedicated reliability engineer. The role is split 60% operational reliability and 40% performance engineering, on a system that routes hundreds of millions of LLM API calls daily for companies including NASA, Adobe, Netflix, Stripe, and Nvidia. The engineer will own production stability, performance optimization, and observability for one of the most widely deployed AI infrastructure projects. Named challenges include memory management in long-running Python async services, database scaling, and support for 100+ AI provider APIs.

Key quotes

When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.

You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.

We route traffic for some of the largest AI deployments on the planet. One customer is scaling from 20M to 200M daily AI calls through our gateway.

The problems here are genuinely hard: Memory management in long-running Python async services — our proxy handles thousands of concurrent streaming connections.

Scale & impact: Your work is in the critical path for hundreds of millions of AI API calls daily. NASA, Netflix, Adobe, Stripe depend on this.
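
For a sense of scale, the daily volumes in the quotes above can be converted into average request rates. This is only a back-of-the-envelope sketch: it assumes perfectly uniform traffic over 24 hours, which the posting does not claim, and real traffic is bursty, so peak rates would be higher. The function name `average_rps` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(calls_per_day: int) -> float:
    """Average requests per second, assuming perfectly uniform traffic (an assumption)."""
    return calls_per_day / SECONDS_PER_DAY


# Daily volumes taken from the quoted customer example (20M scaling to 200M).
for label, daily in [("today", 20_000_000), ("target", 200_000_000)]:
    print(f"{label}: {daily:,} calls/day ~ {average_rps(daily):,.0f} req/s on average")
```

Under that uniform-traffic assumption, 20M calls per day averages about 231 requests per second, and 200M calls per day about 2,315.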

Snippet from the RSS feed

TLDR LiteLLM is an open-source AI gateway (36K+ GitHub stars) that routes hundreds of millions of LLM API calls daily for companies like NASA, Adobe, Netflix, Stripe, and Nvidia. We're at $7M ARR, 10 people, YC W23. When LiteLLM goes down, our customers'
