Why average CPU utilization is a misleading metric for cloud-native applications
By JeremyTheo
Summary
The article examines the pitfalls of relying on average CPU utilization in cloud-native environments, particularly Kubernetes. Its case study is a Go application that hit context deadline exceeded errors in production because of CPU throttling; the failures could not be reproduced in development or testing. The article explains how throttling under the Linux Completely Fair Scheduler (CFS) works and argues that averaged CPU graphs hide individual throttling events, which leads to misleading conclusions about application performance. It recommends finer-grained metrics, such as CPU throttling time and utilization percentiles, for diagnosing these problems accurately.
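The gap between an averaged graph and per-period throttling can be illustrated with a small simulation. This is a simplified sketch, not the kernel's algorithm and not code from the article: it assumes a 100 ms enforcement period (the usual CFS default), a hypothetical quota of 50 ms per period (roughly a 0.5 CPU limit), and an invented workload that is mostly idle with an occasional burst. The names periodStats and simulate are made up for illustration.

```go
package main

import "fmt"

// periodStats summarizes a simulated window of quota enforcement periods.
type periodStats struct {
	AvgUtilization   float64 // CPU time used divided by CPU time allowed
	ThrottledPeriods int     // periods in which demand exceeded the quota
	TotalPeriods     int
}

// simulate applies a fixed per-period quota to a demand series.
// demandMs[i] is the CPU time (ms) the workload wants in period i.
// Work that does not fit in a period is carried into the next one,
// which is a rough stand-in for threads being paused until the
// quota refills.
func simulate(demandMs []float64, quotaMs float64) periodStats {
	var used, backlog float64
	throttled := 0
	for _, d := range demandMs {
		want := backlog + d
		run := want
		if run > quotaMs {
			run = quotaMs // quota exhausted: the rest waits
			throttled++
		}
		backlog = want - run
		used += run
	}
	return periodStats{
		AvgUtilization:   used / (quotaMs * float64(len(demandMs))),
		ThrottledPeriods: throttled,
		TotalPeriods:     len(demandMs),
	}
}

func main() {
	// 600 periods of 100 ms = 60 s. Assumed workload: 5 ms of CPU per
	// period, with a 200 ms burst every 20th period.
	demand := make([]float64, 600)
	for i := range demand {
		demand[i] = 5
		if i%20 == 0 {
			demand[i] = 200
		}
	}
	s := simulate(demand, 50)
	fmt.Printf("average utilization: %.1f%%\n", s.AvgUtilization*100)
	fmt.Printf("throttled periods:   %d of %d\n", s.ThrottledPeriods, s.TotalPeriods)
}
```

Under these assumed numbers the one-minute average comes out just under 30% of the limit, yet 120 of the 600 periods are throttled, and each burst takes over 400 ms of wall-clock time to complete instead of fitting into one or two periods. On a real node the corresponding counters are exposed in the container's cgroup cpu.stat file (nr_periods, nr_throttled, and the throttled time).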
Key quotes
A Go function in our application kept getting cancelled in production.
The function had a tight timeout. The same code ran fine in our development setups, in our CI and CD pipelines, in every integration test we had.
In production it would sometimes blow past the timeout and die with context deadline exceeded.
What made it worse was the state machine library we used. When its context got cancelled, it wouldn't recover on its own.
We couldn't reproduce it.
