Why average CPU utilization is a misleading metric for cloud-native applications

JeremyTheo

9d ago · 14 min read · en · Insight

Score: 100 · Type: analysis · Sentiment: negative

Summary

The article explains why average CPU utilization is a misleading metric in cloud-native environments, particularly Kubernetes. Its case study is a Go application that hit context deadline exceeded errors in production because of CPU throttling, a failure the team could not reproduce in development or testing. The article describes how throttling under the Linux Completely Fair Scheduler (CFS) works and argues that averaged CPU graphs hide throttling events, which leads to wrong conclusions about application performance. It advocates finer-grained metrics, such as CPU throttling time and utilization percentiles, for diagnosing performance problems accurately.
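The gap between an average and what happens inside individual scheduler periods can be illustrated with a small simulation. This is a minimal sketch, not the article's own analysis: the 100 ms quota, the demand pattern, and the names `summarize` and `exampleDemand` are all assumptions made up for illustration, and the carry-over model is a simplification of real CFS bandwidth control.

```go
package main

import (
	"fmt"
	"sort"
)

// periodStats summarizes CPU usage over a run of CFS enforcement periods.
type periodStats struct {
	AvgUtilization float64 // total CPU time used / total quota available
	P99Utilization float64 // 99th percentile (nearest rank) of per-period usage / quota
	ThrottledRatio float64 // fraction of periods that ended with work still waiting
}

// summarize replays per-period CPU demand (ms) against a fixed quota (ms).
// Demand that does not fit into a period is carried over to the next one,
// a simplified model of what throttling does to a busy container.
func summarize(demandMs []int, quotaMs int) periodStats {
	if len(demandMs) == 0 || quotaMs <= 0 {
		return periodStats{}
	}
	used := make([]int, 0, len(demandMs))
	backlog, total, throttled := 0, 0, 0
	for _, d := range demandMs {
		backlog += d
		u := backlog
		if u > quotaMs {
			u = quotaMs // the quota caps what can run in this period
		}
		backlog -= u
		if backlog > 0 {
			throttled++ // work was left waiting for the next period
		}
		used = append(used, u)
		total += u
	}
	sort.Ints(used)
	n := len(used)
	idx := (99*n+99)/100 - 1 // nearest-rank index for the 99th percentile
	return periodStats{
		AvgUtilization: float64(total) / float64(quotaMs*n),
		P99Utilization: float64(used[idx]) / float64(quotaMs),
		ThrottledRatio: float64(throttled) / float64(n),
	}
}

// exampleDemand builds 100 hypothetical periods: mostly quiet (10 ms of
// demand), with a 250 ms multi-threaded burst every 20th period.
func exampleDemand() []int {
	demand := make([]int, 100)
	for i := range demand {
		demand[i] = 10
		if i%20 == 0 {
			demand[i] = 250
		}
	}
	return demand
}

func main() {
	s := summarize(exampleDemand(), 100)
	fmt.Printf("average utilization: %.0f%%\n", s.AvgUtilization*100)
	fmt.Printf("p99 utilization:     %.0f%%\n", s.P99Utilization*100)
	fmt.Printf("throttled periods:   %.0f%%\n", s.ThrottledRatio*100)
}
```

With these made-up numbers the average comes out at 22%, which looks healthy on a dashboard, while 10% of periods are throttled and the 99th percentile sits at 100% of quota.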

Key quotes · 5 pulled

A Go function in our application kept getting cancelled in production.

The function had a tight timeout. The same code ran fine in our development setups, in our CI and CD pipelines, in every integration test we had.

In production it would sometimes blow past the timeout and die with context deadline exceeded.

What made it worse was the state machine library we used. When its context got cancelled, it wouldn't recover on its own.

We couldn't reproduce it.
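The failure mode these quotes describe can be sketched in a few lines of Go. This is a hypothetical stand-in, not the author's code: `runWithDeadline` and the durations are assumptions, and `time.After` merely simulates work whose wall-clock duration grows when the process is paused by throttling. The deadline is measured in wall-clock time, so it keeps running while the process is not.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithDeadline simulates a unit of work that needs `work` of wall-clock
// time and runs it under a context with the given timeout. It returns nil if
// the work finishes first, or the context's error if the deadline wins.
func runWithDeadline(timeout, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // always release the timer behind the context

	select {
	case <-time.After(work): // work completed in time
		return nil
	case <-ctx.Done(): // deadline passed before the work completed
		return ctx.Err()
	}
}

func main() {
	// Unthrottled case: the work comfortably fits inside the timeout.
	fmt.Println("fast run:", runWithDeadline(200*time.Millisecond, 10*time.Millisecond))

	// Throttled case (simulated): the same work now takes longer in
	// wall-clock terms than the timeout allows.
	err := runWithDeadline(10*time.Millisecond, 200*time.Millisecond)
	fmt.Println("slow run:", err, "| deadline exceeded:", errors.Is(err, context.DeadlineExceeded))
}
```

Callers that need to recover, rather than stay cancelled as the state machine library in the quotes did, can check for the condition with `errors.Is(err, context.DeadlineExceeded)` and retry with a fresh context.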

Snippet from the RSS feed

How CFS throttling works and the case against the average CPU graph.
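On Linux, the kernel reports throttling through counters in the cgroup's `cpu.stat` file; under cgroup v2 these include `nr_periods`, `nr_throttled`, and `throttled_usec`. The following is a minimal sketch of reading them, assuming cgroup v2 field names. The sample values are invented, and `parseCPUStat` and `throttleStats` are illustrative names, not part of any library.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// throttleStats holds the throttling counters found in a cgroup v2 cpu.stat file.
type throttleStats struct {
	Periods       uint64 // nr_periods: enforcement periods that have elapsed
	Throttled     uint64 // nr_throttled: periods in which the group was throttled
	ThrottledUsec uint64 // throttled_usec: total time spent throttled, in microseconds
}

// parseCPUStat extracts the throttling counters from cpu.stat content.
// Lines with other keys (usage_usec and so on) are ignored.
func parseCPUStat(content string) (throttleStats, error) {
	var s throttleStats
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		var target *uint64
		switch fields[0] {
		case "nr_periods":
			target = &s.Periods
		case "nr_throttled":
			target = &s.Throttled
		case "throttled_usec":
			target = &s.ThrottledUsec
		default:
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return s, fmt.Errorf("parse %s: %w", fields[0], err)
		}
		*target = v
	}
	return s, sc.Err()
}

// Ratio returns the fraction of periods that were throttled.
func (s throttleStats) Ratio() float64 {
	if s.Periods == 0 {
		return 0
	}
	return float64(s.Throttled) / float64(s.Periods)
}

// sampleCPUStat is invented data in the cgroup v2 cpu.stat layout.
const sampleCPUStat = `usage_usec 2200000
user_usec 1800000
system_usec 400000
nr_periods 100
nr_throttled 10
throttled_usec 450000
`

func main() {
	s, err := parseCPUStat(sampleCPUStat)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("throttled in %d of %d periods (ratio %.2f), %d usec throttled\n",
		s.Throttled, s.Periods, s.Ratio(), s.ThrottledUsec)
}
```

The counters are cumulative, so in practice you would take two readings and work with the difference between them rather than the raw totals.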
