
Why average CPU utilization is a misleading metric for cloud-native applications

By JeremyTheo

9d ago · 14 min read · en · Insight

Summary

The article examines the pitfalls of relying on average CPU utilization in cloud-native environments, particularly Kubernetes. Its case study is a Go application that hit context deadline exceeded errors in production because of CPU throttling, a failure that could not be reproduced in development or testing. The article explains how throttling under the Linux Completely Fair Scheduler (CFS) works and argues that average CPU graphs hide throttling events, so they give a misleading picture of application performance. It advocates finer-grained metrics, such as CPU throttling time and utilization percentiles, for diagnosing performance issues accurately.
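To make the summary's central claim concrete, here is a minimal sketch of how a low average can coexist with throttling. It is not the article's code. It assumes a simplified, single-threaded model of CFS bandwidth control: each period (100 ms by default on Linux) the container may use at most a fixed quota of CPU time, and any demand above the quota is pushed into the next period. The function name summarize, the 50 ms quota, and the demand figures are all hypothetical.

```go
package main

import "fmt"

// summarize applies a simplified CFS bandwidth model (an assumption, not the
// kernel's exact behavior): in each period the container may run for at most
// quotaMs of CPU time. Demand above the quota is deferred to the next period,
// and that period counts as throttled.
func summarize(demandMs []float64, quotaMs float64) (avgUtil float64, throttled int) {
	var used, backlog float64
	for _, d := range demandMs {
		want := d + backlog // new demand plus work deferred earlier
		run := want
		if run > quotaMs {
			run = quotaMs
			throttled++
		}
		backlog = want - run
		used += run
	}
	// Average utilization relative to the limit, across all periods.
	avgUtil = used / (quotaMs * float64(len(demandMs)))
	return avgUtil, throttled
}

func main() {
	// Hypothetical CPU demand per 100 ms period: mostly idle, one burst.
	demand := []float64{5, 5, 5, 120, 5, 5, 5, 5, 5, 5}
	avg, throttled := summarize(demand, 50)
	fmt.Printf("average utilization of limit: %.0f%%, throttled periods: %d of %d\n",
		avg*100, throttled, len(demand))
}
```

With these made-up numbers the container averages about a third of its limit over one second, yet two of the ten periods are throttled, and the burst that needed 120 ms of CPU finishes well over 200 ms of wall-clock time after it started. An average CPU graph would show only the first number.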

Key quotes

A Go function in our application kept getting cancelled in production.
The function had a tight timeout. The same code ran fine in our development setups, in our CI and CD pipelines, in every integration test we had.
In production it would sometimes blow past the timeout and die with context deadline exceeded.
What made it worse was the state machine library we used. When its context got cancelled, it wouldn't recover on its own.
We couldn't reproduce it.
Snippet from the RSS feed
How CFS throttling works and the case against the average CPU graph.
