
LinkedIn Engineers Diagnose Kernel Lock Contention Behind Recurring Database Freezes

4d ago · 2 min read · en · Insight

Summary

LinkedIn engineers investigated recurring short-lived database outages that caused the user feed to become unavailable for 10–15 seconds before recovering, with no useful logs or clear triggers. Traditional monitoring failed to identify the root cause, so engineers analyzed OS and runtime behavior during the freezes. They found that incident timing correlated with momentary spikes in memory allocation, followed by stabilization at a higher memory baseline. After ruling out CPU throttling, memory fragmentation, compaction, and file I/O, engineers built an automated "trap" that detected a freeze and immediately captured off-CPU profiling data to diagnose the kernel lock contention issue.
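The memory pattern in the summary (a momentary allocation spike that then settles at a higher baseline) can be told apart from an ordinary transient spike with a simple windowed check. The following is only a minimal sketch, not the engineers' method: the function name `find_baseline_shifts`, the window size, and the 10% threshold are all assumptions made for illustration.

```python
from statistics import median


def find_baseline_shifts(samples, window=5, min_shift=0.10):
    """Return indices where memory usage steps up and stays up.

    Hypothetical helper: `samples` is a list of memory readings taken at a
    fixed interval. An index i is reported when every reading in the next
    `window` samples sits at least `min_shift` (as a fraction) above the
    median of the previous `window` samples. A spike that falls back to the
    old level does not qualify, because the lowest reading after it is
    still at the old baseline.
    """
    shifts = []
    i = window
    while i + window <= len(samples):
        before = median(samples[i - window:i])
        after_floor = min(samples[i:i + window])
        if before > 0 and (after_floor - before) / before >= min_shift:
            shifts.append(i)
            i += window  # skip ahead so one step is reported once
        else:
            i += 1
    return shifts
```

Comparing the timestamps of the reported indices with the timestamps of the freezes is one way to check the kind of correlation the summary describes.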

Key quotes

Short-lived, recurring database outages made the user feed unavailable for 10–15 seconds before it recovered, leaving no useful logs and no clear external trigger.
Conventional monitoring and metrics did not reveal the root cause, so engineers investigated OS and runtime behavior during the freezes.
Incident timing correlated with momentary spikes in memory allocation, followed by stabilization at a higher memory baseline, while CPU throttling, memory fragmentation and compaction, and file I/O were ruled out.
Engineers built an automated 'trap' that detected a freeze and immediately captured an off-CPU profile.
Snippet from the RSS feed
Off-CPU eBPF profiling was triggered instantly during brief database feed outages to capture blocked threads and identify root cause when logs and metrics failed.
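Because each freeze lasts only 10–15 seconds, the capture has to start while threads are still blocked, not after a slow request finally returns. One way to sketch such a trap is a heartbeat watchdog: a probe loop records a heartbeat after each successful cheap query, and a separate watcher fires a capture callback once the heartbeat goes stale. The class `FreezeTrap`, its parameters, and the callback contract below are hypothetical; the source does not describe how the actual trap was built or which profiler command it ran.

```python
import threading
import time


class FreezeTrap:
    """Fire a capture callback while a stall is still in progress.

    Hypothetical sketch: `capture` is any callable that starts profiling
    (for example by launching an off-CPU profiler as a subprocess). It
    receives the number of seconds since the last heartbeat.
    """

    def __init__(self, capture, stall_threshold_s=2.0, cooldown_s=60.0,
                 clock=time.monotonic):
        self._capture = capture
        self._threshold = stall_threshold_s
        self._cooldown = cooldown_s
        self._clock = clock
        self._last_beat = clock()
        self._last_fire = None
        self._lock = threading.Lock()

    def beat(self):
        """Called by the probe loop after each successful health check."""
        with self._lock:
            self._last_beat = self._clock()

    def poll(self):
        """Called frequently by the watcher; returns True if capture fired."""
        now = self._clock()
        with self._lock:
            stalled_for = now - self._last_beat
            if stalled_for < self._threshold:
                return False
            # The cooldown keeps one long freeze from starting many captures.
            if (self._last_fire is not None
                    and now - self._last_fire < self._cooldown):
                return False
            self._last_fire = now
        # Run the callback outside the lock so a slow capture cannot block beat().
        self._capture(stalled_for)
        return True
```

In use, one thread would run the probe and call `beat()` in a loop while another calls `poll()` every hundred milliseconds or so. The watcher should do as little work as possible so that it stays responsive during the stall it is meant to catch.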
