LinkedIn Engineers Diagnose Kernel Lock Contention Behind Recurring Database Freezes
Summary
LinkedIn engineers investigated recurring short-lived database outages that caused the user feed to become unavailable for 10–15 seconds before recovering, with no useful logs or clear triggers. Traditional monitoring failed to identify the root cause, so engineers analyzed OS and runtime behavior during the freezes. They found that incident timing correlated with momentary spikes in memory allocation, followed by stabilization at a higher memory baseline. After ruling out CPU throttling, memory fragmentation, compaction, and file I/O, engineers built an automated "trap" that detected a freeze and immediately captured off-CPU profiling data to diagnose the kernel lock contention issue.
Key quotes
Short-lived recurring database outages made the user feed unavailable for 10–15 seconds before it recovered, without useful logs or clear external triggers.
Conventional monitoring and metrics did not reveal the root cause, so engineers investigated OS and runtime behavior during the freezes.
Incident timing correlated with momentary spikes in memory allocation, followed by stabilization at a higher memory baseline, while CPU throttling, memory fragmentation and compaction, and file I/O were ruled out.
Engineers built an automated 'trap' that detected a freeze and immediately captured an off-CPU profile.
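
The "trap" idea can be illustrated with a small watchdog. This is a minimal sketch and not LinkedIn's implementation: the class name `FreezeTrap`, the `probe` and `capture` callables, and the threshold and cooldown values are all hypothetical. The probe runs in a background thread, so the trap can fire while the freeze is still in progress rather than after it ends.

```python
import threading
import time
from typing import Callable, Optional


class FreezeTrap:
    """Calls `capture` when `probe` fails to return within a time budget.

    All names and defaults here are illustrative assumptions.
    """

    def __init__(
        self,
        probe: Callable[[], None],     # hypothetical cheap health check, e.g. a trivial query
        capture: Callable[[], None],   # hypothetical action, e.g. start a profiler
        stall_threshold_s: float = 1.0,
        cooldown_s: float = 60.0,
    ) -> None:
        self.probe = probe
        self.capture = capture
        self.stall_threshold_s = stall_threshold_s
        self.cooldown_s = cooldown_s
        self._last_fired: Optional[float] = None

    def check_once(self) -> bool:
        """Run one probe; return True if a capture was triggered."""
        done = threading.Event()

        def run_probe() -> None:
            try:
                self.probe()
            except Exception:
                # A probe that fails fast is an error, not a freeze; ignore it here.
                pass
            finally:
                done.set()

        # Daemon thread: a probe stuck in a frozen call must not block shutdown.
        threading.Thread(target=run_probe, daemon=True).start()

        if done.wait(self.stall_threshold_s):
            return False  # probe returned in time, no freeze observed

        # Probe is still blocked: treat this as a freeze in progress.
        now = time.monotonic()
        if self._last_fired is not None and now - self._last_fired < self.cooldown_s:
            return False  # already captured recently; avoid piling up profiles
        self._last_fired = now
        self.capture()
        return True

    def run_forever(self, interval_s: float = 0.5) -> None:
        """Poll indefinitely, sleeping `interval_s` between probes."""
        while True:
            self.check_once()
            time.sleep(interval_s)
```

In practice `capture` might launch an off-CPU profiler as a subprocess and save its output with a timestamp. The source does not name the tool or the detection thresholds the engineers used, so those choices are left to the caller.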
