Distributed Systems Challenge: Scheduling Stateful Nodes When MMAP Interferes with Memory Accounting
By leo_e
Summary
A technical discussion of a distributed systems problem: memory-mapped files (mmap) make memory accounting inaccurate, which breaks scheduling decisions for stateful nodes. The author describes a cascading failure in which the Coordinator, acting on inaccurate memory usage reports, got stuck in a loop and created a DDoS-like situation. The post asks for advice and war stories on scheduling stateful nodes when mmap makes memory accounting unreliable.
Key quotes
We're hitting a classic distributed systems wall and I'm looking for war stories or 'least worst' practices.
The architecture is standard: a Control Plane (Coordinator) assigns data segments to Worker Nodes. The workload involves heavy use of mmap and lazy loading for large datasets.
We had a cascading failure where the Coordinator got stuck in a loop, DDOS-ing
The Context: We maintain a distributed stateful engine (think search/analytics).
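One way to see why mmap distorts memory accounting is to split a worker's resident set into anonymous and file-backed parts. The sketch below is not the author's method; it is a minimal illustration that assumes a Linux worker whose kernel exposes the `RssAnon`, `RssFile`, and `RssShmem` fields in `/proc/<pid>/status`. The function names and the "reclaimable" label are hypothetical. Treating file-backed pages as reclaimable also assumes the mappings are clean (for example, read-only segments); dirty shared mappings need writeback before they can be dropped.

```python
from pathlib import Path


def parse_status_kb(status_text: str) -> dict[str, int]:
    """Parse 'Key:   123 kB' lines from /proc/<pid>/status into {key: kB}."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only numeric fields reported in kB; skip Name, Threads, etc.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def split_resident_memory_kb(status_text: str) -> dict[str, int]:
    """Hypothetical helper: separate file-backed RSS from the rest.

    Assumption: file-backed pages (mostly mmap'd segments here) are clean
    and can be dropped by the kernel under pressure, so a scheduler may
    want to weigh them differently from anonymous memory.
    """
    f = parse_status_kb(status_text)
    anon = f.get("RssAnon", 0)
    file_backed = f.get("RssFile", 0)
    shmem = f.get("RssShmem", 0)
    return {
        "rss_total": f.get("VmRSS", anon + file_backed + shmem),
        "reclaimable_file": file_backed,
        "non_reclaimable": anon + shmem,
    }


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # Linux only
        print(split_resident_memory_kb(status.read_text()))
```

A worker that reports only total RSS looks nearly full once its mmap'd segments are paged in, even if most of that memory could be evicted. Reporting the two numbers separately lets the Coordinator decide how much of the file-backed portion to count, which is a policy choice rather than something the kernel dictates.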