Jepsen Analysis Reveals Data Loss Vulnerabilities in NATS JetStream 2.12.1
By aphyr
Summary
Jepsen's independent testing of NATS JetStream 2.12.1 found that the distributed streaming system can lose data. JetStream lost writes when data files were truncated or corrupted on a minority of nodes. Coordinated power failures, or an OS crash on a single node combined with network delays or process pauses, could cause the loss of committed writes and persistent split-brain. The data loss was caused, at least in part, by JetStream's default policy of flushing writes to disk every two minutes rather than before acknowledging them. The report also includes a belated note about data loss in version 2.10.22, which was fixed in 2.10.23. NATS has since documented the risk of its default fsync policy, and the remaining issues are under investigation.
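The difference between flushing before acknowledging and flushing on a timer can be illustrated with a small sketch. This is not JetStream's implementation and it does not model replication. `AppendLog`, `sync_every`, and `unsynced` are hypothetical names invented for this example, which assumes a single local append-only file.

```python
import os
import time


class AppendLog:
    """Toy append-only log (hypothetical; not the NATS file store)."""

    def __init__(self, path, sync_every=None):
        # sync_every=None: fsync before every acknowledgement.
        # sync_every=N:    fsync at most once every N seconds.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._sync_every = sync_every
        self._last_sync = time.monotonic()
        self.unsynced = 0  # records acknowledged but not yet fsynced

    def append(self, record: bytes) -> bool:
        view = memoryview(record + b"\n")
        while view:  # os.write may write fewer bytes than requested
            written = os.write(self._fd, view)
            view = view[written:]
        self.unsynced += 1
        now = time.monotonic()
        if self._sync_every is None or now - self._last_sync >= self._sync_every:
            os.fsync(self._fd)  # ask the OS to push buffered data to disk
            self._last_sync = now
            self.unsynced = 0
        return True  # the "acknowledgement" seen by the caller

    def close(self) -> None:
        os.fsync(self._fd)
        os.close(self._fd)
```

With `sync_every=None`, `unsynced` is zero whenever `append` returns. With `sync_every=120`, every record appended since the last flush has been acknowledged while possibly existing only in the operating system's page cache. Those are the records a power failure or OS crash could discard in this sketch.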
Key quotes
We tested NATS JetStream, version 2.12.1, and found that it lost writes if data files were truncated or corrupted on a minority of nodes.
We also found that coordinated power failures, or an OS crash on a single node combined with network delays or process pauses, can cause the loss of committed writes and persistent split-brain.
This data loss was caused (at least in part) by choosing to flush writes to disk every two minutes, rather than before acknowledging them.
NATS has now documented the risk of its default fsync policy, and the remaining issues remain under investigation.
This research was performed independently by Jepsen, without compensation, and conducted in accordance with the Jepsen ethics policy.
