Jepsen Analysis Reveals Data Loss Vulnerabilities in NATS JetStream 2.12.1

aphyr

5mo ago · 21 min read


Summary

Jepsen's independent testing of NATS JetStream 2.12.1, a distributed streaming system, found several ways it can lose data. JetStream lost committed writes when data files were truncated or corrupted on a minority of nodes. Coordinated power failures, or an OS crash on a single node combined with network delays or process pauses, could also cause the loss of committed writes and persistent split-brain. Jepsen attributes this loss, at least in part, to JetStream's default policy of flushing writes to disk every two minutes rather than before acknowledging them. The report also includes a belated note about data loss in version 2.10.22, which was fixed in 2.10.23. NATS has since documented the risk of its default fsync policy, and the remaining issues are under investigation.
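To make the flush-policy point concrete, here is a minimal sketch of why acknowledging a write before flushing it can lose acknowledged data on power failure. It is a toy single-node model, not NATS code: `ToyLog`, `sync_always`, `power_failure`, and `lost_after_crash` are hypothetical names invented for illustration, and the in-memory lists only stand in for the OS page cache and the disk.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """Toy model of one node's log. Hypothetical; not the NATS implementation."""

    sync_always: bool
    page_cache: list = field(default_factory=list)  # written but not yet flushed
    disk: list = field(default_factory=list)        # survives a power failure

    def append(self, msg: str) -> bool:
        """Buffer a write and return True as the acknowledgement."""
        self.page_cache.append(msg)
        if self.sync_always:
            # Flush before acknowledging, so every acked write is durable.
            self.flush()
        return True

    def flush(self) -> None:
        """Stand-in for fsync: move buffered writes to durable storage."""
        self.disk.extend(self.page_cache)
        self.page_cache.clear()

    def power_failure(self) -> None:
        """Unflushed writes vanish; only flushed data remains."""
        self.page_cache.clear()


def lost_after_crash(sync_always: bool) -> list:
    """Return the acknowledged messages that are missing after a crash."""
    log = ToyLog(sync_always=sync_always)
    acked = [m for m in ("a", "b", "c") if log.append(m)]
    # Assumption: the crash happens before any periodic flush has run.
    log.power_failure()
    return [m for m in acked if m not in log.disk]
```

Under these assumptions, `lost_after_crash(False)` reports all three acknowledged messages as lost, while `lost_after_crash(True)` reports none. A real replicated system also has copies on other nodes, so this model only illustrates the single-node window between acknowledgement and flush; the multi-node failure scenarios are described in the report itself.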

Key quotes


We tested NATS JetStream, version 2.12.1, and found that it lost writes if data files were truncated or corrupted on a minority of nodes.

We also found that coordinated power failures, or an OS crash on a single node combined with network delays or process pauses, can cause the loss of committed writes and persistent split-brain.

This data loss was caused (at least in part) by choosing to flush writes to disk every two minutes, rather than before acknowledging them.

NATS has now documented the risk of its default fsync policy, and the remaining issues remain under investigation.

This research was performed independently by Jepsen, without compensation, and conducted in accordance with the Jepsen ethics policy.

Snippet from the RSS feed

NATS is a distributed streaming system. Regular NATS streams offer only best-effort delivery, but a subsystem, called JetStream, guarantees messages are delivered at least once. We tested NATS JetStream, version 2.12.1, and found that it lost writes if da
