Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts Architectures
[Submitted on 17 Jun 2026]
Summary
This paper provides a rigorous geometric and stochastic analysis of discontinuities in Sparse Mixture-of-Experts (SMoE) architectures, which are widely used in state-of-the-art language and vision models. The authors classify discontinuities by order based on tied experts at switching events, establish asymptotic volume estimates showing lower-order discontinuities dominate, and prove that random input paths almost surely hit order-1 discontinuities first with explicit probability bounds. They propose a smoothing mechanism that adds minimal computational overhead while improving continuity and empirical performance across language and vision tasks.
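To make the notion of a routing discontinuity concrete, here is a minimal toy sketch in plain Python. It is not the authors' construction: the gate function `gate_scores`, the scalar `EXPERT_VALUES`, and the helper names are all hypothetical, chosen only to show how top-k selection makes two nearby inputs produce very different outputs, and how the experts tied at a switching point can be counted. The paper's exact mapping from the number of tied experts to an "order" is not given in this summary, so the sketch reports only the tied set. The proposed smoothing mechanism is not described here and is not attempted.

```python
from typing import List, Sequence

# Hypothetical scalar outputs of three toy experts (illustration only).
EXPERT_VALUES = [10.0, -10.0, 0.0]


def gate_scores(x: float) -> List[float]:
    """Hypothetical gate: experts 0 and 1 trade places at x = 0.5."""
    return [x, 1.0 - x, 0.2]


def top_k_experts(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def tied_at_boundary(scores: Sequence[float], k: int, tol: float = 1e-9) -> List[int]:
    """Return the experts whose score is within tol of the k-th largest score.

    Two or more indices signal a tie at the selection boundary, i.e. a point
    where the selected expert set can switch.
    """
    threshold = sorted(scores, reverse=True)[k - 1]
    return [i for i, s in enumerate(scores) if abs(s - threshold) <= tol]


def smoe_output(scores: Sequence[float], expert_values: Sequence[float], k: int) -> float:
    """Toy SMoE map: average the outputs of the k selected experts."""
    chosen = top_k_experts(scores, k)
    return sum(expert_values[i] for i in chosen) / k


if __name__ == "__main__":
    for x in (0.499, 0.501):
        scores = gate_scores(x)
        print(x, top_k_experts(scores, 1), smoe_output(scores, EXPERT_VALUES, 1))
    print("tied at x = 0.5:", tied_at_boundary(gate_scores(0.5), 1))
```

With k = 1, the inputs 0.499 and 0.501 differ by 0.002 yet route to different experts, so the toy output jumps from -10.0 to 10.0. At x = 0.5 exactly two experts tie, while a score vector such as `[0.5, 0.5, 0.5]` would tie three at once. A one-parameter path through the toy gate needs only one equality to hit the first kind of tie but two simultaneous equalities to hit the second, which gives some intuition for why ties among fewer experts are the ones an input is more likely to encounter.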
Key quotes
Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks.
In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts resulting in significantly different outputs.
We first classify them by order, determined by the number of tied experts at a switching event.
These theoretical results imply that inputs are more likely to lie near lower order discontinuities.
Our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.