Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts Architectures

[Submitted on 17 Jun 2026]

Summary

This paper gives a rigorous geometric and stochastic analysis of the discontinuities in Sparse Mixture-of-Experts (SMoE) architectures, which are widely used in state-of-the-art language and vision models. The authors classify discontinuities by order, defined by the number of experts tied at a switching event; establish asymptotic volume estimates showing that lower-order discontinuities dominate; and prove that random input paths almost surely hit an order-1 discontinuity first, with explicit probability bounds. They also propose a smoothing mechanism that adds little computational overhead while enforcing continuity of the SMoE map and improving empirical performance across language and vision tasks.
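
To make the notion of a routing discontinuity concrete, here is a minimal toy sketch in plain Python. It is not the authors' construction: the scalar input, the three router logits, the constant expert outputs, and the names `top_k_indices` and `smoe` are all hypothetical, chosen only so that any jump in the output comes from the Top-k selection itself. Gate weights are assumed to be a softmax renormalised over the selected experts, which is one common convention and may differ from the paper's setup.

```python
import math

def top_k_indices(logits, k):
    # Indices of the k largest logits; ties are broken by the lower index.
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

def smoe(x, k=1):
    # Hypothetical router: three logits that are linear in a scalar input x.
    logits = [x, -x, -1.0]
    # Hypothetical experts with constant outputs, so the experts themselves
    # are continuous and only the routing can introduce a jump.
    expert_out = [1.0, -1.0, 5.0]
    chosen = top_k_indices(logits, k)
    # Softmax renormalised over the selected experts only (assumed convention).
    m = max(logits[i] for i in chosen)
    w = [math.exp(logits[i] - m) for i in chosen]
    z = sum(w)
    return sum(wi / z * expert_out[i] for wi, i in zip(w, chosen))

eps = 1e-6

# Top-1 routing: experts 0 and 1 tie at x = 0, so the selected expert
# switches there and the output jumps from -1.0 to 1.0.
jump = abs(smoe(eps) - smoe(-eps))
print("top-1 jump across x = 0:", jump)

# Top-2 routing: the same tie at x = 0 is harmless because both experts
# stay selected; the switch moves to x = 1, where experts 1 and 2 tie
# for the second slot.
print("top-2 change across x = 0:", abs(smoe(eps, k=2) - smoe(-eps, k=2)))
print("top-2 jump across x = 1:", abs(smoe(1 + eps, k=2) - smoe(1 - eps, k=2)))
```

In this toy setting, two inputs that differ by only 2e-6 produce outputs that differ by 2.0 under Top-1 routing, which is the behaviour the paper describes near a discontinuity surface. Both switching events here involve exactly two tied experts.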

Source

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts Architectures (arxiv.org)

Key quotes

Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks.

In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts resulting in significantly different outputs.

We first classify them by order, determined by the number of tied experts at a switching event.

These theoretical results imply that inputs are more likely to lie near lower order discontinuities.

Our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.
