
Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts Architectures

[Submitted on 17 Jun 2026]


Summary

This paper gives a rigorous geometric and stochastic analysis of discontinuities in Sparse Mixture-of-Experts (SMoE) architectures, which are widely used in state-of-the-art language and vision models. The authors classify discontinuities by order, determined by the number of experts tied at a switching event; establish asymptotic volume estimates showing that lower-order discontinuities dominate; and prove that random input paths almost surely hit an order-1 discontinuity first, with explicit probability bounds. They also propose a smoothing mechanism that adds little computational overhead while enforcing continuity of the SMoE map and improving empirical performance across language and vision tasks.
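To make the discontinuity concrete, here is a minimal sketch of a Top-k routed layer on a one-dimensional input. Everything in it is a hypothetical toy and not taken from the paper: the `ROUTER` and `EXPERTS` weights, the affine experts, and the choice to renormalize gates with a softmax over the selected experts are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy parameters (not from the paper).
# Router score for expert i is w * x + b; expert i outputs a * x + c.
ROUTER = [(1.0, 0.0), (-1.0, 0.0), (0.5, -1.0)]
EXPERTS = [(2.0, 1.0), (-3.0, -1.0), (0.0, 5.0)]

def smoe(x, k=1):
    """Return (output, selected expert indices) for a scalar input x."""
    scores = [w * x + b for w, b in ROUTER]
    # Top-k selection: keep the k experts with the largest router scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Assumed gating: softmax restricted to the selected experts.
    gates = softmax([scores[i] for i in top])
    out = sum(g * (EXPERTS[i][0] * x + EXPERTS[i][1]) for g, i in zip(gates, top))
    return out, sorted(top)

# Two inputs on either side of the tie between experts 0 and 1 at x = 0.
y_right, experts_right = smoe(1e-6)
y_left, experts_left = smoe(-1e-6)
print(experts_left, experts_right, abs(y_right - y_left))
```

The two inputs differ by 2e-6, yet they are routed to different experts and the outputs differ by roughly 2. This is the behavior the summary describes: the surface where router scores tie is where the SMoE map can jump.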

Source

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts Architectures (arxiv.org)

Key quotes

Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks.
In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts resulting in significantly different outputs.
We first classify them by order, determined by the number of tied experts at a switching event.
These theoretical results imply that inputs are more likely to lie near lower order discontinuities.
Our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.
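The quotes classify discontinuities by the number of experts tied at a switching event. A small sketch of how such ties could be counted for a given score vector follows. The function name `tied_at_boundary`, the tolerance `tol`, and the convention of returning the raw count of tied experts are assumptions for illustration; how the paper maps this count to an order is not stated in this summary.

```python
def tied_at_boundary(scores, k, tol=1e-9):
    """Count experts tied at the Top-k selection boundary.

    Returns 0 when the k-th and (k+1)-th largest scores are separated by more
    than tol, meaning the input is not on a switching surface. Otherwise
    returns how many experts share the boundary score, within tol.
    """
    ranked = sorted(scores, reverse=True)
    if k >= len(ranked):
        return 0  # every expert is selected, so no switching can occur
    kth, next_score = ranked[k - 1], ranked[k]
    if kth - next_score > tol:
        return 0  # clear gap at the boundary: selection is locally stable
    return sum(1 for s in scores if abs(s - kth) <= tol)

print(tied_at_boundary([0.9, 0.5, 0.1], k=1))       # no tie at the boundary
print(tied_at_boundary([0.5, 0.5, 0.1], k=1))       # two experts tied
print(tied_at_boundary([0.5, 0.5, 0.5, 0.1], k=2))  # three experts tied
print(tied_at_boundary([0.5, 0.5, 0.1], k=2))       # tie inside the selected set only
```

The last call returns 0 because both tied experts are already selected, so the selected set does not change there. Only ties that straddle the k-th position mark a switching event in this sketch.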
Snippet from the RSS feed
Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection that enables conditional ro
