UniFormer: A Unified Model-Centric Scaling Framework for Industrial Recommendation Systems
[Submitted on 25 Jun 2026]
Summary
This paper introduces UniFormer, a unified model-centric scaling framework for industrial recommender systems developed by Kuaishou. Unlike prior component-centric approaches that scale individual modules independently, UniFormer decomposes the modeling space into feature and task spaces, each modeled by dedicated interaction modules. It introduces a semantic-based tokenization scheme for user-item decoupling to enable request-level inference acceleration, and uses multi-sequence cross-attention to prevent preference collapse by capturing heterogeneous behavior patterns. Multi-view FFNs support flexible parameter scaling. Online A/B testing in two production scenarios (Kuaishou and Kuaishou Lite) demonstrated consistent improvements in user engagement metrics, including +0.101%/+0.260% in App Stay Time and +0.729%/+1.113% in Watch Time.
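To make the idea of request-level inference acceleration concrete, here is a minimal sketch of what user-item decoupling buys at serving time: the user-side computation runs once per request and is reused for every candidate item. This is not the paper's implementation. The function `score_request`, the toy encoders, and the dot-product scoring are all hypothetical stand-ins chosen for illustration, and the summary does not describe how UniFormer's tokenization actually achieves the decoupling.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def score_request(
    user_features: Sequence[float],
    candidate_items: Sequence[Sequence[float]],
    encode_user: Callable[[Sequence[float]], Vector],
    encode_item: Callable[[Sequence[float]], Vector],
) -> List[float]:
    """Score every candidate in one request, encoding the user only once."""
    # User side does not depend on the item, so it is computed once per request.
    user_vec = encode_user(user_features)
    scores = []
    for item in candidate_items:
        # Item side is computed once per candidate.
        item_vec = encode_item(item)
        # Hypothetical scoring rule: a plain dot product.
        scores.append(sum(u * v for u, v in zip(user_vec, item_vec)))
    return scores


# Toy encoders that count how often they are called (assumptions, not real models).
calls: Dict[str, int] = {"user": 0, "item": 0}


def toy_encode_user(x: Sequence[float]) -> Vector:
    calls["user"] += 1
    return [2.0 * v for v in x]


def toy_encode_item(x: Sequence[float]) -> Vector:
    calls["item"] += 1
    return list(x)


scores = score_request(
    [1.0, 0.5],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    toy_encode_user,
    toy_encode_item,
)
print(scores)  # [2.0, 1.0, 3.0]
print(calls)   # {'user': 1, 'item': 3}
```

With three candidates the user encoder runs once rather than three times. In a model where user and item features are entangled from the first layer, that sharing is not possible, which is the cost the decoupling is meant to avoid.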
Key quotes
UniFormer decomposes the overall modeling space into feature and task spaces, which are modeled by stacked Feature-space Interaction Modules and Task-space Interaction Modules, respectively.
UniFormer introduces semantic-based tokenization scheme to enable user-item decoupling, thereby achieving request-level inference acceleration.
To prevent preference collapse, UniFormer employs multi-sequence cross-attention to separately capture heterogeneous behavior patterns, followed by the self-attention to enhance interaction modeling.
Extensive online A/B testing in two production scenarios, Kuaishou and Kuaishou Lite, shows that UniFormer consistently improves user engagement and interaction metrics.
Dedicated multi-view FFNs are introduced to support flexible and scalable parameter scaling across different modeling components.
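The quote about multi-sequence cross-attention followed by self-attention can be illustrated with a small sketch. It assumes, hypothetically, that each behavior sequence gets its own cross-attention pass (and therefore its own softmax), and that the per-sequence outputs are then mixed by self-attention. Learned projections, multiple heads, and normalization are omitted, and the names `attend` and `multi_sequence_block` are invented here. The summary does not specify how the paper wires these steps, so treat this as one plausible reading.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(logits)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return out


def multi_sequence_block(query_tokens: Matrix, sequences: List[Matrix]) -> Matrix:
    """Cross-attend to each behavior sequence separately, then self-attend."""
    # One pass per sequence: attention weights are normalized within a sequence,
    # so a long sequence cannot absorb the weight of a short one.
    per_sequence = [attend(query_tokens, seq, seq) for seq in sequences]
    # Flatten the per-sequence outputs into one token list.
    tokens = [tok for seq_out in per_sequence for tok in seq_out]
    # Self-attention lets the per-sequence summaries interact.
    return attend(tokens, tokens, tokens)


# Toy input: one query token and two behavior sequences of different lengths.
query_tokens = [[1.0, 0.0]]
sequences = [
    [[1.0, 0.0], [0.0, 1.0]],  # e.g. a longer, denser behavior sequence
    [[0.5, 0.5]],              # e.g. a shorter, sparser behavior sequence
]
output = multi_sequence_block(query_tokens, sequences)
print(len(output), len(output[0]))  # 2 2
```

The design point is the separate normalization. If all behaviors were concatenated into a single sequence, one softmax would spread weight across every behavior, and a dense signal could crowd out a sparse one. That is one way to understand the preference collapse the quote refers to.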