Ultralytics YOLO26: A Unified Real-Time Vision Model Family with NMS-Free Inference and Advanced Training Pipeline
By
[Submitted on 2 Jun 2026]
Summary
Ultralytics YOLO26 is a new family of real-time vision models that addresses key limitations of prior YOLO detectors. It introduces a dual-head design for NMS-free end-to-end inference, removes Distribution Focal Loss (DFL) for a lighter head, and incorporates a hybrid Muon-SGD optimizer (MuSGD), Progressive Loss training, and a Small Target Aware Label assignment strategy (STAL) to ensure small objects get positive label coverage. The model family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline. YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto frontier. An open-vocabulary extension, YOLOE-26, enables text-, visual-, and prompt-free inference, reaching 40.6 AP on LVIS minival.
Source
Key quotes
· 5 pulledReal-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware.
YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range.
Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects.
Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors.
The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-, visual-, and prompt-free inference.
You might also wanna read
Lumos-Nexus: A Training-Efficient Two-Stage Framework for High-Fidelity Video Generation with Reasoning Capabilities
Lumos-Nexus is a training-efficient unified video generation framework that addresses the computational challenge of integrating large high-
LoomVideo: A 5B-Parameter Unified Model for Efficient Video Generation and Editing
LoomVideo is a new 5-billion parameter unified architecture for video generation and editing that addresses computational bottlenecks in exi
RTP-LLM: Alibaba's High-Performance Inference Engine for Large Language Model Deployment
This paper presents RTP-LLM, a high-performance inference engine developed by Alibaba for industrial-scale deployment of Large Language Mode
ReMoT: A Reinforcement Learning Framework Using Motion Contrast Triplets to Improve VLM Spatio-Temporal Reasoning
ReMoT (Reinforcement Learning with Motion Contrast Triplets) is a unified training paradigm designed to address spatio-temporal consistency
LK Losses: A New Training Objective to Optimize Acceptance Rate in Speculative Decoding for LLMs
This paper introduces LK losses, a novel training objective for speculative decoding in large language models (LLMs). Speculative decoding a
PromptEmbedder: A Dual-LLM Framework for Efficient, Architecture-Agnostic Text Embedding
The article presents PromptEmbedder, a novel dual-LLM framework for efficient and transferable text embedding. It addresses the bottleneck o
Comments
Sign in to join the conversation.
No comments yet. Be the first.
