MJEPA: A Unified Single-Encoder Architecture for Self-Supervised Audio-Visual Learning
By
[Submitted on 23 Jun 2026]
Summary
This paper introduces MJEPA (Multimodal Joint-Embedding Predictive Architecture), a self-supervised learning method for audio-visual representation learning. Unlike existing approaches that use separate modality-specific encoders and complex combinations of contrastive or reconstruction objectives, MJEPA uses a single unified encoder for both audio and visual modalities with only one predictive objective applied within and across modalities. The key finding is that cross-modal prediction is critical — without it, performance degrades below unimodal baselines. Results show the frozen ViT-g model outperforms prior frozen baselines by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
Source
Key quotes
· 3 pulledWe introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities.
We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other.
Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
You might also wanna read
LeJEPA: A Theoretically Grounded Self-Supervised Learning Framework for AI Representation Learning
Researchers present LeJEPA, a theoretically grounded self-supervised learning framework that addresses limitations in Joint-Embedding Predic
VideoMLA: Low-Rank Latent KV Cache Reduces Memory by 92.7% for Minute-Scale Video Diffusion
This paper introduces VideoMLA, the first application of Multi-Head Latent Attention (MLA) to video diffusion models. It replaces per-head k
Lumos-Nexus: A Training-Efficient Two-Stage Framework for High-Fidelity Video Generation with Reasoning Capabilities
Lumos-Nexus is a training-efficient unified video generation framework that addresses the computational challenge of integrating large high-
Using Vision-Language Models to Segment Robot Demonstration Videos into Subtask Annotations
This article presents a benchmark and field report on using Vision-Language Models (VLMs) to segment robot demonstration videos and egocentr
E-VAds: A New Benchmark for Understanding E-Commerce Short Videos Using Multi-Modal LLMs
This paper introduces E-VAds, the first benchmark specifically designed for understanding e-commerce short videos. The authors propose a mul
DatBench: A New Framework for More Faithful and Efficient Vision-Language Model Evaluation
The article introduces DatBench, a new evaluation framework for vision-language models (VLMs) that addresses critical issues in current eval

Comments
Sign in to join the conversation.
No comments yet. Be the first.