MJEPA: A Unified Single-Encoder Architecture for Self-Supervised Audio-Visual Learning

[Submitted on 23 Jun 2026]


Summary

This paper introduces MJEPA (Multimodal Joint-Embedding Predictive Architecture), a self-supervised method for audio-visual representation learning. Unlike existing approaches, which pair separate modality-specific encoders with complex combinations of contrastive or reconstruction objectives, MJEPA uses a single unified encoder for both audio and visual inputs and a single predictive objective applied within and across modalities. The key finding is that cross-modal prediction is critical: without it, the shared encoder degrades below unimodal baselines, and with it, each modality's representation benefits from the other. The frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
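
To make the "one encoder, one predictive objective" idea concrete, here is a toy NumPy sketch of a latent-space predictive loss applied within and across modalities. It is not the paper's implementation. The mean-pooling predictor, the squared-error loss, the separate target-encoder weights, the masking scheme, and every name and size below are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical embedding width


def encode(tokens, weights):
    """Shared encoder stand-in: the same weight matrix is used for every modality."""
    return np.tanh(tokens @ weights)


def predict(context_emb, predictor_w, n_targets):
    """Predictor stand-in: pool the context, map it, and repeat it for each target slot."""
    pooled = context_emb.mean(axis=0)
    return np.tile(pooled @ predictor_w, (n_targets, 1))


def predictive_loss(context, targets, enc_w, target_enc_w, predictor_w):
    """Squared error between predicted and target embeddings (latent space, not raw inputs)."""
    pred = predict(encode(context, enc_w), predictor_w, len(targets))
    tgt = encode(targets, target_enc_w)  # assumed to be held fixed during a real update
    return float(np.mean((pred - tgt) ** 2))


def mjepa_style_loss(audio, video, mask_a, mask_v, enc_w, target_enc_w, predictor_w):
    """Return (within-modality, cross-modality) loss terms for one audio-video pair."""
    a_ctx, a_tgt = audio[~mask_a], audio[mask_a]
    v_ctx, v_tgt = video[~mask_v], video[mask_v]
    args = (enc_w, target_enc_w, predictor_w)
    within = predictive_loss(a_ctx, a_tgt, *args) + predictive_loss(v_ctx, v_tgt, *args)
    across = predictive_loss(a_ctx, v_tgt, *args) + predictive_loss(v_ctx, a_tgt, *args)
    return within, across


# Toy inputs: 8 audio tokens and 12 video tokens, already embedded to DIM features.
audio = rng.normal(size=(8, DIM))
video = rng.normal(size=(12, DIM))
mask_a = np.arange(8) % 2 == 0   # hide every other audio token
mask_v = np.arange(12) >= 8      # hide the last four video tokens

enc_w = rng.normal(scale=0.1, size=(DIM, DIM))
target_enc_w = enc_w.copy()      # placeholder for a separately maintained target encoder
predictor_w = rng.normal(scale=0.1, size=(DIM, DIM))

within, across = mjepa_style_loss(
    audio, video, mask_a, mask_v, enc_w, target_enc_w, predictor_w
)
print(f"within-modality loss: {within:.4f}, cross-modality loss: {across:.4f}")
```

In this sketch, dropping the `across` term corresponds to the ablation the summary describes, in which a shared encoder trained without cross-modal prediction falls below unimodal baselines.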

Source

MJEPA: A Unified Single-Encoder Architecture for Self-Supervised Audio-Visual Learning (arxiv.org)

Key quotes


We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities.

We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other.

Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.

Snippet from the RSS feed

Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a

