All Topics
All Topics
Technology
Technology
AI
AI
Business
Business
Entertainment
Entertainment
News
News
Programming
Programming
Security
Security
Science
Science
Design
Design
Environment
Environment
Finance
Finance
Crypto
Crypto
Politics
Politics
Sports
Sports
Education
Education
Gaming
Gaming
Art
Art
Music
Music
Health
Health
Books
Books
Food
Food
Travel
Travel
Personal
Personal
Bluesky
Twitter

MJEPA: A Unified Single-Encoder Architecture for Self-Supervised Audio-Visual Learning

By

[Submitted on 23 Jun 2026]

8d ago· 2 min readenInsight

Summary

This paper introduces MJEPA (Multimodal Joint-Embedding Predictive Architecture), a self-supervised learning method for audio-visual representation learning. Unlike existing approaches that use separate modality-specific encoders and complex combinations of contrastive or reconstruction objectives, MJEPA uses a single unified encoder for both audio and visual modalities with only one predictive objective applied within and across modalities. The key finding is that cross-modal prediction is critical — without it, performance degrades below unimodal baselines. Results show the frozen ViT-g model outperforms prior frozen baselines by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.

Source

Twitter / XMJEPA: A Unified Single-Encoder Architecture for Self-Supervised Audio-Visual Learningarxiv.org

Key quotes

· 3 pulled
We introduce MJEPA, a joint-embedding predictive architecture for audio-visual learning that uses a single, unified encoder for both modalities.
We show that cross-modal prediction is critical: without it, a shared encoder degrades below unimodal baselines; with it, each modality's representation benefits from the other.
Our frozen ViT-g model outperforms the best prior frozen baseline by over 6.8 mAP on AudioSet-20K, surpasses fully finetuned models on ESC-50 and FSD50K, and is competitive on video benchmarks despite using 10x less video data.
Snippet from the RSS feed
Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a

You might also wanna read

LeJEPA: A Theoretically Grounded Self-Supervised Learning Framework for AI Representation Learning

Researchers present LeJEPA, a theoretically grounded self-supervised learning framework that addresses limitations in Joint-Embedding Predic

arxiv.org·7mo ago

VideoMLA: Low-Rank Latent KV Cache Reduces Memory by 92.7% for Minute-Scale Video Diffusion

This paper introduces VideoMLA, the first application of Multi-Head Latent Attention (MLA) to video diffusion models. It replaces per-head k

arxiv.org·1mo ago

Lumos-Nexus: A Training-Efficient Two-Stage Framework for High-Fidelity Video Generation with Reasoning Capabilities

Lumos-Nexus is a training-efficient unified video generation framework that addresses the computational challenge of integrating large high-

arxiv.org·1mo ago

Using Vision-Language Models to Segment Robot Demonstration Videos into Subtask Annotations

This article presents a benchmark and field report on using Vision-Language Models (VLMs) to segment robot demonstration videos and egocentr

macrodata.co·5d ago

E-VAds: A New Benchmark for Understanding E-Commerce Short Videos Using Multi-Modal LLMs

This paper introduces E-VAds, the first benchmark specifically designed for understanding e-commerce short videos. The authors propose a mul

arxiv.org·15d ago

DatBench: A New Framework for More Faithful and Efficient Vision-Language Model Evaluation

The article introduces DatBench, a new evaluation framework for vision-language models (VLMs) that addresses critical issues in current eval

arxiv.org·5mo ago

Comments

Sign in to join the conversation.

No comments yet. Be the first.