iLLaDA: An 8B Masked Diffusion Language Model Trained with Bidirectional Attention

[Submitted on 24 Jun 2026]

Summary

The paper introduces iLLaDA, an 8-billion-parameter masked diffusion language model trained from scratch with fully bidirectional attention, as an alternative to standard autoregressive models with causal attention. It was pre-trained on 12 trillion tokens and fine-tuned on a 25-billion-token instruction corpus for 12 epochs, keeping the masked diffusion objective in both stages. Compared with its predecessor LLaDA, iLLaDA-Base gains 21.6 points on BBH and 14.9 on ARC-Challenge, and iLLaDA-Instruct gains 14.5 points on MATH and 16.5 on HumanEval. The model remains competitive with Qwen2.5 7B despite being trained without autoregressive factorization.
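For readers unfamiliar with the masked diffusion objective, here is a minimal sketch of one training step. It follows the formulation published for the earlier LLaDA model (mask each token independently with probability t, then score only the masked positions, weighted by 1/t); this summary does not state iLLaDA's exact loss, so treat the details as an assumption. The names `MASK`, `mask_tokens`, `masked_diffusion_loss`, and `predict_proba` are hypothetical and chosen for illustration only.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, for illustration only


def mask_tokens(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, t, predict_proba):
    """Cross-entropy over masked positions only, weighted by 1/t.

    predict_proba(noisy, i, tok) is a hypothetical stand-in for the model:
    it returns the probability of the original token `tok` at position i,
    given the whole noisy sequence (left and right context, hence
    bidirectional attention rather than a causal mask).
    """
    total = 0.0
    for i, (tok, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            total -= math.log(predict_proba(noisy, i, tok))
    # Normalise by sequence length; the 1/t weight compensates for the
    # expected fraction of positions that were masked.
    return total / (t * len(tokens))


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    t = rng.random()  # masking ratio; assumed drawn uniformly, as in LLaDA
    noisy = mask_tokens(tokens, t, rng)
    # Toy "model" that spreads probability uniformly over a 5-word vocabulary.
    uniform = lambda seq, i, tok: 1.0 / 5
    print(noisy, masked_diffusion_loss(tokens, noisy, t, uniform))
```

Unlike next-token prediction, every masked position is predicted from the full surrounding context in a single pass, which is why no causal attention mask is needed.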

Source

iLLaDA: An 8B Masked Diffusion Language Model Trained with Bidirectional Attention (arxiv.org)

Key quotes

We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention.

iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs.

Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval.

These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models.

Snippet from the RSS feed

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked

You might also wanna read

Fast-dLLM: Training-Free Acceleration Method for Diffusion Language Models Using KV Cache and Parallel Decoding

Researchers introduce Fast-dLLM, a training-free acceleration method for diffusion-based large language models that addresses their slower i

arxiv.org·8mo ago

Google's DiffusionGemma achieves 4x faster text generation using diffusion-based parallel token generation

DiffusionGemma is a new text generation model from Google that achieves up to 4x faster inference speeds compared to traditional autoregress

blog.google·14d ago

Three training-time interventions improve diffusion-based speculative decoding by 21-76%

This paper presents an empirical analysis of three training-time interventions to improve speculative decoding with diffusion language model

arxiv.org·9h ago

Zebra-Llama: Efficient Hybrid Language Models Combining SSMs and Attention Layers

Researchers propose Zebra-Llama, a family of hybrid language models (1B, 3B, 8B) that combine State Space Models (SSMs) and Multi-head Laten

arxiv.org·6mo ago

Google's DiffusionGemma achieves 4x faster text generation using diffusion-based approach

DiffusionGemma is a new text generation model from Google that achieves up to 4x faster inference speeds compared to traditional autoregress

deepmind.google·10d ago

Consistency Diffusion Language Models Achieve 14x Faster Inference Through KV Caching and Step Reduction

Consistency Diffusion Language Models (CDLM) represent a breakthrough in language model architecture that addresses key limitations of stand

together.ai·4mo ago
