iLLaDA: An 8B Masked Diffusion Language Model Trained with Bidirectional Attention
[Submitted on 24 Jun 2026]
Summary
The paper introduces iLLaDA, an 8-billion-parameter masked diffusion language model trained from scratch with fully bidirectional attention, as an alternative to standard autoregressive models. It was pre-trained on 12 trillion tokens and then fine-tuned for 12 epochs on a 25-billion-token instruction corpus, keeping the masked diffusion objective in both stages. Compared with its predecessor LLaDA, it improves across general, mathematical, and code benchmarks: iLLaDA-Base gains 21.6 points on BBH and 14.9 on ARC-Challenge, and iLLaDA-Instruct gains 14.5 on MATH and 16.5 on HumanEval. It also remains competitive with Qwen2.5 7B despite not being trained autoregressively.
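To make "masked diffusion objective" concrete, here is a minimal sketch of how such a loss is commonly formulated for LLaDA-style models: tokens are masked independently with probability t, and the model is scored on the masked positions with a 1/t weight. This page does not state iLLaDA's exact objective, so the weighting, the `MASK` id, and the helper names (`mask_tokens`, `masked_diffusion_loss`, `uniform_predictor`) are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical mask-token id, chosen only for this sketch


def mask_tokens(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, t, predict_probs):
    """Cross-entropy on masked positions only, weighted by 1/t and
    averaged over the sequence length (assumed LLaDA-style weighting)."""
    # The predictor receives the whole noisy sequence at once, so every
    # position can condition on both left and right context.
    probs = predict_probs(noisy)
    total = 0.0
    for i, (clean, current) in enumerate(zip(tokens, noisy)):
        if current == MASK:
            total -= math.log(probs[i][clean])
    return total / (t * len(tokens))


def uniform_predictor(vocab_size):
    """Stand-in for a trained model: assigns equal probability to every token."""
    def predict(noisy):
        p = 1.0 / vocab_size
        return [{tok: p for tok in range(vocab_size)} for _ in noisy]
    return predict


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [0, 1, 2, 3]
    t = rng.uniform(0.1, 1.0)  # masking ratio for this training example
    noisy = mask_tokens(tokens, t, rng)
    print(noisy, masked_diffusion_loss(tokens, noisy, t, uniform_predictor(4)))
```

In a real training loop the uniform predictor would be replaced by the network's output distribution, and t would be resampled for every example.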
Key quotes
We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention.
iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs.
Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval.
These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models.
