A 13-year-old's $500 journey training a custom language model on a Mamba-2 backbone
By Faris Allafi · July 2026 · Model: hr-diffuse-1-nano on Hugging Face · Discussion on Hacker News
Summary
A 13-year-old developer documents their journey training a custom language model (DIMBA II) using a masked diffusion approach on a bidirectional Mamba-2 backbone, spending $500 of their own money. The model deliberately produces incorrect answers (e.g., "capital of Japan is Paris") as a design choice. The article covers six failed self-correction methods attempted during training, honest results, and technical insights into transformer architecture, diffusion models, and the Mamba architecture. It's a blend of technical report and personal narrative about the challenges of training LLMs from scratch.
Key quotes
I am 13, and I spent hours of my time, and my own money, to train a language model that thinks the capital of Japan is Paris.
First thing you should know: contrary to common belief, the capital of Japan is in fact Tokyo.
You might think I am just building another ChatGPT wrapper, and that could not be farther from the truth.
The transformer architecture, popularized by the paper Attention Is All You Need (Vaswani et al., 2017), is the current SOTA architecture in LLMs.
You might also wanna read
iLLaDA: An 8B Masked Diffusion Language Model Trained with Bidirectional Attention
The paper introduces iLLaDA, an 8-billion parameter masked diffusion language model trained from scratch with fully bidirectional attention,
Mamba Explained: How State Space Models Challenge Transformer Dominance in AI
Mamba is a novel AI model based on State Space Models (SSMs) that emerges as a formidable alternative to Transformer models. It addresses th
LLMs Can Describe Their Own Internal Decision-Making Processes, New Research Shows
This research paper demonstrates that large language models (LLMs) can accurately describe their own internal decision-making processes. The
BabyVision Benchmark Reveals MLLMs Fail at Basic Visual Tasks That 3-Year-Olds Can Solve
This paper introduces BabyVision, a benchmark designed to assess core visual reasoning abilities in Multimodal LLMs (MLLMs) independent of l
New Chinese AI models and Liquid Foundation Models push LLM efficiency and reasoning forward
The article discusses recent developments in language models, highlighting new Chinese models from StepFun and MiniMax that offer affordable
Australian startup Springboards launches Flint, an LLM trained to break out of AI groupthink for creative tasks
Most large language models suffer from "groupthink" — producing predictable, similar responses to open-ended questions. Australian startup S
