Attention Residuals: A Drop-in Replacement for Standard Residual Connections in Transformers

GaggiX

2mo ago· 6 min readenCode

100/100

Golden Brown

Bagelometer↗

Pulled from the oven just right. Trustworthy, fact-dense, deeply satisfying.

Score100TypenewsSentimentneutral

Summary

Attention Residuals (AttnRes) is a novel architectural modification for Transformers that replaces standard residual connections with attention-based mechanisms. Instead of uniform additive accumulation of previous layer outputs, AttnRes enables each layer to selectively attend to and aggregate earlier representations through learned, input-dependent attention over depth. The approach comes in two variants: Full AttnRes where each layer attends over all previous outputs, and Block AttnRes which groups layers into blocks to reduce memory requirements from O(Ld) to O(Nd). This drop-in replacement aims to improve Transformer performance by allowing more sophisticated information flow between layers.

Key quotes

· 3 pulled

This is the official repository for Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.

(a) Standard residuals with uniform additive accumulation. (b) Full AttnRes: each layer attends over all previous outputs. (c) Block AttnRes: layers are grouped into blocks, reducing memory from O(Ld) to O(Nd).

Standard residual connections accumu

Snippet from the RSS feed

Contribute to MoonshotAI/Attention-Residuals development by creating an account on GitHub.

You might also wanna read

DeepSeek-V4: Hybrid Sparse-Attention Architecture Enables Efficient Million-Token Context Inference

DeepSeek-V4 introduces a hybrid sparse-attention architecture combined with on-policy distillation across domain specialists, enabling 1M-to

artgor.medium.com·18h ago

Orthrus: A Dual-Architecture Framework for Fast, Lossless LLM Inference via Diffusion Decoding

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to enable fast, lossless parallel token gen

github.com·16d ago

Hybrid Attention Mechanism Achieves 51x Speedup in Language Model Inference

A developer has created a hybrid attention mechanism for language models by modifying PyTorch and Triton internals. The approach changes the

news.ycombinator.com·1mo ago

Investigating the RYS Method: Testing Layer Duplication Across Modern LLMs

This article explores the RYS (Repeat Your Self) method discovered in Part 1, where duplicating seven middle layers in Qwen2-72B without wei

dnhkng.github.io·2mo ago

Scaling Karpathy's Autoresearch: Parallel GPU Processing Enables New AI Experimentation Strategies

The article describes an experiment where researchers scaled Andrej Karpathy's autoresearch system by giving it access to 16 GPUs on a Kuber

blog.skypilot.co·2mo ago

LLM Circuit Finder: Duplicating Specific Layers in Transformer Models Improves Reasoning Performance Without Training

The article describes a GitHub project called 'llm-circuit-finder' that implements a method for discovering and exploiting 'reasoning circui

github.com·2mo ago