
Attention Residuals: A Drop-in Replacement for Standard Residual Connections in Transformers

By GaggiX · 2mo ago · 6 min read

Summary

Attention Residuals (AttnRes) is an architectural modification for Transformers that replaces standard residual connections with attention over depth. Instead of adding up all previous layer outputs with uniform weight, each layer selectively aggregates earlier representations using learned, input-dependent attention weights. The approach comes in two variants: Full AttnRes, where each layer attends over all previous outputs, and Block AttnRes, which groups layers into blocks to reduce memory from O(Ld) to O(Nd). It is presented as a drop-in replacement for standard residual connections, with the stated goal of letting each layer choose which earlier representations to reuse rather than receiving a fixed, uniform sum.
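The difference between uniform accumulation and attention over depth can be illustrated with a small pure-Python sketch. This is not the repository's implementation: the per-layer `query` vector, the scaled dot-product scoring, and the choice to sum outputs inside each block are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def standard_residual(outputs):
    """Uniform additive accumulation: every earlier output gets weight 1."""
    d = len(outputs[0])
    return [sum(h[i] for h in outputs) for i in range(d)]


def attn_residual(outputs, query):
    """Softmax-weighted mix of earlier outputs (attention over depth).

    `query` is a hypothetical learned vector for the current layer
    (an assumption). The weights are input-dependent because the scores
    are computed from the earlier outputs themselves.
    """
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, h)) / math.sqrt(d) for h in outputs
    ]
    weights = softmax(scores)
    mixed = [sum(w * h[i] for w, h in zip(weights, outputs)) for i in range(d)]
    return mixed, weights


def block_summaries(outputs, block_size):
    """Block variant sketch: keep one vector per block instead of one per layer.

    Summing within a block is an assumption; the point is only that the
    number of stored vectors drops from the layer count to the block count.
    """
    d = len(outputs[0])
    blocks = [
        outputs[i:i + block_size] for i in range(0, len(outputs), block_size)
    ]
    return [[sum(h[i] for h in block) for i in range(d)] for block in blocks]
```

With a zero `query`, every score is equal and `attn_residual` reduces to a plain average of the earlier outputs; a non-zero query shifts weight toward the outputs it aligns with. Passing the result of `block_summaries` to `attn_residual` in place of the per-layer outputs gives a rough picture of attending over blocks rather than over every layer.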

Key quotes
This is the official repository for Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.
(a) Standard residuals with uniform additive accumulation. (b) Full AttnRes: each layer attends over all previous outputs. (c) Block AttnRes: layers are grouped into blocks, reducing memory from O(Ld) to O(Nd).
Standard residual connections accumu
Snippet from the RSS feed
Contribute to MoonshotAI/Attention-Residuals development by creating an account on GitHub.
