
RAGP: A Graph-Based Prompt Compression Method Using Lévy Walk-Guided Pruning

[Submitted on 4 May 2026]

Summary

This paper introduces RAGP (Redundancy-Aware Graph Pruning), a prompt compression method that models text as a multiplex graph rather than a flat token sequence. The graph jointly captures fine-grained attention-based dependencies and coarse-grained semantic relations, and RAGP uses Lévy walks to identify non-redundant nodes so that redundant ones can be pruned. On LongBench, RAGP reaches an average score of 49.3 at a 4x compression ratio, ahead of existing LLM-based methods such as LongLLMLingua, which scores 48.8 at a lower 3x compression ratio.

Source

RAGP: A Graph-Based Prompt Compression Method Using Lévy Walk-Guided Pruning (arxiv.org)

Key quotes

Existing prompt compression methods treat text as flat token sequences, failing to capture the distributed nature of important information, which is often spread across multiple locations and connected through both local syntactic dependencies and global semantic relations.
We propose RAGP, which formulates prompt compression as Redundancy-Aware Graph Pruning on a multiplex graph that jointly models fine-grained attention-based dependencies and coarse-grained semantic relations.
To efficiently identify non-redundant nodes in this heterogeneous structure (dense local subgraphs and sparse global connections), we employ Levy walks whose heavy-tailed step distribution naturally balances local exploitation with global exploration.
Experiments on LongBench show that RAGP achieves an average score of 49.3 under a 4x compression ratio, outperforming existing LLM-based compression methods, such as LongLLMLingua, which attains 48.8 at a 3x compression ratio.
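
The Lévy walk idea in the quotes above can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the toy graph, the power-law exponent `alpha`, the hop cap `max_hops`, and the helper names `levy_step_length` and `levy_walk` are all assumptions made for illustration. The sketch only shows how a heavy-tailed step length yields mostly short moves (local exploitation) with occasional long flights (global exploration).

```python
import random

# Hypothetical toy graph: two dense clusters (0-1-2 and 3-4-5)
# joined by a single sparse bridge edge between nodes 2 and 3.
TOY_GRAPH = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}


def levy_step_length(rng, alpha=1.5, max_hops=8):
    """Sample a hop count from a truncated heavy-tailed (Pareto) law.

    Inverse-transform sampling: if u is uniform on (0, 1], then
    u ** (-1 / alpha) is Pareto distributed with minimum value 1.
    The result is floored to an integer and capped at max_hops.
    """
    u = 1.0 - rng.random()          # rng.random() is in [0, 1), so u is in (0, 1]
    length = u ** (-1.0 / alpha)    # always >= 1, heavy right tail
    return min(int(length), max_hops)


def levy_walk(graph, start, n_steps, rng, alpha=1.5, max_hops=8):
    """Run a walk whose flights have heavy-tailed lengths.

    Each step samples a hop count, follows that many random edges,
    and records the landing node. Returns the list of landing nodes,
    starting with `start`.
    """
    current = start
    landings = [current]
    for _ in range(n_steps):
        hops = levy_step_length(rng, alpha, max_hops)
        for _ in range(hops):
            neighbors = graph[current]
            if not neighbors:       # dead end: stop this flight early
                break
            current = rng.choice(neighbors)
        landings.append(current)
    return landings


if __name__ == "__main__":
    rng = random.Random(0)
    print(levy_walk(TOY_GRAPH, start=0, n_steps=20, rng=rng))
```

With `alpha=1.5`, roughly two thirds of the sampled flights are a single hop, so the walker mostly stays inside a dense cluster, while the rarer long flights give it a chance to cross the bridge. How RAGP actually scores redundancy from such walks is not described in the quoted text and is left out here.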

