RAGP: A Graph-Based Prompt Compression Method Using Lévy Walk-Guided Pruning
[Submitted on 4 May 2026]
Summary
This paper introduces RAGP (Redundancy-Aware Graph Pruning), a prompt compression method that models a prompt as a multiplex graph rather than a flat token sequence. The graph jointly captures fine-grained attention-based dependencies and coarse-grained semantic relations, and RAGP uses Lévy walks to identify the non-redundant nodes to keep, pruning the rest. On LongBench, RAGP reaches an average score of 49.3 at 4x compression, a higher score at a more aggressive compression ratio than LongLLMLingua, an existing LLM-based method that attains 48.8 at 3x.
Key quotes
Existing prompt compression methods treat text as flat token sequences, failing to capture the distributed nature of important information, which is often spread across multiple locations and connected through both local syntactic dependencies and global semantic relations.
We propose RAGP, which formulates prompt compression as Redundancy-Aware Graph Pruning on a multiplex graph that jointly models fine-grained attention-based dependencies and coarse-grained semantic relations.
To efficiently identify non-redundant nodes in this heterogeneous structure (dense local subgraphs and sparse global connections), we employ Levy walks whose heavy-tailed step distribution naturally balances local exploitation with global exploration.
Experiments on LongBench show that RAGP achieves an average score of 49.3 under a 4x compression ratio, outperforming existing LLM-based compression methods, such as LongLLMLingua, which attains 48.8 at a 3x compression ratio.
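To make the idea in the quotes above concrete, here is a minimal Python sketch of a Lévy-style walk over a two-layer graph. It is not the paper's algorithm. Every name in it (build_multiplex, levy_walk_visits, prune) is hypothetical, and several parts are stand-ins chosen only for illustration: adjacency between neighbouring chunks replaces attention-based dependencies, word overlap replaces semantic relations, and visit counts replace whatever redundancy criterion RAGP actually uses. The one ingredient taken from the source is the heavy-tailed step length, drawn here from a Pareto distribution, so that most moves stay local while occasional long steps jump across the sparse global layer.

```python
import math
import random
from collections import Counter


def build_multiplex(chunks, min_overlap=0.2):
    """Return two adjacency dicts over chunk indices: (local, global_)."""
    n = len(chunks)
    # Local layer: neighbouring chunks (assumed stand-in for attention edges).
    local = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    # Global layer: non-adjacent chunks with word overlap
    # (assumed stand-in for semantic edges).
    words = [set(c.lower().split()) for c in chunks]
    global_ = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 2, n):
            union = words[i] | words[j]
            if union and len(words[i] & words[j]) / len(union) >= min_overlap:
                global_[i].append(j)
                global_[j].append(i)
    return local, global_


def levy_walk_visits(local, global_, n_steps=500, alpha=1.5, max_len=8, seed=0):
    """Count node visits for a walk with heavy-tailed (Pareto) step lengths."""
    rng = random.Random(seed)
    current = rng.choice(sorted(local))
    visits = Counter()
    for _ in range(n_steps):
        visits[current] += 1
        # paretovariate returns values >= 1, so length is always >= 1.
        length = min(max_len, int(rng.paretovariate(alpha)))
        if length > 2 and global_[current]:
            # Long step: jump along a global edge (exploration).
            current = rng.choice(global_[current])
        else:
            # Short step: hop along local edges (exploitation).
            for _ in range(length):
                current = rng.choice(local[current])
    return visits


def prune(chunks, ratio=4.0, n_steps=500, seed=0):
    """Keep about len(chunks) / ratio chunks, in their original order."""
    keep = max(1, math.ceil(len(chunks) / ratio))
    if keep >= len(chunks):
        return list(chunks)
    local, global_ = build_multiplex(chunks)
    visits = levy_walk_visits(local, global_, n_steps=n_steps, seed=seed)
    # Rank by visit count (assumed importance proxy); ties go to earlier chunks.
    ranked = sorted(range(len(chunks)), key=lambda i: (-visits[i], i))
    return [chunks[i] for i in sorted(ranked[:keep])]


if __name__ == "__main__":
    demo = [
        "the report covers revenue in europe",
        "weather was mild during the spring",
        "revenue in europe grew last quarter",
        "the office moved to a new floor",
        "asia revenue was flat last quarter",
        "lunch is served at noon every day",
        "the board approved the revenue forecast",
        "parking passes renew in june",
    ]
    print(prune(demo, ratio=4.0))
```

With eight chunks and ratio=4.0 the sketch keeps two chunks. The alpha parameter controls how heavy the tail is: smaller values produce more long jumps across the global layer, larger values keep the walk mostly local.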