Heretic: Automated Tool for Removing Censorship from Language Models
By melded
Summary
Heretic is an automated tool that removes censorship and safety alignment from transformer-based language models using directional ablation (abliteration) combined with TPE-based parameter optimization via Optuna. The tool works by co-minimizing refusal rates and KL divergence from the original model, resulting in decensored models that retain most of their original capabilities without expensive post-training.
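As a rough illustration of what directional ablation does to a weight matrix, the sketch below removes the component of a matrix's output that lies along a given direction, computing W' = W - scale * r r^T W for a unit vector r. It is a minimal pure-Python sketch, not Heretic's implementation: the function names, the list-of-rows matrix layout, and the scale parameter are assumptions made for illustration, and how the refusal direction is actually estimated is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ablate_direction(weight, direction, scale=1.0):
    """Return W - scale * r r^T W, where r is `direction` normalized.

    `weight` is a list of rows (d_out x d_in) and `direction` has length d_out.
    With scale=1.0 the outputs of the returned matrix have no component along r.
    `scale` is a hypothetical knob standing in for a tunable ablation parameter.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes into the direction r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract the rank-one component r (r^T W) from W.
    return [
        [weight[i][j] - scale * r[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    print(ablate_direction(w, [1.0, 0.0]))
```

With the direction set to the first output axis, the first row of the matrix is zeroed and the second row is left unchanged, which is the expected effect of projecting that axis out.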
Key quotes
Heretic is a tool that removes censorship (aka 'safety alignment') from transformer-based language models without expensive post-training.
It combines an advanced implementation of directional ablation, also known as 'abliteration' (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.
This approach enables Heretic to work completely automatically.
Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model.
This results in a decensored model that retains as much of the original model's capabilities as possible.
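The two quantities named in the quotes above can be sketched as a pair of objective values for a single candidate set of parameters. This is only an illustrative sketch under stated assumptions: the refusal markers, the function names, and the direction of the KL divergence (original relative to modified) are hypothetical choices, not taken from Heretic. The TPE search itself, which the source says is handled by Optuna, is not reproduced here.

```python
import math

# Hypothetical marker phrases; a real refusal detector would likely be richer.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")


def count_refusals(responses):
    """Count responses that contain any of the marker phrases."""
    return sum(
        any(marker in response.lower() for marker in REFUSAL_MARKERS)
        for response in responses
    )


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def objectives(responses, original_probs, modified_probs):
    """Return the pair (refusal count, KL divergence); both are to be minimized."""
    return (
        count_refusals(responses),
        kl_divergence(original_probs, modified_probs),
    )


if __name__ == "__main__":
    responses = ["I cannot help with that.", "Sure, here you go."]
    print(objectives(responses, [0.7, 0.3], [0.6, 0.4]))
```

A multi-objective optimizer would evaluate a pair like this for each candidate parameter set and prefer candidates that lower both values at once.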
