Heretic: Automated Tool for Removing Censorship from Language Models

melded

6mo ago · 7 min read · en · Code

Summary

Heretic is an automated tool that removes censorship (safety alignment) from transformer-based language models without expensive post-training. It combines directional ablation (abliteration) with a Tree-structured Parzen Estimator (TPE) parameter optimizer powered by Optuna. By co-minimizing the number of refusals and the KL divergence from the original model, it aims to produce a decensored model that retains as much of the original model's capabilities as possible.
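To make "directional ablation" concrete, here is a minimal sketch of the underlying linear algebra: subtracting from a vector its component along a chosen direction. It is a toy illustration on plain Python lists, not Heretic's implementation. The function names, the `weight` parameter, and the example vectors are all assumptions made for this sketch.

```python
import math


def normalize(direction):
    """Return `direction` scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in direction]


def ablate_direction(vector, direction, weight=1.0):
    """Subtract `weight` times the component of `vector` along `direction`.

    With weight=1.0 the result is orthogonal to `direction`.
    The `weight` knob is a hypothetical parameter for this sketch.
    """
    unit = normalize(direction)
    coeff = sum(v * u for v, u in zip(vector, unit))
    return [v - weight * coeff * u for v, u in zip(vector, unit)]


# Toy example: remove the x-axis component of a 2-D vector.
fully_ablated = ablate_direction([3.0, 4.0], [1.0, 0.0])
half_ablated = ablate_direction([3.0, 4.0], [1.0, 0.0], weight=0.5)
```

With the full weight the vector keeps only its component orthogonal to the direction. A fractional weight removes part of it, which is the kind of parameter an optimizer could tune.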

Key quotes

Heretic is a tool that removes censorship (aka 'safety alignment') from transformer-based language models without expensive post-training.

It combines an advanced implementation of directional ablation, also known as 'abliteration' (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.

This approach enables Heretic to work completely automatically.

Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model.

This results in a decensored model that retains as much of the original model's capabilities as possible.
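The two quantities being co-minimized can be sketched as a pair of numbers per candidate: a refusal count and a KL divergence between the original and modified models' output distributions. The snippet below is an illustrative sketch only. The direction of the KL divergence (original relative to modified), the `eps` guard, and the Pareto-style `dominates` comparison are assumptions; the quoted text does not say how Heretic combines the two objectives.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Here `p` stands in for the original model's token probabilities and
    `q` for the modified model's (an assumed direction for this sketch).
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def dominates(a, b):
    """True if candidate `a` is no worse than `b` on both objectives
    and strictly better on at least one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


original = [0.5, 0.5]
modified = [0.25, 0.75]
drift = kl_divergence(original, modified)

# Each candidate is (number_of_refusals, kl_divergence); values are made up.
candidate_a = (1, 0.1)
candidate_b = (2, 0.2)
candidate_c = (1, 0.3)
```

In this sketch `candidate_a` beats `candidate_b` on both objectives. `candidate_c` and `candidate_b` are a genuine trade-off, which is the situation a multi-objective optimizer has to navigate.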

Snippet from the RSS feed

Fully automatic censorship removal for language models - p-e-w/heretic
