danielhuber.dev@proton.me Sunday, April 5, 2026

Representation Reuse for Cost-Effective Safety Classifiers

How to build low-overhead jailbreak and harmful-content detectors by repurposing the internal activations of your existing model instead of running a separate classifier.


Running a separate safety classifier alongside your primary model is one of the most reliable ways to catch harmful or jailbroken inputs — but it comes with a real cost. Using a mid-sized model as a dedicated safety filter can add 20–30% to your per-token inference spend, which quickly becomes untenable at scale. A more practical approach exploits something you are already paying for: the rich internal representations the base model computes on every forward pass.

Why Separate Classifiers Are Expensive

A fully independent safety classifier is essentially a second model running in parallel. It processes the same input tokens, builds its own internal representations, and produces a classification signal. This is computationally clean — no coupling to the policy model — but wasteful. The policy model has already done the hard semantic work of understanding the input: encoding syntax, resolving coreferences, recognizing intent. A separate classifier largely replicates that work from scratch.

The cost scales with the size of the classifier you need. Smaller classifiers are cheaper but less accurate on adversarial inputs; larger classifiers close the accuracy gap but push overhead higher. You end up trading off safety coverage against inference budget, with no obvious equilibrium.

Representation Reuse: The Core Idea

Modern transformer models build progressively richer representations of their input across layers. By the middle-to-late layers, the residual stream encodes surprisingly high-level semantic distinctions — including, it turns out, distinctions relevant to safety. Rather than training a new model to rediscover these distinctions, you can attach a lightweight classifier directly to the activations the policy model already produces.

Two practical variants exist:

Linear probing fits a simple linear classifier (or a thin MLP) on top of frozen intermediate activations. The policy model runs its normal forward pass; you read off activations at one or more layers and feed them to the probe. The probe itself is negligibly cheap — a matrix multiply and a sigmoid. The only added cost is the minor overhead of extracting activations, which is essentially free compared to the forward pass itself.
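At inference time the probe really is just a dot product and a sigmoid. A minimal sketch, where `w` and `b` are the trained probe weights and bias (hypothetical names, with an illustrative hidden size):

```python
import numpy as np

def probe_score(hidden_state, w, b):
    """Apply a linear probe to one residual-stream activation vector.

    hidden_state: (hidden_dim,) activation read from the policy model
    w: (hidden_dim,) learned probe weights; b: scalar bias
    Returns the estimated probability that the input is harmful.
    """
    logit = hidden_state @ w + b          # one dot product
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid

# Illustrative call with a random 4096-dim activation
rng = np.random.default_rng(0)
score = probe_score(rng.normal(size=4096), rng.normal(size=4096) * 0.01, 0.0)
```

Compared to the billions of FLOPs in the forward pass that produced `hidden_state`, this is effectively free.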

Partial fine-tuning goes one step further: instead of freezing all weights, you allow the final transformer block (attention + MLP) to be retrained for classification while sharing the full backbone with the policy model. During inference the model runs its normal forward pass up to the penultimate block, then branches: the original final block produces the next-token logits, and the fine-tuned final block produces a safety score. You pay for one extra transformer block rather than a whole separate model.
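Structurally, the branch point looks like the following sketch. The module names (`backbone`, `lm_block`, `safety_block`) are illustrative stand-ins, not the paper's implementation: the backbone and original final block are shared and frozen, while the safety branch is a fine-tuned copy of the final block feeding a small classification head.

```python
import torch
import torch.nn as nn

class BranchedSafetyHead(nn.Module):
    """Share the backbone, branch at the final transformer block.

    backbone: layers 0..N-2 of the policy model (frozen, shared)
    lm_block: the original final block, producing next-token states
    safety_block: a fine-tuned copy of the final block for safety
    """
    def __init__(self, backbone, lm_block, safety_block, hidden_dim):
        super().__init__()
        self.backbone = backbone
        self.lm_block = lm_block
        self.safety_block = safety_block
        self.safety_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = self.backbone(x)         # shared computation, paid once
        lm_out = self.lm_block(h)    # next-token path (unchanged)
        s = self.safety_block(h)     # safety path: one extra block
        safety_logit = self.safety_head(s[:, -1, :])  # last token
        return lm_out, safety_logit
```

The extra cost per request is exactly one transformer block plus a linear head, independent of how deep the shared backbone is.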

Note

Both methods assume the policy model’s backbone stays frozen (or nearly so): if the backbone weights are later updated, the probe must be revalidated or retrained against the new activations. If you are serving the policy model anyway, the marginal cost of a linear probe is close to zero, since you are reading activations that are already computed and sitting in GPU memory.

Architecture of a Two-Stage Pipeline

For the highest accuracy at the lowest average cost, you can combine a cheap probe with a more powerful dedicated classifier in a cascade. The probe acts as a fast first-stage filter: inputs it classifies as clearly safe are passed directly to the agent; inputs it flags are escalated to the heavier classifier for a final verdict.

Incoming prompt
        │
        ▼
┌─────────────────────┐
│  Policy model       │
│  forward pass       │
│                     │
│  [Layer N-1 acts]──►│──► Linear probe ──► SAFE ──► Agent response
│                     │          │
│  [Full output]      │       FLAGGED
└─────────────────────┘          │
                                 ▼
                     ┌───────────────────────┐
                     │  Dedicated classifier │
                     │  (fully fine-tuned)   │
                     └───────────┬───────────┘
                                 │
                     SAFE ◄──────┴──────► BLOCK

The cascade exploits the fact that most production traffic is not adversarial. If 95% of inputs are clearly benign and the probe correctly passes them through, the expensive second-stage classifier only runs on the remaining 5%. Overall system cost drops by an order of magnitude while the accuracy on the hard cases is preserved by the stronger model.
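The routing logic itself is a threshold comparison. A sketch, where `probe_score_fn` and `heavy_classifier_fn` are placeholder callables returning a harmfulness probability in [0, 1], and the threshold is a tunable operating point rather than a value from the source:

```python
def cascade_verdict(prompt, probe_score_fn, heavy_classifier_fn,
                    escalation_threshold=0.2):
    """Two-stage cascade: cheap probe first, heavy classifier on escalation.

    Expected per-request cost ≈ c_probe + escalation_rate * c_heavy,
    so a low escalation rate dominates the overall cost.
    """
    p = probe_score_fn(prompt)
    if p < escalation_threshold:
        return "SAFE"  # the bulk of benign traffic exits here
    # Escalate: the dedicated classifier gets the final verdict
    return "BLOCK" if heavy_classifier_fn(prompt) >= 0.5 else "SAFE"
```

The escalation threshold sets the trade-off: lowering it escalates more traffic (higher cost, fewer probe misses), raising it saves cost but leans harder on the probe's accuracy.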

Implementation Considerations

To train a linear probe for safety classification, you need labeled activation data. The most practical approach mirrors how dedicated classifiers are trained: generate synthetic harmful prompts and benign prompts using your data pipeline, run them through the frozen policy model, collect activations at the target layer(s), and fit a logistic regression or small MLP.

import torch
from sklearn.linear_model import LogisticRegression
import numpy as np

def collect_activations(model, tokenizer, prompts, layer_index, device):
    """Extract residual stream activations at a specific layer."""
    activations = []
    hooks = []

    def hook_fn(module, input, output):
        # HF decoder layers return a tuple; output[0] is the hidden
        # states with shape (batch, seq_len, hidden_dim). Keep the
        # last-token activation as the sequence-level representation.
        activations.append(output[0][:, -1, :].detach().cpu().float())

    # Llama-style module path; adjust for other architectures
    target_layer = model.model.layers[layer_index]
    hooks.append(target_layer.register_forward_hook(hook_fn))

    with torch.no_grad():
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            model(**inputs)

    for h in hooks:
        h.remove()

    return torch.cat(activations, dim=0).numpy()

# Collect activations for harmful and benign examples
harmful_acts = collect_activations(model, tokenizer, harmful_prompts, layer_index=28, device="cuda")
benign_acts  = collect_activations(model, tokenizer, benign_prompts,  layer_index=28, device="cuda")

X = np.vstack([harmful_acts, benign_acts])
y = np.array([1] * len(harmful_acts) + [0] * len(benign_acts))

probe = LogisticRegression(max_iter=1000)
probe.fit(X, y)

Layer selection matters. Probes on early layers capture surface syntax and miss semantic intent; probes on the very final layer are often dominated by output-token prediction signals. In practice, layers in the range of 60–80% of total depth tend to give the best classification signal for content-level safety distinctions. Running a small sweep across candidate layers during training is worth the investment.
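The sweep itself is straightforward once activations have been collected per layer. A sketch, assuming `acts_by_layer` maps each candidate layer index to an activation matrix (e.g. from running `collect_activations` once per layer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sweep_layers(acts_by_layer, y, cv=5):
    """Pick the layer whose activations best linearly separate the labels.

    acts_by_layer: dict mapping layer index -> (n_samples, hidden_dim)
    y: (n_samples,) binary labels (1 = harmful, 0 = benign)
    Returns the best layer index and per-layer cross-validated accuracy.
    """
    scores = {}
    for layer, X in acts_by_layer.items():
        clf = LogisticRegression(max_iter=1000)
        scores[layer] = cross_val_score(clf, X, y, cv=cv).mean()
    best = max(scores, key=scores.get)
    return best, scores
```

Cross-validation matters here: a single train/test split at one layer can make a noisy layer look spuriously good, and the sweep's whole purpose is to rank layers reliably.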

Tip

If your policy model uses key-value caching, you can often extract probe activations from the prefill pass at negligible marginal cost — the activations are computed anyway during KV-cache population, so you only need to route them to your probe before they are discarded.
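One way to wire this up is a forward hook that applies the probe during the prefill pass itself. A sketch, assuming a Llama-style `model.model.layers` module path (an assumption about your architecture) and a trained probe weight vector and bias:

```python
import torch

def attach_probe_to_prefill(model, layer_index, probe_w, probe_b, scores_out):
    """Score prefill activations with a probe before they are discarded.

    Registers a forward hook on one decoder layer; when the prefill
    forward pass populates the KV cache, the hook applies the probe to
    the last-token activation at negligible extra cost. Returns the
    hook handle so the caller can remove it later.
    """
    def hook(module, inputs, output):
        # output[0]: hidden states (batch, seq_len, hidden_dim)
        h = output[0][:, -1, :].float()       # last prompt token
        logit = h @ probe_w + probe_b
        scores_out.append(torch.sigmoid(logit).item())

    return model.model.layers[layer_index].register_forward_hook(hook)
```

The safety score is then available as soon as prefill finishes, before any response tokens are generated, which is exactly when a cascade needs it.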

Limitations and Adaptive Adversaries

The key caveat with representation-reuse classifiers is adversarial robustness. A dedicated classifier trained on a fully independent architecture is harder to attack simultaneously with the policy model. A probe that reads from the policy model’s own activations is, in principle, vulnerable to adversarial inputs crafted specifically to suppress the safety signal in those activations while still eliciting harmful outputs.

This is not a reason to avoid the approach — the efficiency gains are substantial and most real-world jailbreak attempts are not adaptive in this sense. But it does mean that representation-reuse classifiers are best deployed alongside, not as a complete replacement for, a periodically retrained dedicated classifier. For your highest-risk use cases, treat the probe as a first-stage filter and preserve a full classifier as the backstop. For moderate-risk deployments where cost dominates, a well-validated probe alone may be sufficient, provided you run periodic red-team evaluations with adversaries who have white-box access to the probe’s architecture.

Representation reuse is ultimately a systems engineering trade-off: you are exchanging a small amount of robustness margin against a large reduction in inference cost. Quantifying that trade-off precisely — for your specific threat model, your traffic distribution, and your cost constraints — is the engineering work that makes the approach production-ready.

Tags: safety, guardrails, classifiers, inference, fine-tuning, linear-probing

This article is an AI-generated summary. Read the original paper: Cost-Effective Constitutional Classifiers via Representation Re-use.