Representation Bending

Overview

Representation Bending is a method for modifying internal representations of LLMs to improve safety while maintaining general capabilities.

This repo contains the code and models for the paper "Representation Bending for Large Language Model Safety".

Paper

Representation Bending for Large Language Model Safety

Authors: Ashkan Yousefpour, Taeheon Kim, Ryan S Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi

arXiv: Link

Idea: Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, remain vulnerable because they address specific threats, often fail to generalize to unseen attacks, or require manual system-level defenses. RepBend is a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering – simple vector arithmetic for steering a model's behavior during inference – into loss-based fine-tuning. It achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rate across diverse jailbreak benchmarks.
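To make the "activation steering as a loss" idea concrete, below is a minimal, hypothetical PyTorch sketch of a representation-bending style objective. It is not the exact loss used in the paper: the function name, the specific distance terms, and the alpha/beta weights are illustrative assumptions. It only captures the intuition that fine-tuning bends hidden states away from the model's original representations on harmful inputs while keeping them close on benign inputs.

# Illustrative sketch only: a simplified representation-bending style loss.
# The actual RepBend objective and its alpha/beta/gamma/epsilon weighting are
# defined by the training code and the paper; everything below is an assumption.
import torch
import torch.nn.functional as F

def repbend_style_loss(h_harmful, h_harmful_ref, h_safe, h_safe_ref,
                       alpha=1.0, beta=1.0):
    # Push hidden states on harmful inputs away from the frozen reference
    # model's hidden states (disrupt the representations behind harmful behavior).
    push_away = F.cosine_similarity(h_harmful, h_harmful_ref, dim=-1).mean()
    # Keep hidden states on benign inputs close to the reference model's,
    # so general capabilities are retained.
    stay_close = torch.norm(h_safe - h_safe_ref, dim=-1).mean()
    return alpha * push_away + beta * stay_close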

Interesting Results

RepBend achieves the lowest average Attack Success Rate (ASR) across five black-box and three white-box attacks on the Mistral 7B and Llama 3 8B models.

Heatmap cells show the next-token prediction, and colors show entropy (blue: high confidence, red: low confidence) across layers (Y-axis) for tokens (X-axis). (a) The original instruction-tuned Llama 3 8B model complies with the request. (b) RepBend refuses the request with high certainty (blue heatmap at the top). (c) Even when a complying sequence is forced, RepBend's representations diverge and it generates random tokens.

Pretrained Models:

Setup

Prerequisites

Ensure your NVIDIA driver supports CUDA 11.6 or later by running:
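nvidia-smi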

Installation

Set up the environment with the following commands:

conda create -n safety python==3.10.14
conda activate safety
pip install transformers==4.40.0
pip install torch==2.3.1 torchvision xformers==0.0.27 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install peft==0.11.1 bitsandbytes pandas opencv-python timm torch_optimizer easydict pycocoevalcap sentencepiece protobuf trl==0.8.6 deepspeed==0.14.0 numpy==1.26.4 accelerate==0.29.3 jsonlines

If the flash-attn installation fails because the CUDA toolkit is missing, install it with:

conda install -c conda-forge cudatoolkit-dev -y

Training Representation Bending

Dataset

Download the dataset used for training: wildjailbreak.jsonl
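To sanity-check the download, you can read the file with the jsonlines package installed above. This is only a quick inspection snippet; it assumes wildjailbreak.jsonl is in the current directory and that each line is a JSON object, and it does not assume any particular field names.

# Quick sanity check of the downloaded training data (illustrative, not part of train.sh).
import jsonlines

with jsonlines.open("wildjailbreak.jsonl") as reader:
    first = next(iter(reader))  # read the first record
print(sorted(first.keys()))     # inspect which fields the dataset provides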

Running the Training Script

Run the training script using:

sbatch train.sh  # For Slurm

OR

bash train.sh  # For local execution

Training Arguments

Argument | Description
alpha, beta, gamma, epsilon | Coefficients for each loss term
target_layer_start_idx | Start index of target layers for representation modification
layers_window_size | Number of layers to modify (target_layer_start_idx to target_layer_start_idx + layers_window_size)
transform_layers | Layers where LoRA modules are attached (-1 for all layers)
max_step | Number of training steps
alpha_mode | "all": computes safe loss using all layers; "target": computes safe loss using target layers only
loss_mode | Determines which token representations contribute to loss calculation: prompt_last (last token of input prompts), prompt_all (all tokens of input prompts), prompt_response (all tokens of prompts & responses), response_all (all tokens of responses)
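The exact flags accepted by train.sh are defined in the script itself; the dictionary below is only a hypothetical example of how the arguments in the table fit together. All values are placeholders for illustration, not the settings used in the paper.

# Hypothetical configuration illustrating the training arguments above.
# Values are placeholders, not the paper's settings.
config = {
    "alpha": 1.0, "beta": 1.0, "gamma": 1.0, "epsilon": 1.0,  # loss-term coefficients
    "target_layer_start_idx": 12,   # first layer whose representations are modified
    "layers_window_size": 8,        # modify layers 12 through 12 + 8
    "transform_layers": -1,         # attach LoRA modules to all layers
    "max_step": 300,                # number of training steps
    "alpha_mode": "target",         # compute safe loss on target layers only
    "loss_mode": "prompt_last",     # use the last token of each input prompt
}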

Evaluation

Once training is complete, evaluate the model using the AI2 Safety Tool.


Citation

If you use Representation Bending in your work, please cite:

@inproceedings{repbend,
  title={Representation Bending for Large Language Model Safety},
  author={Yousefpour, Ashkan and Kim, Taeheon and Kwon, Ryan S and Lee, Seungbeen and Jeung, Wonje and Han, Seungju and Wan, Alvin and Ngan, Harrison and Yu, Youngjae and Choi, Jonghyun},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  month = {July},
  year={2025}
}

For any questions or issues, feel free to open an issue on this repository!
