Representation Bending

Overview

Representation Bending is a method for modifying internal representations of LLMs to improve safety while maintaining general capabilities.

This repo contains the code and models for the paper "Representation Bending for Large Language Model Safety".

Paper

Representation Bending for Large Language Model Safety

Authors: Ashkan Yousefpour, Taeheon Kim, Ryan S Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi

arXiv: Link

Idea: Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, remain vulnerable because they address specific threats, often fail to generalize to unseen attacks, or require manual system-level defenses. RepBend is a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering – simple vector arithmetic for steering a model's behavior during inference – into loss-based fine-tuning. It achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rate across diverse jailbreak benchmarks.
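To make the "activation steering as a loss" idea concrete, below is a minimal, hypothetical PyTorch sketch of a representation-bending style objective. It is not the exact loss used in the paper: the function name, the specific distance terms, and the alpha/beta weights are illustrative assumptions. It only captures the intuition that fine-tuning bends hidden states away from the model's original representations on harmful inputs while keeping them close on benign inputs.

# Illustrative sketch only: a simplified representation-bending style loss.
# The actual RepBend objective and its alpha/beta/gamma/epsilon weighting are
# defined by the training code and the paper; everything below is an assumption.
import torch
import torch.nn.functional as F

def repbend_style_loss(h_harmful, h_harmful_ref, h_safe, h_safe_ref,
                       alpha=1.0, beta=1.0):
    # Push hidden states on harmful inputs away from the frozen reference
    # model's hidden states (disrupt the representations behind harmful behavior).
    push_away = F.cosine_similarity(h_harmful, h_harmful_ref, dim=-1).mean()
    # Keep hidden states on benign inputs close to the reference model's,
    # so general capabilities are retained.
    stay_close = torch.norm(h_safe - h_safe_ref, dim=-1).mean()
    return alpha * push_away + beta * stay_close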

Interesting Results

RepBend achieves the lowest average Attack Success Rate (ASR) across five black-box and three white-box attacks on the Mistral 7B and Llama 3 8B models.

Heatmap cells show the next-token prediction, and colors show entropy (blue: high confidence, red: low confidence) across layers (Y-axis) for tokens (X-axis). (a) The original instruction-tuned Llama 3 8B model complies with the request. (b) RepBend refuses the request with high certainty (blue heatmap at the top). (c) Even when a complying sequence is forced, RepBend's representations diverge and it generates random tokens.

Pretrained Models:

Setup

Prerequisites

Ensure your NVIDIA driver supports CUDA 11.6 or later by running:
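nvidia-smi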

Installation

Set up the environment with the following commands:

conda create -n safety python==3.10.14
conda activate safety
pip install transformers==4.40.0
pip install torch==2.3.1 torchvision xformers==0.0.27 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install peft==0.11.1 bitsandbytes pandas opencv-python timm torch_optimizer easydict pycocoevalcap sentencepiece protobuf trl==0.8.6 deepspeed==0.14.0 numpy==1.26.4 accelerate==0.29.3 jsonlines

If the flash-attn installation fails because the CUDA toolkit is missing, install it with:

conda install -c conda-forge cudatoolkit-dev -y

Training Representation Bending

Dataset

Download the dataset used for training: wildjailbreak.jsonl
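To sanity-check the download, you can read the file with the jsonlines package installed above. This is only a quick inspection snippet; it assumes wildjailbreak.jsonl is in the current directory and that each line is a JSON object, and it does not assume any particular field names.

# Quick sanity check of the downloaded training data (illustrative, not part of train.sh).
import jsonlines

with jsonlines.open("wildjailbreak.jsonl") as reader:
    first = next(iter(reader))  # read the first record
print(sorted(first.keys()))     # inspect which fields the dataset provides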

Running the Training Script

Run the training script using:

sbatch train.sh  # For Slurm

OR

bash train.sh  # For local execution

Training Arguments

Argument | Description
alpha, beta, gamma, epsilon | Coefficients for each loss term
target_layer_start_idx | Start index of target layers for representation modification
layers_window_size | Number of layers to modify (target_layer_start_idx to target_layer_start_idx + layers_window_size)
transform_layers | Layers where LoRA modules are attached (-1 for all layers)
max_step | Number of training steps
alpha_mode | "all": computes safe loss using all layers; "target": computes safe loss using target layers only
loss_mode | Determines which token representations contribute to loss calculation: prompt_last (last token of input prompts), prompt_all (all tokens of input prompts), prompt_response (all tokens of prompts & responses), response_all (all tokens of responses)
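The exact flags accepted by train.sh are defined in the script itself; the dictionary below is only a hypothetical example of how the arguments in the table fit together. All values are placeholders for illustration, not the settings used in the paper.

# Hypothetical configuration illustrating the training arguments above.
# Values are placeholders, not the paper's settings.
config = {
    "alpha": 1.0, "beta": 1.0, "gamma": 1.0, "epsilon": 1.0,  # loss-term coefficients
    "target_layer_start_idx": 12,   # first layer whose representations are modified
    "layers_window_size": 8,        # modify layers 12 through 12 + 8
    "transform_layers": -1,         # attach LoRA modules to all layers
    "max_step": 300,                # number of training steps
    "alpha_mode": "target",         # compute safe loss on target layers only
    "loss_mode": "prompt_last",     # use the last token of each input prompt
}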

Evaluation

Once training is complete, evaluate the model using the AI2 Safety Tool.


Citation

If you use Representation Bending in your work, please cite:

@inproceedings{repbend,
  title={Representation Bending for Large Language Model Safety},
  author={Yousefpour, Ashkan and Kim, Taeheon and Kwon, Ryan S and Lee, Seungbeen and Jeung, Wonje and Han, Seungju and Wan, Alvin and Ngan, Harrison and Yu, Youngjae and Choi, Jonghyun},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  month = {July},
  year={2025}
}

For any questions or issues, feel free to open an issue on this repository!
