
Control Tax

This repository contains supplementary code for the paper "Control Tax: The Price of Keeping AI in Check."

Website: https://mikhailterekhov.github.io/control-tax/

Paper: https://arxiv.org/abs/2506.05296

Below you can find the instructions to reproduce the main results from the paper. They were tested on a fresh Ubuntu 22.04 container with root access and torch pre-installed. Note that a full reproduction requires a significant amount of API credits (>$1000), but the scripts allow for subsampling.

The full dataset of backdoored code used in the paper is located at data/backdoor_code_full.pkl

Prerequisites

  1. Install the dependencies from the requirements.txt file. In a suitable environment (conda, pip, or global), run
pip install -r requirements.txt

The repository makes heavy use of the Inspect framework for LLM evaluation.

  2. Set up your API keys for the models you want to use as environment variables. Put them in a .env file at the root of this repository; this makes them visible to Inspect. Follow the structure in .env.example.

  3. To run the plotting scripts, one also needs to install TeX:

apt update
apt install -y texlive texlive-latex-extra texlive-fonts-recommended dvipng cm-super

TeX packages will prompt you for a geographic area during installation; you can pick any.
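For step 2, a minimal .env might look like the following (the variable names are the standard provider keys that Inspect picks up; the actual required structure is given in .env.example, and the key values here are placeholders):

```shell
# .env — API keys for the model providers you plan to use
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

Only include keys for the providers you actually run.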

Backdoor Generation

To generate backdoors, run the corresponding Inspect eval:

inspect eval backdoors/backdoor_eval.py --model openai/gpt-4o-mini,openai/gpt-4o

If you require a different set of models, adjust accordingly. This will run the evaluation on the full filtered APPS dataset. If you'd like to run on a subset, add a task parameter, e.g. -T num_samples_per_difficulty=5.

To open the logs in an interactive viewer, run inspect view, then open localhost:7575 in a browser to browse the logs. If you are on a remote machine, you will need to forward the corresponding port.

Next, we will want to get the code out of the log files. When the inspect eval finishes, it reports the locations of the Inspect log files it generated, one for each model. To extract a dataset of backdoored code samples from them, use extract_backdoor_code.py:

python backdoors/extract_backdoor_code.py --pickle /path/to/inspect/log/1.eval /path/to/inspect/log/2.eval

The script will report the name of the file with the data that was created.
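To sanity-check the extracted dataset, you can load the pickle in Python. A minimal sketch is below; the demo record and its field names (problem_id, code) are illustrative stand-ins, not the actual schema, so inspect the real objects to see their structure:

```python
import pickle

def load_backdoor_dataset(path):
    """Load a dataset pickle, e.g. the one produced by extract_backdoor_code.py."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Demonstrate the round trip with a stand-in record (hypothetical schema).
demo = [{"problem_id": 1, "code": "print('hi')"}]
with open("demo.pkl", "wb") as f:
    pickle.dump(demo, f)

samples = load_backdoor_dataset("demo.pkl")
print(len(samples))  # → 1
```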

Plotting the results

The backdoor result distribution plot can be generated by running

python backdoors/plot_backdoor_results.py /path/to/inspect/log/1.eval /path/to/inspect/log/2.eval

Note that this takes in the original Inspect logs, not the generated pickle file. Note also that this function saves to and loads from a cache file to save time; this cache file is also important for strategy optimization below. If you'd like to plot multiple different sets of attacker/monitor pairs, use the --cache_suffix command line argument to create separate cache files. If you rerun the plotting script without it, it will silently plot the old data!

Running the monitors

With the pickle file with backdoors at hand, you can run the monitor models:

inspect eval monitors/inspect_monitor.py --model openai/gpt-4o-mini,openai/gpt-4o -T dataset_path=path/to/dataset.pkl

You'll need to put in the path that was given to you by extract_backdoor_code.py here.

Plotting the results

To plot the results, you can run

python monitors/plot_numeric_monitor_performance.py /path/to/inspect/monitor/log/1.eval /path/to/inspect/monitor/log/2.eval --hist_attackers openai/gpt-4o openai/gpt-4o-mini --hist_monitors openai/gpt-4o openai/gpt-4o-mini 

You will have to put in the actual logs reported to you by Inspect at the previous step. The order of arguments is important! Note that this function saves to and loads from a cache file to save time; this cache file is also important for strategy optimization below. If you'd like to plot multiple different sets of attacker/monitor pairs, use the --cache_suffix command line argument to create separate cache files. If you rerun the plotting script without it, it will silently plot the old data!

Strategy optimization

To get meaningful results at this step, you will likely need larger sample sizes (N>=100 successful backdoors per attacker). You will also need to have run the plotting scripts for backdooring and monitoring, since those scripts generate the cache files needed here. We ran the optimization on an A100 GPU with 80 GB of memory; for smaller memory sizes you will need to adjust the default parameters as explained below.

P[red team win] against p_b

If you did not provide the --cache_suffix during plotting, you can run plot_pb.py to generate the (P_w(p_b)) dependence with the default parameters:

python protocols/plot_pb.py

You can adjust the attacker and monitor models used for this plot by providing the --attacker and --monitor args. If you're running into OOM issues, you can reduce the number of (p_b) values and the (N) values that are used, as well as the interpolation resolution for the CDFs, e.g. --num_pb_points 20 --num_kde_scores 100 --N 10 100 1000.

Optimal strategies and P[blue team win] against c

Like the one above, this script relies on the backdooring and monitoring plotting cache files. To generate the data and several plots, you can run

python protocols/plot_strategies.py --num_c_vals 11

Note that this script will also print the (p_+) values computed from the backdoor success statistics. Collecting the data may take some time. You can add --use_limit_approximation to compute based on the limit approximation from Section 4.2 rather than based on the exact finite (N) expression (this should also speed up the process). If you run into OOM issues, you can reduce some parameters, e.g. --num_processes 1 --num_kde_scores 100 --horizon 200. However, please bear in mind that lower discretisation rates (--num_kde_scores) may lead to numerical instabilities. On completion, the script will save two plots: one for the optimal blue team strategies and one for the dependency of P[blue team win] on c.

Sequential anytime monitoring

To get the plots for anytime monitoring safety, you can run

python protocols/plot_anytime.py

The script relies on the same backdooring and monitoring caches. You can adjust the attacker and monitor models (keeping in mind that you should first generate data that includes this attacker/monitor pair) with the --attacker and --monitor flags.

Model usage and cost statistics

We computed the costs of running the monitors using utilities in the total_model_usage.py file. To get statistics for a monitor run, including per-sample costs, you can run

python total_model_usage.py /path/to/inspect/monitor/log.eval

where you will need to provide the Inspect logs from running your monitors. Note that API costs change frequently, so the costs dict in the script that hard-codes them may become obsolete.
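The cost computation itself amounts to multiplying token counts by per-token prices. A minimal sketch of this idea is below; the prices and the usage_cost helper are hypothetical placeholders, not the script's actual values, so always check current provider pricing:

```python
# Hypothetical per-million-token prices in USD (placeholders, not current pricing).
COSTS = {
    "openai/gpt-4o":      {"input": 2.50, "output": 10.00},
    "openai/gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def usage_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one model's token usage."""
    price = COSTS[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1e6

print(round(usage_cost("openai/gpt-4o-mini", 1_000_000, 100_000), 4))  # → 0.21
```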

Licence

This project is licensed under the MIT licence. Details can be found in the LICENSE file in the root directory of the project.
