COMPASS is a framework for evaluating policy alignment: given only an organization’s policy (e.g., allow/deny rules), it enables you to benchmark whether an LLM’s responses comply with that policy in structured, enterprise-like scenarios.
This repository provides tools to:
- Define a custom policy for your organization.
- Generate a benchmark of synthetic queries (standard and adversarial) tailored to that policy.
- Evaluate LLMs on how well they adhere to your rules.
For reproducing the experiments from the paper (including RAG scenarios and full-scale results), please see REPRODUCE.md.
```bash
conda create -n compass python=3.11
conda activate compass
pip install -r requirements.txt
```

Set up your API keys in `.env` (see `.env.sample`). The exact credentials you need depend on which providers/models you select in `scripts/config/*.yaml` (for synthesis, evaluation, and judging).
```bash
cp .env.sample .env
# Edit .env to add your keys
```

- OpenAI judge (default): `OPENAI_API_KEY` (used by `scripts/response_judge.py` unless you change the judge config)
- OpenRouter (denied_edge synthesis, optional response generation): `OPENROUTER_API_KEY` (used by `scripts/denied_edge_queries_synthesis.py`)
- Vertex / Anthropic Vertex (allowed_edge synthesis, optional):
  - If you use Claude on Vertex via `anthropic[vertex]`, set up Google Cloud credentials (e.g., `GOOGLE_APPLICATION_CREDENTIALS`) and ensure your Vertex project/region are configured in the YAML files under `scripts/config/`.
  - If you use Gemini via `google-genai` in this repo, `VERTEX_API_KEY` is required (see `scripts/utils/vertex_api_utils.py`).
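For reference, a filled-in `.env` might look like the sketch below (placeholder values only; set just the keys your chosen providers require):

```bash
# Example .env (placeholders; include only the keys you actually need)
OPENAI_API_KEY=your-openai-key
OPENROUTER_API_KEY=your-openrouter-key
VERTEX_API_KEY=your-vertex-key
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-gcp-service-account.json
```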
Tip: If you are starting from scratch, you can run base query generation + verification + response evaluation first, and add edge/RAG workflows later.
We provide a comprehensive testbed dataset covering 8 industry verticals (Automotive, Healthcare, Financial, etc.) generated using COMPASS. You can access the Testbed Dataset on Hugging Face:
👉 AIM-Intelligence/COMPASS-Policy-Alignment-Testbed-Dataset
This dataset serves as a reference for what COMPASS generates and allows you to test models immediately without generating your own data.
The testbed queries corresponding to the verified query buckets under scenario/queries/verified_* are published there (as Parquet).
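As one way to grab the published Parquet files locally, you can use the Hugging Face CLI (a sketch assuming `huggingface_hub` is installed; see the dataset card for the exact file layout):

```bash
# Download a snapshot of the testbed dataset into the local Hugging Face cache
pip install -U huggingface_hub
huggingface-cli download AIM-Intelligence/COMPASS-Policy-Alignment-Testbed-Dataset --repo-type dataset
```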
Follow these steps to create a policy alignment benchmark for your own organization.
To build a custom benchmark and evaluate responses, you typically provide:
- Policy + Context: required for query generation.
- System prompt: required for response generation (evaluation).
1. Policy File (scenario/policies/MyOrg.json):
Define allowlist (topics you WANT to answer) and denylist (topics you MUST refuse).
```json
{
"allowlist": {
"product_support": "Technical support and usage guidelines for MyOrg's software products, including installation, debugging, and API usage.",
"pricing": "Publicly available pricing tiers (Free, Pro, Enterprise) and feature comparison tables."
},
"denylist": {
"competitors": "Comparisons with CompetitorX or CompetitorY, or market share analysis.",
"internal_security": "Details about internal server infrastructure, employee credentials, or unpatched vulnerabilities."
}
}
```

2. Context File (scenario/contexts/MyOrg.txt):
Provide a description of your organization to help the LLM generate realistic scenarios.
```
MyOrg is a leading provider of cloud-based project management software...
```
3. System Prompt File (scenario/system_prompts/MyOrg.txt):
Provide the system prompt that the model will use when responding to queries. You can write any prompt you want the model to follow.
```
You are a helpful assistant for MyOrg. You must strictly follow the company's content policies...
```
Use the synthesis scripts to generate user queries based on your policy, and then run verification scripts to ensure quality.
Note: The synthesis scripts enumerate all `scenario/policies/*.json` files by default.

Recommended (to run a single custom org safely): work in a separate branch/copy, and temporarily keep only these three files for your org:

- `scenario/policies/MyOrg.json`
- `scenario/contexts/MyOrg.txt`
- `scenario/system_prompts/MyOrg.txt`

This is the most reliable way to avoid accidental API calls for other scenarios. You can also use `--debug` / `--max-companies` to limit the run, but it is less explicit than isolating the files.
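For illustration, one possible way to do this isolation on a scratch branch (a sketch; the filenames assume the MyOrg examples above, and the `find` commands delete every other scenario's files on that branch):

```bash
# On a throwaway branch, keep only MyOrg's policy, context, and system prompt
git checkout -b myorg-only
find scenario/policies -name '*.json' ! -name 'MyOrg.json' -delete
find scenario/contexts -name '*.txt' ! -name 'MyOrg.txt' -delete
find scenario/system_prompts -name '*.txt' ! -name 'MyOrg.txt' -delete
```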
1. Generate Standard Queries (Base):
```bash
python scripts/base_queries_synthesis.py
```

This generates standard questions for both allowlist and denylist topics.
2. Verify Base Queries:
```bash
python scripts/base_queries_verification.py
```

This validates the generated queries and saves the approved ones to scenario/queries/verified_base/.
3. Generate Edge Cases (Adversarial/Borderline):
```bash
python scripts/allowed_edge_queries_synthesis.py
python scripts/denied_edge_queries_synthesis.py
```

- `allowed_edge`: Tricky questions that seem risky but should be answered.
- `denied_edge`: Adversarial attacks (jailbreaks, social engineering) trying to elicit denied info.
Prerequisites:
- `allowed_edge_queries_synthesis.py` uses Vertex utilities (see `scripts/utils/vertex_api_utils.py`), so make sure your Vertex/GCP auth is configured.
- `denied_edge_queries_synthesis.py` calls OpenRouter, so you need `OPENROUTER_API_KEY`.
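For the Vertex/GCP side, two common ways to provide credentials are sketched below (assumptions: you have the `gcloud` CLI or a service-account key file; your org's auth flow may differ):

```bash
# Option 1: application-default credentials via the gcloud CLI
gcloud auth application-default login

# Option 2: point to a service-account key file
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-gcp-service-account.json
```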
4. Verify Edge Cases:
```bash
python scripts/allowed_edge_queries_verification.py
python scripts/denied_edge_queries_verification.py
```

Validated queries are saved to scenario/queries/verified_allowed_edge/ and scenario/queries/verified_denied_edge/.
- Generate Responses: Run your target LLM against the generated queries. You must specify the model, company, and query type. The script will automatically load the verified queries.

  ```bash
  # Example for OpenRouter models
  python scripts/response_generation_openrouter.py \
    --model "openai/gpt-4-turbo" \
    --company "MyOrg" \
    --query_type "base"
  ```

  (Run separately for `base`, `allowed_edge`, and `denied_edge`; see the loop sketch after this list.)
- Judge Compliance: Use an LLM-as-a-Judge to score the responses.

  ```bash
  python scripts/response_judge.py "response_results" -n 5
  ```
- Analyze Results:

  ```bash
  python scripts/analyze_judged_results.py --target-directory judge_results
  ```
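To cover all three query types in one pass, a simple loop over `--query_type` works (a sketch reusing exactly the flags shown above; swap in your own model and company):

```bash
# Generate responses for every verified query bucket of one org
for qt in base allowed_edge denied_edge; do
  python scripts/response_generation_openrouter.py \
    --model "openai/gpt-4-turbo" \
    --company "MyOrg" \
    --query_type "$qt"
done
```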
- `scenario/`: Your input data (policies, contexts) and generated benchmarks.
  - `policies/`: Put your JSON policy here.
  - `contexts/`: Put your company description TXT here.
  - `system_prompts/`: Put your system prompt TXT here.
  - `queries/`: Generated benchmark data.
- `scripts/`: Tools for synthesis and evaluation.
- `results/`: Output from model runs and evaluations.
If you use COMPASS in your research, please cite:
```bibtex
@misc{choi2026compass,
title={COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs},
author={Dasol Choi and DongGeon Lee and Brigitta Jesica Kartono and Helena Berndt and Taeyoun Kwon and Joonwon Jang and Haon Park and Hwanjo Yu and Minsuk Kahng},
year={2026},
eprint={2601.01836},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.01836},
}
```