pairwiseLLM is an R package that provides a unified, extensible
framework for generating, submitting, and modeling pairwise comparisons
of writing quality using large language models (LLMs).
It includes:
- Unified live and batch APIs across OpenAI, Anthropic, and Gemini
- An adaptive pairing workflow to run optimal pairs of writing samples until reliability targets are met
- A prompt template registry with tested templates designed to reduce positional bias
- Positional-bias diagnostics (forward vs reverse design)
- Bradley–Terry (BT), Bayesian BT, and Elo modeling
- Consistent data structures for all providers
Several vignettes are available to demonstrate functionality:

- For basic function usage, see:
- For advanced batch processing workflows, see:
- For information on prompt evaluation and positional-bias diagnostics, see:
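Once pairwiseLLM is installed, you can also browse its vignettes locally using base R's vignette utilities (these are standard utils functions, not part of the pairwiseLLM API):

``` r
# List all vignettes installed with the package
browseVignettes("pairwiseLLM")

# Open a specific vignette by name, e.g. the prompt-template bias vignette
vignette("prompt-template-bias", package = "pairwiseLLM")
```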
The following models are confirmed to work for pairwise comparisons. Other similar models may work, but have not been fully tested.
| Provider | Model | Reasoning Mode? |
|---|---|---|
| OpenAI | gpt-5.2 | ✅ Yes |
| OpenAI | gpt-5.1 | ✅ Yes |
| OpenAI | gpt-4o | ❌ No |
| OpenAI | gpt-4.1 | ❌ No |
| Anthropic | claude-sonnet-4-5 | ✅ Yes |
| Anthropic | claude-haiku-4-5 | ✅ Yes |
| Anthropic | claude-opus-4-5 | ✅ Yes |
| Google/Gemini | gemini-3-pro-preview | ✅ Yes |
| Google/Gemini | gemini-3-flash-preview | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-R1 | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-V3 | ❌ No |
| Moonshot-AI¹ | Kimi-K2-Instruct-0905 | ❌ No |
| Qwen¹ | Qwen3-235B-A22B-Instruct-2507 | ❌ No |
| Qwen² | qwen3:32b | ✅ Yes |
| Google² | gemma3:27b | ❌ No |
| Mistral² | mistral-small3.2:24b | ❌ No |

¹ via the together.ai API
² via Ollama on a local machine
Batch APIs are currently available for OpenAI, Anthropic, and Gemini
only. Models accessed via Together.ai and Ollama are supported for live
comparisons via submit_llm_pairs() / llm_compare_pair().
| Backend | Live | Batch |
|---|---|---|
| openai | ✅ | ✅ |
| anthropic | ✅ | ✅ |
| gemini | ✅ | ✅ |
| together | ✅ | ❌ |
| ollama | ✅ | ❌ |
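For example, a live run against a locally served Ollama model could look like the minimal sketch below. It mirrors the submit_llm_pairs() example shown later in this README; the model name is taken from the table above, and a default local Ollama installation is assumed (no server or host options are shown).

``` r
library(pairwiseLLM)

# Build a small set of pairs from the bundled example data
data("example_writing_samples")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(5, seed = 1)

td   <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

# Live comparisons only: there is no batch API for the ollama backend
res <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "ollama",
  model             = "qwen3:32b",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)
```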
pairwiseLLM is available on CRAN; install it with:

install.packages("pairwiseLLM")

To install the development version from GitHub:

# install.packages("pak")
pak::pak("shmercer/pairwiseLLM")

Load the package:

library(pairwiseLLM)

pairwiseLLM reads API keys only from environment variables.
Keys are never printed, never stored, and never written to disk.
You can verify which providers are available using:

check_llm_api_keys()

This returns a tibble showing whether R can see the required keys for:
- OpenAI
- Anthropic
- Google Gemini
- Together.ai
You may set keys temporarily for the current R session:
Sys.setenv(OPENAI_API_KEY = "your-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-key-here")
Sys.setenv(GEMINI_API_KEY = "your-key-here")
Sys.setenv(TOGETHER_API_KEY = "your-key-here")

However, it is strongly recommended to store keys in your ~/.Renviron file.

Open your .Renviron file:

usethis::edit_r_environ()

Add the following lines:
OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"
GEMINI_API_KEY="your-gemini-key"
TOGETHER_API_KEY="your-together-key"
Save the file, then restart R.
You can confirm that R now sees the keys:

check_llm_api_keys()

At a high level, pairwiseLLM workflows follow this structure:
- Writing samples – e.g., essays, constructed responses, short answers.
- Trait – a rating dimension such as “overall quality” or “organization”.
- Pairs – pairs of samples to be compared for that trait.
- Prompt template – instructions + placeholders for {TRAIT_NAME}, {TRAIT_DESCRIPTION}, {SAMPLE_1}, and {SAMPLE_2}.
- Backend – which provider/model to use (OpenAI, Anthropic, Gemini, Together, Ollama).
- Modeling – convert pairwise results to latent scores via BT or Elo.
The package provides helpers for each step.
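The compressed sketch below shows how those pieces fit together end to end. Each step is covered in more detail in the sections that follow; the backend/model shown are just one choice from the tables above, and the final modeling step assumes the results table carries the ID columns that build_bt_data() expects.

``` r
library(pairwiseLLM)

# 1. Writing samples -> pairs
data("example_writing_samples")
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(20, seed = 42) |>
  randomize_pair_order()

# 2. Trait definition and prompt template
td   <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

# 3. Submit the pairs to an LLM backend (live)
out <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

# 4. Convert comparisons to latent quality scores
bt_data <- build_bt_data(out$results)  # assumes ID1/ID2/better_id columns
bt_fit  <- fit_bt_model(bt_data)
summarize_bt_fit(bt_fit)
```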
pairwiseLLM includes an adaptive pairing workflow for efficiently
ranking writing samples using pairwise comparisons. Instead of
allocating comparisons uniformly at random, adaptive pairing selects
pairs where additional judgments are most informative, concentrating
effort on ambiguous regions (near-ties) to reduce posterior uncertainty
in latent quality estimates and rankings.
In practice, compared to random pairing designs:
- Overall Bayesian EAP reliability can be slightly lower (because comparisons are not spread uniformly),
- but credible/confidence intervals around latent quality scores and rankings are typically tighter.
To get started, see:
- Tutorial: Adaptive Pairing & Ranking
  https://shmercer.github.io/pairwiseLLM/articles/adaptive-pairing.html
- Design spec: Bayesian BTL + Adaptive Pairing Design
  https://shmercer.github.io/pairwiseLLM/articles/bayesian-btl-adaptive-pairing-design.html
pairwiseLLM includes:
- A default template tested for positional bias
- Support for multiple templates stored by name
- User-defined templates via register_prompt_template()
list_prompt_templates()
#> [1] "default" "test1" "test2" "test3" "test4" "test5"tmpl <- get_prompt_template("default")
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the ...

register_prompt_template("my_template", "
Compare two essays for {TRAIT_NAME}…
{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.
SAMPLE 1:
{SAMPLE_1}
SAMPLE 2:
{SAMPLE_2}
<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
")Use it in a submission:
tmpl <- get_prompt_template("my_template")Traits define what “quality” means.
trait_description("overall_quality")
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."You can also provide custom traits:
trait_description(
custom_name = "Clarity",
custom_description = "How understandable, coherent, and well structured the ideas are."
)

Use the unified interface for direct (live) API calls. The submit_llm_pairs()
function supports parallel processing and incremental output
saving for all supported backends (OpenAI, Anthropic, Gemini,
Together, and Ollama).
- llm_compare_pair() – compare one pair
- submit_llm_pairs() – compare many pairs at once
Key Features:

- Parallel Execution: Set parallel = TRUE and workers = n to speed up processing.
- Resume Capability: Provide a save_path (e.g., "results.csv"). The function writes results as they finish. If interrupted, running the command again will automatically skip pairs already present in the file.
- Robust Output: Returns a list containing $results (successful comparisons) and $failed_pairs (scheduled pairs with no observed outcome), plus $failed_attempts (attempt-level failures, including retry/timeout/parse issues), ensuring one bad request doesn’t crash the whole job.
Example:
data("example_writing_samples")
pairs <- example_writing_samples |>
make_pairs() |>
sample_pairs(10, seed = 123) |>
randomize_pair_order()
td <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")
# Run in parallel with incremental saving
res_list <- submit_llm_pairs(
pairs = pairs,
backend = "openai",
model = "gpt-4o",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl,
parallel = TRUE,
workers = 4,
save_path = "live_results.csv"
)
# Inspect successes
head(res_list$results)
# Inspect failures (if any)
if (nrow(res_list$failed_pairs) > 0) {
print(res_list$failed_pairs)
}
# Inspect attempt-level failures (if any)
if (nrow(res_list$failed_attempts) > 0) {
print(res_list$failed_attempts)
}

Most providers give a discount for batch jobs. For large-scale runs, use:
- llm_submit_pairs_batch()
- llm_download_batch_results()
Example:
batch <- llm_submit_pairs_batch(
backend = "anthropic",
model = "claude-sonnet-4-5",
pairs = pairs,
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl
)
results <- llm_download_batch_results(batch)

Before running a large live or batch job, you can estimate token usage
and cost with estimate_llm_pairs_cost(). The estimator:
- Runs a small pilot on n_test pairs (live calls) to observe prompt_tokens and completion_tokens
- Uses the pilot to calibrate a prompt-bytes → input-tokens model for the remaining pairs
- Estimates output tokens for the remaining pairs using the pilot distribution and calculates costs (expected = 50th percentile; budget = 90th percentile).
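As a purely illustrative calculation (the token counts and prices here are made up, not pilot output): if the pilot suggests roughly 2,000 input and 300 output tokens per pair, then 200 pairs need about 400,000 input and 60,000 output tokens; at $3.00 and $12.00 per million tokens that is about $1.20 + $0.72 = $1.92 at live prices, or roughly $0.96 with a 50% batch discount.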
data("example_writing_samples", package = "pairwiseLLM")
pairs <- example_writing_samples |>
make_pairs() |>
sample_pairs(n_pairs = 200, seed = 123) |>
randomize_pair_order(seed = 456)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()
# Estimate cost using a small pilot run (live calls).
# If your provider offers discounted batch pricing, set batch_discount accordingly.
est <- estimate_llm_pairs_cost(
pairs = pairs,
backend = "openai",
model = "gpt-4.1",
endpoint = "chat.completions",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl,
mode = "batch",
batch_discount = 0.5, # e.g., batch costs 50 percent of live
n_test = 10, # number of paid pilot calls
budget_quantile = 0.9, # "budget" uses p90 output tokens
cost_per_million_input = 3.00, # set these to your provider pricing
cost_per_million_output = 12.00
)
est
est$summary

By default, the estimator returns the pilot results and the remaining pairs. This lets you run the pilot once, then submit only the remaining pairs:
# Pairs not included in the pilot:
remaining_pairs <- est$remaining_pairs
# Submit remaining pairs using your preferred workflow (live):
res_live <- submit_llm_pairs(remaining_pairs, backend = "openai", model = "gpt-4.1", ...)
# For batch:
batch <- llm_submit_pairs_batch(
backend = "openai",
model = "gpt-4.1",
pairs = remaining_pairs,
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl)
results <- llm_download_batch_results(batch)

For very large jobs or when you need to restart polling after an interruption, pairwiseLLM provides two convenience helpers that wrap the low-level batch APIs:
- llm_submit_pairs_multi_batch() – divides a table of pairwise comparisons into multiple batch jobs, uploads the input JSONL files, creates the batches, and optionally writes a registry CSV containing all batch IDs and file paths. You can split by specifying either n_segments (number of jobs) or batch_size (maximum number of pairs per job).
- llm_resume_multi_batches() – polls all unfinished batches, downloads and parses the results as soon as each job completes, and optionally writes per-job result CSVs and a single combined CSV with the merged results.
Use these helpers when your dataset is large or if you anticipate having to pause and resume the job.
data("example_writing_samples", package = "pairwiseLLM")
# construct 100 pairs and a trait description
pairs <- example_writing_samples |>
make_pairs() |>
sample_pairs(n_pairs = 100, seed = 123) |>
randomize_pair_order(seed = 456)
td <- trait_description("overall_quality")
tmpl <- set_prompt_template()
# 1. Submit the pairs as 10 separate batches and write a registry CSV to disk.
multi_job <- llm_submit_pairs_multi_batch(
pairs = pairs,
backend = "openai",
model = "gpt-5.2",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl,
n_segments = 10,
output_dir = "directory_name/",
write_registry = TRUE,
include_thoughts = TRUE
)
# 2. Later (or in a new session), resume polling and download results.
res <- llm_resume_multi_batches(
jobs = multi_job$jobs,
interval_seconds = 60,
write_results_csv = TRUE,
write_combined_csv = TRUE,
keep_jsonl = FALSE
)
head(res$combined)

The registry CSV contains all batch IDs and file paths, allowing you to
resume polling with llm_resume_multi_batches() even if the R session
is interrupted.
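For instance, a later R session could rebuild the job list from the registry file. The sketch below assumes that the registry rows, read back with read.csv(), can be passed as the jobs argument, and the file name shown is hypothetical (use whatever path llm_submit_pairs_multi_batch() actually wrote); check the llm_resume_multi_batches() documentation for the exact columns it expects.

``` r
# Hypothetical file name: substitute the registry path written in step 1 above
registry <- read.csv("directory_name/batch_registry.csv")

res <- llm_resume_multi_batches(
  jobs               = registry,
  interval_seconds   = 60,
  write_results_csv  = TRUE,
  write_combined_csv = TRUE
)
```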
LLMs often show a first-position or second-position bias.
pairwiseLLM includes explicit tools for testing this.
pairs_fwd <- make_pairs(example_writing_samples)
pairs_rev <- sample_reverse_pairs(pairs_fwd, reverse_pct = 1.0)

Submit:
# Submit forward pairs
out_fwd <- submit_llm_pairs(pairs_fwd, model = "gpt-4o", backend = "openai", ...)
# Submit reverse pairs
out_rev <- submit_llm_pairs(pairs_rev, model = "gpt-4o", backend = "openai", ...)

Compute bias:
cons <- compute_reverse_consistency(out_fwd$results, out_rev$results)
bias <- check_positional_bias(cons)
cons$summary
bias$summary

Five included templates have been tested across different backend
providers. Complete details are presented in a vignette:
vignette("prompt-template-bias")
# Using the example writing pairs (fully offline; no LLM calls)
data("example_writing_pairs")
# build_bt_data() converts (ID1, ID2, better_id) into the 0/1 format.
bt_ex <- build_bt_data(example_writing_pairs)
# Result has:
# - object1: ID of the first item
# - object2: ID of the second item
# - result : 1 if object1 wins, 0 if object2 wins
head(bt_ex)
bt_fit <- fit_bt_model(bt_ex)
summarize_bt_fit(bt_fit)

data("example_writing_pairs")
elo_data <- build_elo_data(example_writing_pairs)
elo_fit <- fit_elo_model(elo_data, runs = 5)
elo_fit$elo
elo_fit$reliability
elo_fit$reliability_weighted

pairwiseLLM fits rankings using Bayesian Bradley–Terry–Luce (BTL)
models. These models estimate a latent quality parameter for each item
based on pairwise comparison outcomes, while providing uncertainty
estimates and principled stopping diagnostics.
The package supports four closely related BTL variants, differing in how they model LLM judge behavior.
All models estimate one latent quality parameter per item. They differ only in whether they include:
- a lapse (random-response) rate
- a position (order) bias
| Model | Lapse | Position bias | Description |
|---|---|---|---|
| btl | ✗ | ✗ | Standard Bradley–Terry–Luce |
| btl_e | ✓ | ✗ | BTL with lapse (random responding) |
| btl_b | ✗ | ✓ | BTL with position bias |
| btl_e_b | ✓ | ✓ | BTL with both lapse and position bias (default) |
Recommended default: btl_e_b. This is the most robust option when
the judge is an LLM or other noisy rater.
- Lapse (ε): Useful when judgments occasionally appear random or inconsistent. The lapse parameter absorbs these errors without distorting item-level quality estimates.
- Position bias (b): Useful when the judge systematically prefers the first or second item presented. This is especially important when prompts present items in a fixed order.
If you are confident that neither effect is present, you can use the
simpler btl model.
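For intuition, one common way to write the full model is shown below. This is a sketch of the standard lapse-plus-order-bias extension of Bradley–Terry–Luce; the exact parameterization used by pairwiseLLM is documented in the design-spec vignette linked later in this README.

$$
\Pr(\text{first-listed item } i \text{ beats } j) \;=\; \frac{\varepsilon}{2} + (1 - \varepsilon)\,\frac{\exp(\theta_i + b)}{\exp(\theta_i + b) + \exp(\theta_j)}
$$

Here $\theta_i$ is item $i$'s latent quality, $\varepsilon$ is the lapse rate (the probability of a random response), and $b$ is the advantage or penalty attached to the first position. Setting $b = 0$ gives btl_e, setting $\varepsilon = 0$ gives btl_b, and setting both to zero recovers the plain btl model.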
You can fit a Bayesian BTL model directly from pairwise comparison data, without using adaptive pairing.
data("example_writing_results")
# Generate a vector of all unique sample IDs
ids <- sort(unique(c(example_writing_results$A_id, example_writing_results$B_id)))
fit <- fit_bayes_btl_mcmc(
results = example_writing_results,
ids = ids,
model_variant = "btl_e_b"
)

This fits the model using MCMC via cmdstanr and returns posterior
samples and summaries.
Posterior summaries for items can be extracted using helper functions:
item_summary <- summarize_items(fit)
head(item_summary)

Typical outputs include:
- posterior mean latent quality,
- credible intervals,
- induced ranks,
- uncertainty measures.
You can also inspect convergence and diagnostics:
summarize_refits(fit)

This reports:
- R-hat,
- effective sample sizes,
- divergence counts,
- reliability metrics.
When using adaptive pairing (adaptive_rank()), the same Bayesian BTL
models are fit intermittently during the run:
- Pair selection is guided by the fast TrueSkill model.
- Bayesian BTL refits provide:
  - uncertainty estimates,
  - diagnostics,
  - stopping decisions,
  - and late-stage adaptation signals.
You can therefore:
- fit Bayesian BTL models standalone for analysis,
- or use them as part of an adaptive ranking workflow.
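As a rough illustration of the second option, an adaptive run might look like the sketch below. Only the function name adaptive_rank() comes from this README; every argument shown (samples, backend, model, trait settings) is hypothetical and is used purely to indicate the kind of inputs involved. See the tutorial linked below for the documented interface.

``` r
library(pairwiseLLM)
data("example_writing_samples")
td <- trait_description("overall_quality")

# HYPOTHETICAL sketch: these argument names are illustrative only,
# not the actual adaptive_rank() signature.
ranking <- adaptive_rank(
  samples           = example_writing_samples,  # items to rank
  backend           = "openai",                 # LLM judge
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description
  # ...plus stopping/reliability settings documented in the tutorial
)
```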
For a full tutorial on adaptive pairing, see:
- Adaptive Pairing & Ranking https://shmercer.github.io/pairwiseLLM/articles/adaptive-pairing.html
For a detailed technical specification of the Bayesian and adaptive algorithms, see:
- Bayesian BTL + Adaptive Pairing Design (v3.1) https://shmercer.github.io/pairwiseLLM/articles/bayesian-btl-adaptive-pairing-design.html
| Workflow | Use Case | Functions |
|---|---|---|
| Live | small or interactive runs | submit_llm_pairs, llm_compare_pair |
| Batch | large jobs, cost control | llm_submit_pairs_batch, llm_download_batch_results |
Contributions to pairwiseLLM are very welcome!
- Bug reports (with reproducible examples when possible)
- Feature requests, ideas, and discussion
- Pull requests improving:
  - functionality
  - documentation
  - examples / vignettes
  - test coverage
- Backend integrations (e.g., additional LLM providers or local inference engines)
- Modeling extensions
If you encounter a problem:
- Run: devtools::session_info()
- Include:
  - reproducible code
  - the error message
  - the model/backend involved
  - your operating system
- Open an issue at: https://github.com/shmercer/pairwiseLLM/issues
MIT License. See LICENSE.
- Sterett H. Mercer – University of British Columbia
UBC Faculty Profile: https://ecps.educ.ubc.ca/sterett-h-mercer/
ResearchGate: https://www.researchgate.net/profile/Sterett_Mercer
Google Scholar: https://scholar.google.ca/citations?user=YJg4svsAAAAJ&hl=en
Mercer, S. H. (2026). pairwiseLLM: Pairwise writing quality comparisons with large language models (Version 1.3.0) [R package; Computer software]. https://github.com/shmercer/pairwiseLLM
