This is the implementation for the paper HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning. We propose a framework that trains LLMs to generate automatic test harnesses.
conda create -n harness python=3.10 -y
conda activate harness
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
conda install conda-forge::python-prctl

First, generate test cases with a model on a specific dataset:
bash bashes/generate.sh {model} {data} {prompt} {port}
where {prompt} can be either harness or inputoutput. For example, to run our model on the LCB Seen version of the dataset, run:
bash bashes/generate.sh Shiyu-Lab/HarnessLLM_RL_Qwen3_4B Shiyu-Lab/Testcase_LCB_Seen harness 30000
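Both prompt styles can be swept with a small loop. The sketch below is a dry run that only prints the commands it would launch (model, dataset, and port are taken from the example above):

```shell
# Dry-run sketch: build the generate.sh invocation for each prompt style.
# MODEL/DATA/PORT are copied from the example above; replace the final
# printf with actual execution of each command to launch generation.
MODEL=Shiyu-Lab/HarnessLLM_RL_Qwen3_4B
DATA=Shiyu-Lab/Testcase_LCB_Seen
PORT=30000
CMDS=""
for PROMPT in harness inputoutput; do
  CMDS="${CMDS}bash bashes/generate.sh $MODEL $DATA $PROMPT $PORT
"
done
printf '%s' "$CMDS"
```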
Next, evaluate the generated test cases with:
python -m scripts.eval {model} {prompt} --data_path {data}
For example, to evaluate the above-generated test cases, run:
python -m scripts.eval Shiyu-Lab/HarnessLLM_RL_Qwen3_4B harness --data_path Shiyu-Lab/Testcase_LCB_Seen
You can also run bash bashes/generate_and_eval_testcase.sh to evaluate on all datasets.
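For a single dataset, generation and evaluation can also be chained in a small driver script. The sketch below is a dry run that only writes out the two commands (names mirror the examples above):

```shell
# Dry-run sketch: record the generate + eval pipeline for one dataset.
# MODEL/DATA/PROMPT/PORT mirror the examples above; execute the recorded
# lines (instead of just printing them) to actually run the pipeline.
MODEL=Shiyu-Lab/HarnessLLM_RL_Qwen3_4B
DATA=Shiyu-Lab/Testcase_LCB_Seen
PROMPT=harness
PORT=30000
echo "bash bashes/generate.sh $MODEL $DATA $PROMPT $PORT" > pipeline.txt
echo "python -m scripts.eval $MODEL $PROMPT --data_path $DATA" >> pipeline.txt
cat pipeline.txt
```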
We use LLaMA-Factory for SFT training.
You can download our training data from Hugging Face. The config file we use for SFT training is scripts/qwen3_sft.yaml; follow the instructions in the LLaMA-Factory repository to launch training.
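For orientation, a LLaMA-Factory SFT config generally has the shape sketched below. All field values here are illustrative placeholders, not the contents of scripts/qwen3_sft.yaml; consult that file for the actual settings:

```yaml
### model (values are illustrative, not the real qwen3_sft.yaml)
model_name_or_path: Qwen/Qwen3-4B

### method
stage: sft
do_train: true
finetuning_type: full

### dataset
dataset: harness_sft_data   # hypothetical name; register the downloaded data in data/dataset_info.json
template: qwen
cutoff_len: 4096

### output
output_dir: saves/qwen3-4b-harness-sft

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 2.0
bf16: true
```

Training with such a config is typically launched via LLaMA-Factory's CLI, e.g. llamafactory-cli train scripts/qwen3_sft.yaml.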
For RL training, use the script examples/grpo_trainer/run_harness.sh to train our model.
@misc{liu2025harnessllmautomatictestingharness,
  title={HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning},
  author={Yujian Liu and Jiabao Ji and Yang Zhang and Wenbo Guo and Tommi Jaakkola and Shiyu Chang},
  year={2025},
  eprint={2511.01104},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2511.01104},
}
Our implementation is based on verl.