feat: pointwise grpo training using common grader datasets and prompt…#118

Open
jc200808 wants to merge 2 commits into agentscope-ai:main from jc200808:feat-dev-260210

Conversation

@jc200808
Collaborator

… templates

OpenJudge Version

[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@gemini-code-assist
Contributor

Summary of Changes

Hello @jc200808, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the GRPO training framework by introducing dedicated support for pointwise training using common OpenJudge grader datasets and their associated prompt templates. It provides a comprehensive pipeline, from data preprocessing and flexible dataset loading to a custom reward function, enabling the training of judge models that can score single responses based on various criteria like correctness, hallucination, and relevance.

Highlights

  • New GRPO Training Script: Introduced a new GRPO training script (run_pointwise_grader.sh) specifically designed for OpenJudge grader datasets and prompt templates, enabling pointwise judge model training.
  • Flexible Grader Dataset Handling: Implemented grader_rl_dataset.py which includes BaseChatRLDataset and PointwiseChatRLDataset to flexibly load and preprocess diverse grader datasets, supporting both old and new JSON structures and various prompt templates.
  • Custom Pointwise Reward Function: Added grader_reward_fn.py to define a custom reward function for pointwise scoring, capable of extracting scores from model responses (including handling 'thinking' tags and <score> tags) and calculating rewards based on prediction accuracy.
  • Grader Data Preprocessing Utility: Provided preprocess_grader_data.py as a utility to prepare raw grader data, transforming it into pointwise positive/negative examples and splitting it into training and validation sets in a suitable format.
  • Updated Documentation: Updated the GRPO README.md with clear instructions for setting up and running the new pointwise training using grader datasets.
Changelog
  • cookbooks/training_judge_model/grpo/README.md
    • Added a new section "GRPO training using OpenJudge grader dataset" with instructions and a command to run the new run_pointwise_grader.sh script.
  • cookbooks/training_judge_model/grpo/grader_rl_dataset.py
    • Added DatasetConfig dataclass for dataset configuration.
    • Added BaseChatRLDataset as a base class for RL datasets, handling file loading, format detection, prompt filtering, and tokenization.
    • Added PointwiseChatRLDataset inheriting from BaseChatRLDataset, specifically tailored for pointwise scoring datasets, including logic to build chat messages and format prompts using OpenJudge grader templates (correctness, hallucination, relevance, etc.), and extract ground truth scores from various data structures.
  • cookbooks/training_judge_model/grpo/pointwise/grader_reward_fn.py
    • Added filter_thinking_parts function to remove internal "thinking" tags from model responses.
    • Added extract_score function to parse scores from <score> tags or JSON objects within model responses.
    • Added calculate_reward function to compute rewards based on the difference between predicted and true scores, supporting both binary and multi-class scoring.
    • Added compute_score function as the main entry point for reward calculation, integrating the filtering, extraction, and reward logic.
  • cookbooks/training_judge_model/grpo/pointwise/run_pointwise_grader.sh
    • Added a new bash script for configuring and launching GRPO pointwise training jobs on a Ray cluster.
    • Configured paths for model, training/validation data, output, and custom modules (grader_reward_fn.py, grader_rl_dataset.py).
    • Set various hyperparameters for data handling (batch sizes, max lengths), optimizer, GRPO specifics (rollout samples), and training (epochs, save/test frequency).
    • Included Ray job submission command with detailed parameter passing for the custom dataset and reward function.
  • cookbooks/training_judge_model/grpo/pointwise/utils/preprocess_grader_data.py
    • Added utility functions get_data_files, get_split_params, and validate_files for interactive user input and file validation.
    • Added process_single_file function to read JSON data, transform it into pointwise positive/negative examples (with scores), shuffle, sample, and split into _train.jsonl and _val.jsonl files.
    • Implemented main function to orchestrate the data preprocessing workflow.
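The reward pipeline described above (filter thinking tags, extract the score, compute the reward) can be sketched as follows. This is a minimal illustration based on the function names listed in the changelog; the actual signatures, tag formats, and reward curve in grader_reward_fn.py may differ.

```python
import re


def filter_thinking_parts(response_text: str) -> str:
    """Strip <think>...</think> blocks so score parsing only sees the final answer."""
    return re.sub(r"<think>.*?</think>", "", response_text, flags=re.DOTALL).strip()


def extract_score(response_text: str):
    """Parse an integer score from a <score> tag; return None when absent or malformed."""
    match = re.search(r"<score>(.*?)</score>", response_text, re.DOTALL)
    if not match:
        return None
    try:
        return int(match.group(1).strip())
    except ValueError:
        return None


def calculate_reward(pred_score: int, true_score: int, max_gap: int = 4) -> float:
    """Reward 1.0 for an exact match, decaying linearly with the score gap."""
    return max(0.0, 1.0 - abs(pred_score - true_score) / max_gap)


def compute_score(response_text, ground_truth) -> dict:
    """Entry point: filter thinking, extract the score, compute the reward."""
    cleaned = filter_thinking_parts(str(response_text))
    pred = extract_score(cleaned)
    if pred is None:
        return {"score": 0.0}  # unparseable response gets zero reward
    return {"score": calculate_reward(pred, int(ground_truth))}
```

The dict return shape (with the reward under "score") matches what verl's reward manager expects, as discussed later in this thread.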
Activity
  • No specific activity (comments, reviews, progress) was provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces GRPO pointwise training using OpenJudge grader datasets, adding a new dataset loader, reward function, preprocessing script, and training script. Critically, it introduces several security vulnerabilities, including a prompt injection vulnerability in the dataset processing logic and multiple command injection vulnerabilities in the training shell script due to unquoted environment variables, which could allow an attacker to manipulate the training process or execute arbitrary commands. The review also provides feedback on improving code quality, consistency, and robustness, with suggestions for specific exception handling, reducing code duplication, standardizing logging practices, and refining the reward calculation logic.

# Extract score from <score> tag
score_pattern = r"<score>(.*?)</score>"
match = re.search(score_pattern, response_text, re.DOTALL)
pprint(f"response_text: {response_text}")
Contributor


medium

Using pprint (and print on line 143) for logging and error reporting is not ideal in library code. It's better to use the standard logging module. This provides more control over verbosity and output streams. Please consider adding a logger at the module level (e.g., logger = logging.getLogger(__name__)) and using logger.debug() here and logger.error() on line 143.
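The suggested change could look like this: a module-level logger replacing the pprint/print calls, with debug-level output for the raw response and error-level output for parse failures. The body shown here is a sketch of the quoted excerpt, not the PR's full extract_score implementation.

```python
import logging
import re

# Module-level logger, as suggested, instead of pprint/print.
logger = logging.getLogger(__name__)


def extract_score(response_text: str):
    # Extract score from <score> tag
    score_pattern = r"<score>(.*?)</score>"
    match = re.search(score_pattern, response_text, re.DOTALL)
    logger.debug("response_text: %s", response_text)  # was: pprint(f"response_text: ...")
    if not match:
        logger.error("no <score> tag found in response")  # was: print(...)
        return None
    return match.group(1).strip()
```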

response_text = str(response_text)

# Extract score from <score> tag
score_pattern = r"<score>(.*?)</score>"
Collaborator


Is it possible for the score to be a float, like 0.8?

Collaborator Author

@jc200808 jc200808 Feb 20, 2026


Generally speaking, it is possible. Since our graders are defined with two kinds of strategies, binary (0/1) and multi-class (1-5 scale), this custom reward function only considers int scores at the moment, but the approach should work for other types like float.
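A sketch of how the two grading strategies described here could share one reward function. The exact-match rule for binary and the linear decay for multi-class are illustrative assumptions, not necessarily the PR's actual formula.

```python
def calculate_reward(pred_score: int, true_score: int, is_binary: bool) -> float:
    """Binary (0/1) graders reward exact matches; multi-class (1-5) decays with distance."""
    if is_binary:
        return 1.0 if pred_score == true_score else 0.0
    # Multi-class: linear decay over the 1-5 scale (maximum possible gap is 4).
    return max(0.0, 1.0 - abs(pred_score - true_score) / 4.0)
```

Because the multi-class branch only divides by the maximum gap, the same shape would extend to float scores with no structural change.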

true_score = int(ground_truth)
else:
# If ground_truth is unavailable, try to get from extra_info
if extra_info and isinstance(extra_info, dict):
Collaborator


Is there any case covering this? e.g., extra_info being a JSON-format string?

Collaborator Author

@jc200808 jc200808 Feb 20, 2026


In some datasets, the extra_info field may include the ground truth, e.g.
{'index': 776, 'is_chosen': True, 'split': 'train', 'subset': 'Factuality', 'task_category': 'comparison', 'unique_id': 'eb0d805d16e718e4'}
For more information, refer to https://verl.readthedocs.io/en/latest/preparation/reward_function.html
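A fallback covering both a dict and a JSON-string extra_info (the case the reviewer raises) could look like the sketch below. The helper name resolve_ground_truth and the is_chosen-to-binary mapping are hypothetical illustrations, not the PR's actual code.

```python
import json


def resolve_ground_truth(ground_truth, extra_info):
    """Prefer ground_truth; otherwise fall back to extra_info, tolerating a JSON string."""
    if ground_truth is not None:
        return int(ground_truth)
    if isinstance(extra_info, str):
        # Some pipelines serialize extra_info as a JSON string; decode it first.
        extra_info = json.loads(extra_info)
    if isinstance(extra_info, dict) and "is_chosen" in extra_info:
        # Map the chosen/rejected flag onto the binary 0/1 scale.
        return int(bool(extra_info["is_chosen"]))
    raise ValueError("no ground truth available in ground_truth or extra_info")
```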


# Return detailed information
return {
"score": reward,
Collaborator


"score": reward --> "reward": reward ?

Collaborator Author


The key name must be "score"; otherwise verl fails in verl/workers/reward_manager/naive.py, line 97, in __call__:

reward = score["score"]
         ~~~~~^^^^^^^^^
KeyError: 'score'
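A minimal illustration of the contract that verl's naive reward manager imposes on the returned dict. The helper name and the extra fields are illustrative; only the "score" key is required.

```python
def build_reward_output(reward: float, pred_score: int, true_score: int) -> dict:
    # verl/workers/reward_manager/naive.py does `reward = score["score"]`,
    # so the reward must live under the "score" key; other fields are optional extras.
    return {"score": reward, "pred_score": pred_score, "true_score": true_score}
```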

