feat: pointwise grpo training using common grader datasets and prompt…#118

Open
jc200808 wants to merge 2 commits into agentscope-ai:main from jc200808:feat-dev-260210

Conversation

@jc200808
Collaborator

… templates

OpenJudge Version

[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@gemini-code-assist
Contributor

Summary of Changes

Hello @jc200808, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the GRPO training framework by introducing dedicated support for pointwise training using common OpenJudge grader datasets and their associated prompt templates. It provides a comprehensive pipeline, from data preprocessing and flexible dataset loading to a custom reward function, enabling the training of judge models that can score single responses based on various criteria like correctness, hallucination, and relevance.

Highlights

  • New GRPO Training Script: Introduced a new GRPO training script (run_pointwise_grader.sh) specifically designed for OpenJudge grader datasets and prompt templates, enabling pointwise judge model training.
  • Flexible Grader Dataset Handling: Implemented grader_rl_dataset.py which includes BaseChatRLDataset and PointwiseChatRLDataset to flexibly load and preprocess diverse grader datasets, supporting both old and new JSON structures and various prompt templates.
  • Custom Pointwise Reward Function: Added grader_reward_fn.py to define a custom reward function for pointwise scoring, capable of extracting scores from model responses (including handling 'thinking' tags and <score> tags) and calculating rewards based on prediction accuracy.
  • Grader Data Preprocessing Utility: Provided preprocess_grader_data.py as a utility to prepare raw grader data, transforming it into pointwise positive/negative examples and splitting it into training and validation sets in a suitable format.
  • Updated Documentation: Updated the GRPO README.md with clear instructions for setting up and running the new pointwise training using grader datasets.
Changelog
  • cookbooks/training_judge_model/grpo/README.md
    • Added a new section "GRPO training using OpenJudge grader dataset" with instructions and a command to run the new run_pointwise_grader.sh script.
  • cookbooks/training_judge_model/grpo/grader_rl_dataset.py
    • Added DatasetConfig dataclass for dataset configuration.
    • Added BaseChatRLDataset as a base class for RL datasets, handling file loading, format detection, prompt filtering, and tokenization.
    • Added PointwiseChatRLDataset inheriting from BaseChatRLDataset, specifically tailored for pointwise scoring datasets, including logic to build chat messages and format prompts using OpenJudge grader templates (correctness, hallucination, relevance, etc.), and extract ground truth scores from various data structures.
  • cookbooks/training_judge_model/grpo/pointwise/grader_reward_fn.py
    • Added filter_thinking_parts function to remove internal "thinking" tags from model responses.
    • Added extract_score function to parse scores from <score> tags or JSON objects within model responses.
    • Added calculate_reward function to compute rewards based on the difference between predicted and true scores, supporting both binary and multi-class scoring.
    • Added compute_score function as the main entry point for reward calculation, integrating the filtering, extraction, and reward logic.
  • cookbooks/training_judge_model/grpo/pointwise/run_pointwise_grader.sh
    • Added a new bash script for configuring and launching GRPO pointwise training jobs on a Ray cluster.
    • Configured paths for model, training/validation data, output, and custom modules (grader_reward_fn.py, grader_rl_dataset.py).
    • Set various hyperparameters for data handling (batch sizes, max lengths), optimizer, GRPO specifics (rollout samples), and training (epochs, save/test frequency).
    • Included Ray job submission command with detailed parameter passing for the custom dataset and reward function.
  • cookbooks/training_judge_model/grpo/pointwise/utils/preprocess_grader_data.py
    • Added utility functions get_data_files, get_split_params, and validate_files for interactive user input and file validation.
    • Added process_single_file function to read JSON data, transform it into pointwise positive/negative examples (with scores), shuffle, sample, and split into _train.jsonl and _val.jsonl files.
    • Implemented main function to orchestrate the data preprocessing workflow.
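The reward pipeline described above (filter thinking tags, extract the score, compute the reward) can be sketched as follows. This is a minimal illustration based on the function names listed in the changelog; the actual signatures, tag formats, and reward curve in grader_reward_fn.py may differ.

```python
import re


def filter_thinking_parts(response_text: str) -> str:
    """Strip <think>...</think> blocks so score parsing only sees the final answer."""
    return re.sub(r"<think>.*?</think>", "", response_text, flags=re.DOTALL).strip()


def extract_score(response_text: str):
    """Parse an integer score from a <score> tag; return None when absent or malformed."""
    match = re.search(r"<score>(.*?)</score>", response_text, re.DOTALL)
    if not match:
        return None
    try:
        return int(match.group(1).strip())
    except ValueError:
        return None


def calculate_reward(pred_score: int, true_score: int, max_gap: int = 4) -> float:
    """Reward 1.0 for an exact match, decaying linearly with the score gap."""
    return max(0.0, 1.0 - abs(pred_score - true_score) / max_gap)


def compute_score(response_text, ground_truth) -> dict:
    """Entry point: filter thinking, extract the score, compute the reward."""
    cleaned = filter_thinking_parts(str(response_text))
    pred = extract_score(cleaned)
    if pred is None:
        return {"score": 0.0}  # unparseable response gets zero reward
    return {"score": calculate_reward(pred, int(ground_truth))}
```

The dict return shape (with the reward under "score") matches what verl's reward manager expects, as discussed later in this thread.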
Activity
  • No specific activity (comments, reviews, progress) was provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces GRPO pointwise training using OpenJudge grader datasets, adding a new dataset loader, reward function, preprocessing script, and training script. Critically, it introduces several security vulnerabilities, including a prompt injection vulnerability in the dataset processing logic and multiple command injection vulnerabilities in the training shell script due to unquoted environment variables, which could allow an attacker to manipulate the training process or execute arbitrary commands. The review also provides feedback on improving code quality, consistency, and robustness, with suggestions for specific exception handling, reducing code duplication, standardizing logging practices, and refining the reward calculation logic.

# Extract score from <score> tag
score_pattern = r"<score>(.*?)</score>"
match = re.search(score_pattern, response_text, re.DOTALL)
pprint(f"response_text: {response_text}")
Contributor


medium

Using pprint (and print on line 143) for logging and error reporting is not ideal in library code. It's better to use the standard logging module. This provides more control over verbosity and output streams. Please consider adding a logger at the module level (e.g., logger = logging.getLogger(__name__)) and using logger.debug() here and logger.error() on line 143.
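The suggested change could look like this: a module-level logger replacing the pprint/print calls, with debug-level output for the raw response and error-level output for parse failures. The body shown here is a sketch of the quoted excerpt, not the PR's full extract_score implementation.

```python
import logging
import re

# Module-level logger, as suggested, instead of pprint/print.
logger = logging.getLogger(__name__)


def extract_score(response_text: str):
    # Extract score from <score> tag
    score_pattern = r"<score>(.*?)</score>"
    match = re.search(score_pattern, response_text, re.DOTALL)
    logger.debug("response_text: %s", response_text)  # was: pprint(f"response_text: ...")
    if not match:
        logger.error("no <score> tag found in response")  # was: print(...)
        return None
    return match.group(1).strip()
```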

response_text = str(response_text)

# Extract score from <score> tag
score_pattern = r"<score>(.*?)</score>"
Collaborator


Is it possible for the score to be a float, like 0.8?

Collaborator Author

@jc200808 jc200808 Feb 20, 2026


Generally speaking, it is possible. Since our graders are defined with two kinds of strategies, binary (0/1) and multi-class (1-5 scale), this custom reward function only considers int scores at the moment, but the approach should work for other types like float.
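A sketch of how the two grading strategies described here could share one reward function. The exact-match rule for binary and the linear decay for multi-class are illustrative assumptions, not necessarily the PR's actual formula.

```python
def calculate_reward(pred_score: int, true_score: int, is_binary: bool) -> float:
    """Binary (0/1) graders reward exact matches; multi-class (1-5) decays with distance."""
    if is_binary:
        return 1.0 if pred_score == true_score else 0.0
    # Multi-class: linear decay over the 1-5 scale (maximum possible gap is 4).
    return max(0.0, 1.0 - abs(pred_score - true_score) / 4.0)
```

Because the multi-class branch only divides by the maximum gap, the same shape would extend to float scores with no structural change.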

true_score = int(ground_truth)
else:
# If ground_truth is unavailable, try to get from extra_info
if extra_info and isinstance(extra_info, dict):
Collaborator


Is there any case covering this? e.g., extra_info being a JSON-format string?

Collaborator Author

@jc200808 jc200808 Feb 20, 2026


In some datasets, the extra_info field may include the ground truth, e.g.
{'index': 776, 'is_chosen': True, 'split': 'train', 'subset': 'Factuality', 'task_category': 'comparison', 'unique_id': 'eb0d805d16e718e4'}
For more information, refer to https://verl.readthedocs.io/en/latest/preparation/reward_function.html
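A fallback covering both a dict and a JSON-string extra_info (the case the reviewer raises) could look like the sketch below. The helper name resolve_ground_truth and the is_chosen-to-binary mapping are hypothetical illustrations, not the PR's actual code.

```python
import json


def resolve_ground_truth(ground_truth, extra_info):
    """Prefer ground_truth; otherwise fall back to extra_info, tolerating a JSON string."""
    if ground_truth is not None:
        return int(ground_truth)
    if isinstance(extra_info, str):
        # Some pipelines serialize extra_info as a JSON string; decode it first.
        extra_info = json.loads(extra_info)
    if isinstance(extra_info, dict) and "is_chosen" in extra_info:
        # Map the chosen/rejected flag onto the binary 0/1 scale.
        return int(bool(extra_info["is_chosen"]))
    raise ValueError("no ground truth available in ground_truth or extra_info")
```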


# Return detailed information
return {
"score": reward,
Collaborator


"score": reward --> "reward": reward ?

Collaborator Author


The key name must be "score"; otherwise verl fails in verl/workers/reward_manager/naive.py, line 97, in __call__:

reward = score["score"]
         ~~~~~^^^^^^^^^
KeyError: 'score'
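A minimal illustration of the contract that verl's naive reward manager imposes on the returned dict. The helper name and the extra fields are illustrative; only the "score" key is required.

```python
def build_reward_output(reward: float, pred_score: int, true_score: int) -> dict:
    # verl/workers/reward_manager/naive.py does `reward = score["score"]`,
    # so the reward must live under the "score" key; other fields are optional extras.
    return {"score": reward, "pred_score": pred_score, "true_score": true_score}
```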

