feat: pointwise grpo training using common grader datasets and prompt templates #118
jc200808 wants to merge 2 commits into agentscope-ai:main
Conversation
Summary of Changes

Hello @jc200808, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the GRPO training framework by introducing dedicated support for pointwise training using common OpenJudge grader datasets and their associated prompt templates. It provides a comprehensive pipeline, from data preprocessing and flexible dataset loading to a custom reward function, enabling the training of judge models that can score single responses based on various criteria such as correctness, hallucination, and relevance.

Highlights
Changelog
Activity
Code Review
This pull request introduces GRPO pointwise training using OpenJudge grader datasets, adding a new dataset loader, reward function, preprocessing script, and training script. Critically, it introduces several security vulnerabilities, including a prompt injection vulnerability in the dataset processing logic and multiple command injection vulnerabilities in the training shell script due to unquoted environment variables, which could allow an attacker to manipulate the training process or execute arbitrary commands. The review also provides feedback on improving code quality, consistency, and robustness, with suggestions for specific exception handling, reducing code duplication, standardizing logging practices, and refining the reward calculation logic.
cookbooks/training_judge_model/grpo/pointwise/run_pointwise_grader.sh (resolved review comments)
```python
# Extract score from <score> tag
score_pattern = r"<score>(.*?)</score>"
match = re.search(score_pattern, response_text, re.DOTALL)
pprint(f"response_text: {response_text}")
```
Using pprint (and print on line 143) for logging and error reporting is not ideal in library code. It's better to use the standard logging module. This provides more control over verbosity and output streams. Please consider adding a logger at the module level (e.g., logger = logging.getLogger(__name__)) and using logger.debug() here and logger.error() on line 143.
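A minimal sketch of the suggested change, assuming the module currently extracts the score with `pprint`/`print` calls (the helper name `extract_score` is illustrative, not part of the PR):

```python
import logging
import re

# Module-level logger, as suggested in the review
logger = logging.getLogger(__name__)

def extract_score(response_text: str):
    """Extract the integer score from a <score> tag, logging instead of printing."""
    logger.debug("response_text: %s", response_text)
    match = re.search(r"<score>(.*?)</score>", response_text, re.DOTALL)
    if match is None:
        # Previously reported via print(); logger.error routes it through logging
        logger.error("No <score> tag found in response")
        return None
    return int(match.group(1).strip())
```

With this, verbosity can be tuned per deployment via `logging.basicConfig(level=...)` instead of editing print statements.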
cookbooks/training_judge_model/grpo/pointwise/grader_reward_fn.py (resolved review comment)
```python
response_text = str(response_text)

# Extract score from <score> tag
score_pattern = r"<score>(.*?)</score>"
```
Is it possible for the score to be a float, like 0.8?
Generally speaking, yes, it is possible. For this custom reward function, our graders are defined with two kinds of strategies, binary (0/1) and multi-level (1-5 scale), so we only consider integer scores at the moment; but the approach should work for other types such as floats.
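If float scores ever need to be supported, the extraction could try integer parsing first and fall back to float; a hedged sketch (the helper name `parse_score` is illustrative):

```python
import re

def parse_score(response_text: str):
    """Parse the value inside <score>...</score> as int when possible, else float."""
    match = re.search(r"<score>(.*?)</score>", response_text, re.DOTALL)
    if match is None:
        return None
    raw = match.group(1).strip()
    try:
        # Binary (0/1) and 1-5 scale graders produce integers
        return int(raw)
    except ValueError:
        try:
            # Tolerate float-valued scores such as 0.8
            return float(raw)
        except ValueError:
            return None
```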
```python
true_score = int(ground_truth)
else:
    # If ground_truth is unavailable, try to get from extra_info
    if extra_info and isinstance(extra_info, dict):
```
Is there any case covering this? E.g., `extra_info` being a JSON-format string?
In some datasets, the extra_info field may include the ground truth, e.g.
{'index': 776, 'is_chosen': True, 'split': 'train', 'subset': 'Factuality', 'task_category': 'comparison', 'unique_id': 'eb0d805d16e718e4'}
For more information, refer to https://verl.readthedocs.io/en/latest/preparation/reward_function.html
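One way to also cover the case where `extra_info` arrives as a JSON-formatted string rather than a dict; a sketch under that assumption (the function name and the `ground_truth` key are illustrative):

```python
import json

def resolve_ground_truth(ground_truth, extra_info):
    """Prefer the explicit ground_truth; otherwise look inside extra_info,
    accepting either a dict or a JSON-encoded string."""
    if ground_truth is not None:
        return int(ground_truth)
    if isinstance(extra_info, str):
        try:
            # Handle extra_info passed as a JSON string
            extra_info = json.loads(extra_info)
        except json.JSONDecodeError:
            return None
    if isinstance(extra_info, dict) and "ground_truth" in extra_info:
        return int(extra_info["ground_truth"])
    return None
```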
```python
# Return detailed information
return {
    "score": reward,
```
"score": reward --> "reward": reward ?
There was a problem hiding this comment.
The key name must be "score"; otherwise verl raises:

```
File "verl/workers/reward_manager/naive.py", line 97, in __call__
    reward = score["score"]
             ~~~~~^^^^^^^^^
KeyError: 'score'
```
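Since verl's naive reward manager reads `score["score"]`, the reward function's return dict must use that exact key; a minimal sketch (the extra diagnostic fields and function name are illustrative):

```python
def build_reward_dict(reward: float, true_score: int, pred_score: int) -> dict:
    """Return detailed reward info; 'score' is the key verl's reward manager reads."""
    return {
        "score": reward,           # required key: verl does reward = score["score"]
        "pred_score": pred_score,  # illustrative extra diagnostics
        "true_score": true_score,
    }
```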
OpenJudge Version
[The version of OpenJudge you are working on, e.g. `import openjudge; print(openjudge.__version__)`]

Description
[Please describe the background, purpose, changes made, and how to test this PR]
Checklist
Please check the following items before code is ready to be reviewed.
`pre-commit run --all-files` command