Eval harness: Tinker proxy scoring, batched feedback, configurable base model #14
Conversation
…se model

Fix eval harness for the batched feedback API (broken since PR #12) and add Tinker proxy support so evals can run against the Tinker stack without vLLM.

- Add POST /v1/score endpoint to Tinker proxy for logprob scoring
- Rewrite _submit_feedback to use FeedbackBatchRequest with rollout_logprobs
- Add --proxy-url, --base-model, --batch-size CLI flags
- Add proxy-based logprob scoring path (fetch_response_logprob_sum_via_proxy)
- Add rollout collection from proxy cache (_generate_and_collect)
- Skip vLLM LoRA load/reload when in Tinker mode
- Use config.base_model instead of hardcoded Qwen/Qwen3-8B in _init_lora
- Add unit tests for /v1/score endpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
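For illustration, scoring a completion through the new proxy endpoint could look roughly like the sketch below. The field names (`messages`, `completion`, `logprob_sum`) follow the review discussion later in this thread; the exact request/response schema in the repository may differ.

```python
# Hedged sketch: client-side call to the Tinker proxy's POST /v1/score endpoint.
# The payload and response fields are taken from the review discussion below and
# may not match the repository's final schema exactly.
import asyncio

import httpx


async def score_completion(proxy_url: str, prompt: str, completion: str) -> float:
    """Return the summed logprob the proxy assigns to `completion` given `prompt`."""
    async with httpx.AsyncClient(base_url=proxy_url, timeout=60.0) as client:
        resp = await client.post(
            "/v1/score",
            json={
                "messages": [{"role": "user", "content": prompt}],
                "completion": completion,
            },
        )
        resp.raise_for_status()
        return resp.json()["logprob_sum"]


if __name__ == "__main__":
    total = asyncio.run(score_completion("http://localhost:8000", "Say hi.", "Hello there!"))
    print(f"logprob_sum = {total:.3f}")
```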
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant EvalRunner as Eval Runner
    participant Proxy
    participant Cache as Proxy Cache
    participant vLLM

    rect rgba(100, 150, 200, 0.5)
        Note over Client,vLLM: Proxy Mode (proxy_url configured)
        Client->>EvalRunner: run_preference_experiment(config with proxy_url)
        EvalRunner->>Proxy: POST /v1/completions (prompt)
        Proxy->>vLLM: Generate completion
        vLLM-->>Proxy: completion_text
        Proxy->>Cache: Store completion & logprobs
        Proxy-->>EvalRunner: response
        EvalRunner->>Cache: fetch_rollout_from_proxy_cache(response)
        Cache-->>EvalRunner: completion, logprobs
        EvalRunner->>EvalRunner: measure_logprob_margin(proxy_url=...)
        EvalRunner->>Proxy: POST /v1/score (prompt+completion)
        Proxy-->>EvalRunner: logprobs, logprob_sum
        EvalRunner->>EvalRunner: Collect sample (prompt, response, logprobs, feedback)
        EvalRunner->>EvalRunner: Batch & submit_feedback (multiple samples)
    end

    rect rgba(200, 150, 100, 0.5)
        Note over Client,vLLM: Direct vLLM Mode (no proxy_url)
        Client->>EvalRunner: run_preference_experiment(config without proxy_url)
        EvalRunner->>vLLM: Generate completion
        vLLM-->>EvalRunner: completion
        EvalRunner->>vLLM: Get per-token logprobs
        vLLM-->>EvalRunner: logprobs
        EvalRunner->>EvalRunner: measure_logprob_margin(proxy_url=None)
        EvalRunner->>EvalRunner: Collect sample
        EvalRunner->>EvalRunner: Batch & submit_feedback (multiple samples)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (warnings)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@scripts/eval/logprob.py`:
- Around line 121-134: The proxy call in fetch_response_logprob_sum_via_proxy
currently posts prompt and completion separately which causes tokenization
boundary mismatches versus joint encoding; instead send a single concatenated
string so the proxy encodes the full prompt+completion as one unit (or update
the proxy to provide a joint-encoding mode) — update the JSON payload in
fetch_response_logprob_sum_via_proxy (or adjust the /v1/score handler) to ensure
the tokenizer sees chatml_prefix + response_text together, so returned
"logprob_sum" uses the same tokenization as the vLLM path.
In `@scripts/eval/runner.py`:
- Around line 256-311: The code in _fetch_rollout_logprobs_vllm builds a
hardcoded ChatML string (chatml_prefix using <|im_start|> / <|im_end|> and
constants _ASSISTANT_PREFIX/_IM_END) which will mis-tokenize non-ChatML base
models; change the implementation to either (1) call the vLLM /tokenize and
/v1/completions endpoints using the messages structure (pass the messages list
instead of a ChatML string) if the server supports it, or (2) detect the base
model/template and construct the prefix using the correct template markers (or
expose a template strategy) before calling tokenize and completions; update the
code paths that build chatml_prefix and the /tokenize call in
_fetch_rollout_logprobs_vllm (and the constants _ASSISTANT_PREFIX/_IM_END) to
use the chosen approach so token counts and prompt_logprobs are correct for
non-ChatML models.
- Around line 193-220: The code in _fetch_rollout_from_proxy_cache reads
data["logprobs"] which can be None (per _CompletionCacheEntry), causing
downstream failures; update the function to guard against None by coalescing
data["logprobs"] to an empty list (or validating and raising a clear error)
before assigning to logprobs, e.g., replace the direct assignment to ensure
logprobs is a list[float] (and keep the function return signature unchanged),
and add a brief comment referencing the proxy cache nullable contract so future
readers see why the default is used.
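A minimal sketch of the coalescing guard described in the last item above. The dict shape mirrors the proxy cache contract from the review; the helper name is hypothetical and not code from the PR.

```python
# Hypothetical helper showing the None-coalescing guard: the proxy cache entry
# declares logprobs as Optional[list[float]], so default it to an empty list
# before handing the rollout to the training engine.
def extract_rollout(data: dict) -> tuple[str, str, list[float]]:
    logprobs: list[float] = data.get("logprobs") or []
    return data["prompt"], data["response"], logprobs
```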
🧹 Nitpick comments (1)
scripts/eval/runner.py (1)
671-686: Single fixed-delay retry may be insufficient for transient failures.

The retry logic uses a single 5-second sleep before retrying. For production-grade resilience, consider exponential backoff or at least a configurable retry count. However, for an eval harness this is pragmatic enough.
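If backoff ever becomes worth the complexity, a generic helper along these lines would do. This is a sketch, not code from the PR; the retry count and delays are illustrative.

```python
# Capped exponential backoff around an async operation, as an alternative to a
# single fixed 5-second sleep. Defaults here are illustrative.
import asyncio


async def with_retries(make_call, retries: int = 3, base_delay: float = 1.0):
    """Run make_call() up to retries + 1 times, doubling the delay between attempts."""
    for attempt in range(retries + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == retries:
                raise
            await asyncio.sleep(min(base_delay * 2**attempt, 30.0))
```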
The cached response text and logprobs come from the same seq.tokens / seq.logprobs — they are already aligned. Stripping wrapper tokens from the text without stripping corresponding logprobs would cause a length mismatch in the training engine. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c29a9d0d2
…n tests

Centralize ChatML construction in messages_to_chatml() (teacher_helpers) and delegate to tokenizer.apply_chat_template() via the proxy's /v1/score endpoint. This eliminates 4 independent <|im_start|>/<|im_end|> construction sites that would break for non-ChatML models.

- Add optional `messages` field to /v1/score ScoreRequest with model_validator
- Replace LogprobPair.prompt_chatml with prompt_messages (list[ChatMessage])
- Update logprob.py to pass messages instead of pre-built ChatML strings
- Replace manual ChatML loop in runner.py with messages_to_chatml()
- Add \n separator to messages_to_chatml() for canonical ChatML format
- Remove template stripping from integration test _fetch_raw_completion()
- Add tests for /v1/score with messages and validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
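For context, ChatML wraps each turn in <|im_start|>/<|im_end|> markers. A rough sketch of what a messages_to_chatml() helper like the one named in this commit might look like; the repository's actual signature and separator handling may differ.

```python
# Hedged sketch of a messages -> ChatML conversion. The helper name mirrors the
# commit above; the real implementation in teacher_helpers may differ.
def messages_to_chatml(messages: list[dict[str, str]], add_generation_prefix: bool = True) -> str:
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    prompt = "\n".join(parts)
    if add_generation_prefix:
        # Leave the assistant turn open so the model continues from here.
        prompt += "\n<|im_start|>assistant\n"
    return prompt
```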
The Tinker SDK's compute_logprobs() returns a Future, not the logprobs directly. Add .result() to match the pattern used in _sample_async(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ashboard

- Move all 12 eval files from scripts/eval/ to claas/eval/
- Extract add_eval_arguments() and build_config() for reuse
- Add `claas eval` CLI subcommand in claas/cli.py
- Add /v1/eval endpoint serving HTML dashboard from eval results
- New claas/eval/dashboard.py reads summary.json + steps.jsonl and renders summary table with per-preference expandable step details
- New claas/eval_dashboard.html template (same style as index.html)
- Add claas/eval/README.md with usage, metrics, and output format docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap apply_chat_template and encode results in list() so the prompt_tokens variable is always list[int], resolving invalid-assignment and invalid-argument-type errors from ty. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The return type of apply_chat_template is a complex union that ty cannot narrow through isinstance checks. Annotating the result as Any resolves invalid-assignment and invalid-argument-type errors on both Python 3.11 (CI) and 3.14 (local). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
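A rough sketch of the normalization these commits describe: depending on the tokenizer and arguments, apply_chat_template can return a flat list of ids, a nested list, or a BatchEncoding-like mapping with "input_ids". The helper name is modeled on _apply_chat_template_ids from a later commit; the repository's implementation may differ.

```python
# Hedged sketch: coerce apply_chat_template output into a plain list[int].
from typing import Any


def chat_template_ids(tokenizer: Any, messages: list[dict[str, str]]) -> list[int]:
    result: Any = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    if isinstance(result, dict):  # BatchEncoding / dict form
        result = result["input_ids"]
    if result and isinstance(result[0], list):  # batched form -> take the first row
        result = result[0]
    return [int(t) for t in result]
```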
…plots

- Remove CriteriaResult/PreferenceSummary types and all pass/fail/marginal verdict logic from runner and dashboard — raw metrics only
- Dashboard now discovers all run subdirectories under results_dir and renders each as a collapsible section (newest-first)
- Embed plot PNGs inline as base64 <img> tags in a responsive grid
- Remove threshold lines/bands from all plot functions
- Update HTML template: drop verdict CSS, add run-section and plot-grid styles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3fc822f208
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
claas/eval/collapse.py (1)

205-212: ⚠️ Potential issue | 🟡 Minor — Treat missing logprobs as unavailable entropy.

If the backend omits logprobs, returning 0.0 entropy can trigger false collapse signals. Consider returning None for entropy in this case to align with the optional type.

🔧 Safer fallback

```diff
 logprobs_content = choice.get("logprobs", {}).get("content", [])
 if not logprobs_content:
-    return (0.0, 0.0)
+    return (None, 0.0)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/collapse.py` around lines 205 - 212, The code treats missing logprobs as zero entropy by returning (0.0, 0.0); instead return None for unavailable entropy so callers can distinguish "unavailable" from a numeric entropy. In the block that inspects choice = resp.json()["choices"][0] and logprobs_content = choice.get("logprobs", {}).get("content", []), change the early return for missing logprobs_content to return (None, None) (and update any related type hints or callers to accept Optional[float]) so entropy is represented as unavailable rather than zero.

claas/eval/__main__.py (1)

160-176: ⚠️ Potential issue | 🟡 Minor — Guard against batch_size <= 0.

The CLI accepts any integer; a value of 0 (or negative) will skip sampling and still record steps, producing misleading results. Add a simple validation in build_config (or in argparse) to enforce >= 1.

✅ Minimal validation

```diff
 def build_config(args: argparse.Namespace) -> HarnessConfig:
     """Build a HarnessConfig from parsed CLI arguments."""
     # Parse --metrics comma string
     metrics_list = [m.strip() for m in args.metrics.split(",") if m.strip()]
 @@
     if args.collapse_steps is not None:
         collapse_steps = {int(s.strip()) for s in args.collapse_steps.split(",") if s.strip()}
+
+    if args.batch_size < 1:
+        raise ValueError("--batch-size must be >= 1")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/__main__.py` around lines 160 - 176, Add validation to ensure args.batch_size is >= 1 inside build_config: check args.batch_size (from the CLI) and if it's less than 1 raise a clear exception (e.g., ValueError with message "batch_size must be >= 1") or otherwise terminate, so sampling can't be skipped; modify build_config to perform this guard before using args.batch_size to construct the HarnessConfig and reference the args.batch_size symbol and the build_config function in the change.
🧹 Nitpick comments (4)
claas/cli.py (1)

201-206: Consider deferring the import to avoid module load overhead.

The from .eval.__main__ import add_eval_arguments runs at module parse time (when setting up subparsers), which loads the eval module even when other commands are invoked. This is inconsistent with the lazy import pattern used in cmd_eval (lines 88-89).

♻️ Defer import to match lazy loading pattern

```diff
 # eval command
 eval_parser = subparsers.add_parser("eval", help="Run the evaluation harness")
-from .eval.__main__ import add_eval_arguments
-
-add_eval_arguments(eval_parser)
+# Arguments are added lazily when the eval command is actually invoked
+# to avoid loading the eval module for other commands
 eval_parser.set_defaults(func=cmd_eval)
```

Then in cmd_eval, add the arguments dynamically:

```python
def cmd_eval(args: argparse.Namespace) -> int:
    """Run the evaluation harness."""
    from .eval.__main__ import add_eval_arguments, build_config
    from .eval.runner import run_harness

    # Re-parse with eval-specific arguments if needed
    config = build_config(args)
    asyncio.run(run_harness(config))
    return 0
```

However, this would require architectural changes to argparse setup. The current approach is acceptable if the eval module import is lightweight.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/cli.py` around lines 201 - 206, The import of add_eval_arguments runs at module-import time and should be deferred to avoid loading the eval module for other commands; remove the top-level "from .eval.__main__ import add_eval_arguments" and instead import add_eval_arguments lazily inside cmd_eval (the command handler) or right before calling add_eval_arguments when constructing eval_parser (preferred: move the add_eval_arguments call into cmd_eval so eval-specific args are added/parsed only when cmd_eval runs); update references to add_eval_arguments and ensure eval_parser.set_defaults(func=cmd_eval) remains so argparse still dispatches to the lazily-importing cmd_eval.

claas/eval/README.md (2)

26-28: Add language specifier to code block.

Static analysis flagged this code block as missing a language specifier.

📝 Proposed fix

````diff
-```
+```text
 GET /v1/eval?results_dir=./eval_results
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/README.md` around lines 26 - 28, The code block showing "GET /v1/eval?results_dir=./eval_results" lacks a language specifier; update the triple-backtick delimiter for that block to include a language (e.g., change ``` to ```text) so the fenced code block becomes ```text ...; locate the README section containing the GET /v1/eval?results_dir=./eval_results snippet and modify the opening fence accordingly.

61-68: Add language specifier to directory structure block.

Static analysis flagged this code block as missing a language specifier.

📝 Proposed fix

````diff
-```
+```text
 eval_results/
 ├── summary.json           # Per-preference pass/fail verdicts
 └── <preference>/
     ├── metadata.json      # Run config + LoRA ID
     ├── baseline.json      # Pre-training metric snapshot
     └── steps.jsonl        # One JSON object per feedback step
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/README.md` around lines 61 - 68, Update the fenced code block in README.md under the eval_results directory structure to include a language specifier (use "text") so the block is properly highlighted; locate the block that currently starts with ``` and replace it with ```text to produce ```text\neval_results/... ensuring the rest of the block content (summary.json, <preference>/metadata.json, baseline.json, steps.jsonl) remains unchanged.

claas/proxy/tinker_inference_proxy.py (1)

494-496: Consider logging when logprobs are None.

Replacing None logprobs with 0.0 silently masks potential issues. A logprob of 0.0 means probability 1.0, which is likely incorrect if the actual value was None due to an error or edge case.

♻️ Proposed improvement

```diff
 raw = logprobs_full[prompt_len : prompt_len + completion_len]
-logprobs = [lp if lp is not None else 0.0 for lp in raw]
+if any(lp is None for lp in raw):
+    logger.warning(
+        "Some completion logprobs were None (got %d/%d), defaulting to 0.0",
+        sum(1 for lp in raw if lp is None),
+        len(raw),
+    )
+logprobs = [lp if lp is not None else 0.0 for lp in raw]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/proxy/tinker_inference_proxy.py` around lines 494 - 496, The current slice that builds logprobs (using logprobs_full[prompt_len : prompt_len + completion_len] into raw and mapping None to 0.0 into logprobs) silently masks missing values; update the logic in the function containing this block (the code around variables logprobs_full, prompt_len, completion_len, raw, logprobs) to detect when any lp is None, emit a warning/error-level log including contextual info (index or token position, prompt_len/completion_len, maybe a sample of logprobs_full), and only then decide on a fallback value or propagate the None; ensure the logger call includes enough context to debug the source of None values rather than silently substituting 0.0.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@claas/api.py`:
- Around line 844-857: The eval_dashboard endpoint takes results_dir from the
query and passes it to eval_dashboard_html allowing arbitrary filesystem reads;
fix by restricting and validating the incoming path: define a known base
directory (e.g., DEFAULT_RESULTS_BASE) and in eval_dashboard resolve the
requested path with pathlib.Path(results_dir).resolve(), join/resolve against
the base (base_dir.joinpath(requested).resolve()) and reject the request (raise
HTTPException 400/403) if the resolved path is not a child of the base_dir or if
results_dir is absolute or contains upward traversal; only pass the
validated/resolved child path into eval_dashboard_html and log the rejection
reasons. (See the sketch after this prompt block.)

In `@claas/eval/dashboard.py`:
- Around line 33-53: The current loaders _load_json and _load_steps will raise
on malformed JSON and crash the dashboard; update them to catch JSONDecodeError
(and optionally ValueError/UnicodeDecodeError) and handle failures gracefully:
in _load_json (function _load_json) wrap json.load in try/except and return {}
while logging or ignoring the error if decoding fails, and in _load_steps
(function _load_steps) wrap json.loads per-line in try/except and skip lines
that fail to decode (do not append) while optionally logging a warning; keep
using step_result_from_dict when json.loads succeeds so valid lines still
produce StepResult objects.

In `@claas/eval/runner.py`:
- Around line 188-214: The cache lookup in _fetch_cached_completion currently
keys only on SHA-256(strip_thinking(visible_content)) which can return a cached
{prompt,response,logprobs} from a different generation; change the logic to
include the generation prompt (or its tokenized hash or request_id) in the cache
key or, if you cannot change the proxy key, fetch the cached entry and validate
that data["prompt"] equals the expected prompt/context before returning; if the
prompt does not match, fallback to fetching fresh scores (e.g., call /v1/score
or re-run scoring) and do not return the mismatched entry.

In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 470-478: The code assumes tokenizer.apply_chat_template returns
either a list or a dict and narrows types unsafely; fix by normalizing the
result: call tokenizer.apply_chat_template into result, then if
isinstance(result, list) set prompt_tokens = result, else if it's a dict extract
prompt_tokens = result.get("input_ids") (and validate it's not None), otherwise
raise/handle error; reference symbols: req.messages,
tokenizer.apply_chat_template, result, prompt_tokens.
Outside diff comments:
In `@claas/eval/__main__.py`:
- Around line 160-176: Add validation to ensure args.batch_size is >= 1 inside
build_config: check args.batch_size (from the CLI) and if it's less than 1 raise
a clear exception (e.g., ValueError with message "batch_size must be >= 1") or
otherwise terminate, so sampling can't be skipped; modify build_config to
perform this guard before using args.batch_size to construct the HarnessConfig
and reference the args.batch_size symbol and the build_config function in the
change.

In `@claas/eval/collapse.py`:
- Around line 205-212: The code treats missing logprobs as zero entropy by
returning (0.0, 0.0); instead return None for unavailable entropy so callers can
distinguish "unavailable" from a numeric entropy. In the block that inspects
choice = resp.json()["choices"][0] and logprobs_content = choice.get("logprobs",
{}).get("content", []), change the early return for missing logprobs_content to
return (None, None) (and update any related type hints or callers to accept
Optional[float]) so entropy is represented as unavailable rather than zero.
Nitpick comments:
In `@claas/cli.py`:
- Around line 201-206: The import of add_eval_arguments runs at module-import
time and should be deferred to avoid loading the eval module for other commands;
remove the top-level "from .eval.__main__ import add_eval_arguments" and instead
import add_eval_arguments lazily inside cmd_eval (the command handler) or right
before calling add_eval_arguments when constructing eval_parser (preferred: move
the add_eval_arguments call into cmd_eval so eval-specific args are added/parsed
only when cmd_eval runs); update references to add_eval_arguments and ensure
eval_parser.set_defaults(func=cmd_eval) remains so argparse still dispatches to
the lazily-importing cmd_eval.

In `@claas/eval/README.md`:
- Around line 26-28: The code block showing "GET /v1/eval?results_dir=./eval_results" lacks a language specifier; update the triple-backtick delimiter for that block to include a language (e.g., change ``` to ```text) so the fenced code block becomes ```text ...; locate the README section containing the GET /v1/eval?results_dir=./eval_results snippet and modify the opening fence accordingly.
- Around line 61-68: Update the fenced code block in README.md under the eval_results directory structure to include a language specifier (use "text") so the block is properly highlighted; locate the block that currently starts with ``` and replace it with ```text to produce ```text\neval_results/... ensuring the rest of the block content (summary.json, <preference>/metadata.json, baseline.json, steps.jsonl) remains unchanged.

In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 494-496: The current slice that builds logprobs (using
logprobs_full[prompt_len : prompt_len + completion_len] into raw and mapping
None to 0.0 into logprobs) silently masks missing values; update the logic in
the function containing this block (the code around variables logprobs_full,
prompt_len, completion_len, raw, logprobs) to detect when any lp is None, emit a
warning/error-level log including contextual info (index or token position,
prompt_len/completion_len, maybe a sample of logprobs_full), and only then
decide on a fallback value or propagate the None; ensure the logger call
includes enough context to debug the source of None values rather than silently
substituting 0.0.
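The path-traversal items above ask for results_dir to be resolved against a fixed base directory before it reaches eval_dashboard_html. A minimal sketch of that check, assuming a FastAPI handler and an illustrative EVAL_RESULTS_BASE constant (both names are hypothetical, not the repository's):

```python
# Hypothetical sketch of the results_dir validation requested above; the real
# eval_dashboard endpoint in claas/api.py may be structured differently.
from pathlib import Path

from fastapi import HTTPException

EVAL_RESULTS_BASE = Path("./eval_results").resolve()


def validate_results_dir(results_dir: str) -> Path:
    """Resolve results_dir and reject anything outside EVAL_RESULTS_BASE."""
    requested = (EVAL_RESULTS_BASE / results_dir).resolve()
    try:
        requested.relative_to(EVAL_RESULTS_BASE)
    except ValueError:
        raise HTTPException(status_code=403, detail="results_dir outside allowed base")
    return requested
```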
```python
async def _fetch_cached_completion(
    proxy_url: str,
    visible_content: str,
) -> tuple[str, str, list[float]]:
    """Fetch the cached raw completion from the proxy by content hash.

    The proxy caches ``{prompt, response, logprobs}`` keyed by
    ``SHA-256(stripped_content)``. Since the proxy now strips thinking
    before returning content, ``visible_content`` already matches the
    cache key. We apply ``strip_thinking`` defensively in case the caller
    passes un-stripped text.

    Returns ``(real_prompt, raw_response, rollout_logprobs)``.
    """
    content_hash = hashlib.sha256(
        strip_thinking(visible_content).encode("utf-8"),
    ).hexdigest()
    async with httpx.AsyncClient(base_url=proxy_url, timeout=60.0) as client:
        resp = await client.get(
            "/v1/completions/raw",
            params={"content_hash": content_hash},
        )
        resp.raise_for_status()

    data = resp.json()
    return data["prompt"], data["response"], data["logprobs"]
```
🧩 Analysis chain
🏁 Scripts executed against kfallah/CLaaS (outputs omitted):

```bash
#!/bin/bash
# Search for the tinker_inference_proxy file
find . -type f -name "*tinker_inference_proxy*" -o -name "*proxy*" | grep -E "\.(py|js|ts)$" | head -20

# Also search for /v1/completions/raw endpoint definition
rg -r "completions/raw" --type py -l

# Search for cache-related code and content_hash usage
rg -n "content_hash|cache" --type py | head -50

cat -n claas/proxy/tinker_inference_proxy.py | head -100

# Find the /v1/completions/raw endpoint handler
rg -n "completions/raw" claas/proxy/tinker_inference_proxy.py -A 20

# Search for cache key and content_hash logic
rg -n "content_hash|cache\[" claas/proxy/tinker_inference_proxy.py

# Find where cache entries are stored (put method calls)
rg -n "_completion_cache.put\|\.put\(" claas/proxy/tinker_inference_proxy.py -B 5 -A 2

# Look at the _CompletionCacheEntry definition
rg -n "class _CompletionCacheEntry" claas/proxy/tinker_inference_proxy.py -A 10

# Check how runner.py uses the returned cached data
rg -n "await _fetch_cached_completion" claas/eval/runner.py -B 3 -A 10

# Search for cache.put or _completion_cache.put with broader context
rg -n "put\|_completion_cache\[" claas/proxy/tinker_inference_proxy.py -B 3 -A 3

# Look for the full context around line 339-341 where cache is populated
sed -n '330,360p' claas/proxy/tinker_inference_proxy.py

# Check what _generate_response returns and how it flows to cache
rg -n "_generate_response\|_completion_cache" claas/proxy/tinker_inference_proxy.py | head -30

# Check the full _fetch_cached_completion function and surrounding context
sed -n '188,240p' claas/eval/runner.py

# Check if there's any validation of real_prompt vs actual prompt used
rg -n "real_prompt" claas/eval/runner.py -B 2 -A 2

# Check test cases to understand the expected behavior
cat -n tests/test_tinker_proxy.py | head -150

# Check if there's any verification that returned prompt matches what was expected
rg -n "prompt.*=.*_generate_response\|prompt.*expected\|assert.*prompt" claas/eval/runner.py -B 2 -A 2

# Check if the scenario of different prompts generating same content is possible
# Look at how OpenClaw templating works
rg -n "openclaw\|OpenClaw" claas/eval/runner.py | head -20

# Let's look at the actual flow - see _generate_response and how content flows
sed -n '150,240p' claas/eval/runner.py

# Understand if two different prompts could generate identical visible_content
# Check strip_thinking and visible content handling
rg -n "strip_thinking|visible_content" claas/eval/runner.py -B 2 -A 2 | head -50

# Check the scenario more carefully - can two different OpenClaw-templated prompts
# realistically produce identical visible content?
rg -n "OpenClaw\|_generate_response\|visible" claas/eval/runner.py | head -40

# Check if there's any request ID or correlation between generation and cache retrieval
rg -n "completion_id|request_id|uuid" claas/proxy/tinker_inference_proxy.py | head -20

# Check the flow: when a request comes in, is the cache entry guaranteed to be
# for the same request that's retrieving it? Look at the POST handler
sed -n '250,350p' claas/proxy/tinker_inference_proxy.py

# Final check: see if there are any concurrent access patterns that could cause
# mismatches - e.g., multiple requests processing simultaneously
rg -n "threading|async|concurrent" claas/proxy/tinker_inference_proxy.py | head -15

# Verify the cache behavior when content_hash collides: look at _CompletionCache.put again
sed -n '200,230p' claas/proxy/tinker_inference_proxy.py

# Check if there's any deduplication or collision detection in cache
rg -n "class _CompletionCache" claas/proxy/tinker_inference_proxy.py -A 40

# Verify if there's any mitigation - check if real_prompt is validated after retrieval
sed -n '560,590p' claas/eval/runner.py
```
Cache key based solely on visible content can mix prompts.
The proxy cache lookup is keyed only on the SHA-256 hash of stripped response text. If two different prompts yield identical visible content, the cached {prompt, response, logprobs} entry will be shared or overwritten, causing the retrieved real_prompt to mismatch the actual generation request. This corrupts rollout logprobs and training signals.
Extend the cache key to include the tokenized prompt hash (or request ID), or validate that the cached entry's prompt matches the generation context before using it. Alternatively, add a fallback to /v1/score when the cached prompt is inconsistent.
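One lightweight safeguard in the direction the comment suggests, as a hypothetical sketch rather than code from the PR: compare the cached prompt with the generation context before trusting the entry, and fall back to fresh scoring on a mismatch.

```python
# Hypothetical guard for the collision concern above. `expected_prompt` and the
# helper name are illustrative; the PR's _fetch_cached_completion differs.
def validate_cached_entry(data: dict, expected_prompt: str) -> tuple[str, str, list[float]]:
    if data.get("prompt") != expected_prompt:
        # Entry came from a different generation; caller should rescore via /v1/score.
        raise LookupError("cached prompt does not match the generation context")
    return data["prompt"], data["response"], data["logprobs"]
```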
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@claas/eval/runner.py` around lines 188 - 214, The cache lookup in
_fetch_cached_completion currently keys only on
SHA-256(strip_thinking(visible_content)) which can return a cached
{prompt,response,logprobs} from a different generation; change the logic to
include the generation prompt (or its tokenized hash or request_id) in the cache
key or, if you cannot change the proxy key, fetch the cached entry and validate
that data["prompt"] equals the expected prompt/context before returning; if the
prompt does not match, fallback to fetching fresh scores (e.g., call /v1/score
or re-run scoring) and do not return the mismatched entry.
The cache is designed for single-request lookups immediately after generation — the eval runner generates a response, then immediately retrieves the cached entry by content hash. In practice, different prompts producing byte-identical visible responses within the cache's TTL window is extremely unlikely. Adding prompt hashing would complicate the cache key and require the caller to know the tokenized prompt at lookup time, which defeats the purpose of the content-addressed design.
Comment generated by Claude Code
@kfallah, thank you for the clarification. Your explanation makes sense — given the sequential generate-then-immediately-retrieve pattern in _generate_and_collect, the risk of a cache collision within that tight window is indeed negligible. The content-addressed design keeps the lookup simple and avoids the need to thread prompt context through the cache layer.
I'll consider this resolved. If collision edge cases ever become a concern in production, a lightweight validation (e.g., logging a warning if data["prompt"] doesn't match expectations) could be added as a safeguard, but it's not necessary given the current usage pattern.
Actionable comments posted: 4
🧹 Nitpick comments (2)
claas/eval_dashboard.html (1)

113-121: Expose expand/collapse state to assistive tech.

Consider updating aria-expanded (and aria-controls) when toggling details for better accessibility.

♿ Suggested tweak

```diff
 function toggleDetails(rowId, button) {
     const detailRow = document.getElementById(rowId);
+    button.setAttribute('aria-controls', rowId);
     if (detailRow.style.display === 'table-row') {
         detailRow.style.display = 'none';
         button.textContent = 'Expand';
+        button.setAttribute('aria-expanded', 'false');
         return;
     }
     detailRow.style.display = 'table-row';
     button.textContent = 'Collapse';
+    button.setAttribute('aria-expanded', 'true');
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval_dashboard.html` around lines 113 - 121, The toggleDetails function should update ARIA attributes so assistive technology knows the expanded state: when you get the detail row by id and toggle its display, also set the triggering button's aria-expanded to "true" or "false" accordingly and ensure the button has aria-controls set to the detail row id (add aria-controls if missing); update the button.textContent as before and keep aria state in sync with the visible state in the toggleDetails function.

claas/eval/__main__.py (1)

159-164: Consider validating --batch-size to prevent invalid values.

The --batch-size argument accepts any integer, including 0 or negative values, which would cause issues in the runner loop. Consider adding validation.

✨ Add validation

```diff
 parser.add_argument(
     "--batch-size",
     type=int,
     default=1,
+    choices=range(1, 17),
+    metavar="N",
     help="Samples per feedback step (default: 1 = progressive, 4 = batched)",
 )
```

Or validate in build_config:

```python
if args.batch_size < 1:
    raise ValueError("--batch-size must be >= 1")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/__main__.py` around lines 159 - 164, Validate the --batch-size argument to ensure it's >= 1 to prevent zero/negative values causing failures; either add an argparse type/check by updating the parser.add_argument("--batch-size", ...) to include a custom validator or after parsing, assert/check args.batch_size and raise ValueError if args.batch_size < 1 (or normalize to 1). Locate the batch-size handling near parser.add_argument and/or inside build_config (refer to args.batch_size and the build_config function) and enforce the check with a clear error message like "--batch-size must be >= 1".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@claas/eval/dashboard.py`:
- Around line 106-131: The _embed_plots function currently opens and
base64-encodes each PNG and will crash the dashboard if a file is unreadable;
wrap the per-file read/encode/append logic (the block that opens fpath, reads,
base64.b64encode, and images.append) in a try/except that catches
OSError/IOError and ValueError (and optionally Exception) to skip problematic
files and continue; on exception, emit a warning (use the module logger or
print) including the fname/fpath and the exception, and do not add that image to
images so img_tags generation remains safe.
In `@claas/eval/README.md`:
- Around line 26-28: Summary: The README fenced code blocks lack language
identifiers; add them to fix MD040 and improve rendering. Edit
claas/eval/README.md: change the block containing "GET
/v1/eval?results_dir=./eval_results" to start with ```http and change the
directory listing block beginning with "eval_results/" to start with ```text;
apply the same change for the other occurrence referenced (lines ~61-68) so all
relevant triple-backtick blocks have appropriate language identifiers.
In `@claas/eval/runner.py`:
- Around line 610-611: The fallback literal "I'd be happy to help you with
that." can contaminate metrics when all generations in a batch fail; change the
logic around the response_text fallback so that if response_text is None and
samples is empty you log a warning and either skip calling _measure_eval_metrics
for that step or pass None/empty to it (avoid the dummy string). Update the
block that sets response_text and the subsequent call site of
_measure_eval_metrics (and any callers that rely on response_text) to handle a
generation-failure case without injecting the placeholder.
In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 470-472: The incoming req.messages can contain
ChatCompletionMessage.content that is not a string, which will break
tokenizer.apply_chat_template; in the block building dicts (where dicts =
[{"role": m.role, "content": m.content or ""} for m in req.messages]) coerce
each m.content to a string (e.g., use str(m.content) or json.dumps for dict/list
payloads) so tokenizer.apply_chat_template always receives a string content;
update the dicts construction in tinker_inference_proxy.py accordingly and
ensure the coercion handles None (fallback to "").
---
Duplicate comments:
In `@claas/api.py`:
- Around line 844-857: The eval_dashboard endpoint passes the untrusted
results_dir directly to eval_dashboard_html, creating a path traversal risk; fix
by validating and normalizing the path before calling eval_dashboard_html:
resolve results_dir using pathlib.Path(...).resolve(), ensure it is a subpath of
a configured allowed base directory (e.g., EVAL_RESULTS_BASE or the app’s data
dir) and reject absolute or upward-traversing inputs (return HTTPException
400/403), and also verify that the expected summary.json exists inside the
resolved directory before calling eval_dashboard_html; update the eval_dashboard
function to perform these checks (use symbols eval_dashboard, results_dir,
eval_dashboard_html, summary.json) and only pass validated paths through.
In `@claas/eval/dashboard.py`:
- Around line 33-53: Wrap the JSON parsing in both _load_json and _load_steps
with try/except that catches json.JSONDecodeError (and optionally OSError) so
malformed files don't raise and crash the dashboard; in _load_json catch errors
around json.load(f), log the error and return {}; in _load_steps catch errors
around json.loads(line) per-line, log the offending line/path and continue (skip
that line) so you return the valid StepResult list created via
step_result_from_dict without raising.
In `@claas/eval/runner.py`:
- Around line 188-214: The cache currently keys only on
SHA-256(stripped_content) in _fetch_cached_completion which can collide when
different prompts produce the same visible content; change the function
signature of _fetch_cached_completion to accept the original prompt, compute a
combined hash (e.g., SHA-256 of prompt + a separator +
strip_thinking(visible_content)) instead of content_hash alone, and send that
combined hash in the request params (replace params={"content_hash":
content_hash} accordingly); update all callers of _fetch_cached_completion to
pass the prompt so the cache key uniquely includes both prompt and visible
content.
In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 65-74: The apply_chat_template return value can be several shapes
(list[int], BatchEncoding/dict with "input_ids", or nested lists); update
build_generation_prompt to normalize result into a flat list[int] before calling
T.ModelInput.from_ints: call self._tokenizer.apply_chat_template as before, then
if result is a list ensure its elements are ints (and not a list of lists), if
result is a mapping or BatchEncoding extract result["input_ids"], validate it's
a list[int], and otherwise raise a TypeError with a clear message; keep the rest
of the function using token_ids and T.ModelInput.from_ints(token_ids). Also
apply the same normalization pattern to the other call site that invokes
_tokenizer.apply_chat_template (the second similar prompt builder).
---
Nitpick comments:
In `@claas/eval_dashboard.html`:
- Around line 113-121: The toggleDetails function should update ARIA attributes
so assistive technology knows the expanded state: when you get the detail row by
id and toggle its display, also set the triggering button's aria-expanded to
"true" or "false" accordingly and ensure the button has aria-controls set to the
detail row id (add aria-controls if missing); update the button.textContent as
before and keep aria state in sync with the visible state in the toggleDetails
function.
In `@claas/eval/__main__.py`:
- Around line 159-164: Validate the --batch-size argument to ensure it's >= 1 to
prevent zero/negative values causing failures; either add an argparse type/check
by updating the parser.add_argument("--batch-size", ...) to include a custom
validator or after parsing, assert/check args.batch_size and raise ValueError if
args.batch_size < 1 (or normalize to 1). Locate the batch-size handling near
parser.add_argument and/or inside build_config (refer to args.batch_size and the
build_config function) and enforce the check with a clear error message like
"--batch-size must be >= 1".
```python
if response_text is None:
    response_text = "I'd be happy to help you with that."
```
Fallback response may lead to training on dummy data.
When all samples in a batch fail to generate, response_text defaults to "I'd be happy to help you with that.". While samples will be empty (preventing feedback submission), the dummy text still flows to _measure_eval_metrics via the response_text parameter, potentially skewing metrics that rely on it.
Consider logging a warning or skipping metrics for steps with generation failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@claas/eval/runner.py` around lines 610 - 611, The fallback literal "I'd be
happy to help you with that." can contaminate metrics when all generations in a
batch fail; change the logic around the response_text fallback so that if
response_text is None and samples is empty you log a warning and either skip
calling _measure_eval_metrics for that step or pass None/empty to it (avoid the
dummy string). Update the block that sets response_text and the subsequent call
site of _measure_eval_metrics (and any callers that rely on response_text) to
handle a generation-failure case without injecting the placeholder.
The fallback only triggers when ALL samples in a batch fail to generate. In that case samples is empty, so no feedback is submitted and the SDPO training path is skipped entirely. The response_text fallback only flows to _measure_eval_metrics for dashboard display of a failed step — it doesn't affect training data.
Comment generated by Claude Code
@kfallah, thank you for the clarification! You're absolutely right—I see now that the if samples: guard at line 627 ensures no feedback is submitted when all generations fail, so the training path is completely protected. The fallback text only serves as a display placeholder for _measure_eval_metrics, which is a reasonable design for handling failed steps in the dashboard.
I appreciate you taking the time to explain the intended behavior!
```text
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃   ((\                         ┃
┃  ( -.-)    Got it!            ┃
┃  o_(")(")  Thanks for         ┃
┃            explaining!        ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
```
Extract _apply_chat_template_ids() helper with Any-typed tokenizer param so ty cannot narrow the complex union return type. Fix score endpoint tests to wrap compute_logprobs return in a mock Future (matching the .result() call in the implementation). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Matches the coercion already applied in /v1/chat/completions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Catch JSONDecodeError in _load_json and _load_steps, and OSError in _embed_plots so that corrupted or partially-written files don't crash the dashboard. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validate that the resolved path is within the base eval_results directory to prevent path traversal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Coalesce data["logprobs"] to [] since _CompletionCacheEntry.logprobs is Optional[list[float]]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
claas/eval/collapse.py (1)
205-213: ⚠️ Potential issue | 🟡 Minor — Avoid treating missing logprobs as zero entropy.

If logprobs_content is empty (e.g., upstream omits top_logprobs), returning (0.0, 0.0) can trigger false collapse alerts. Since entropy is now optional, return None (and warn) when logprobs are unavailable.

🛠️ Suggested fix

```diff
-    if not logprobs_content:
-        return (0.0, 0.0)
+    if not logprobs_content:
+        logger.warning("No logprobs returned; entropy unavailable.")
+        return (None, 0.0)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/collapse.py` around lines 205 - 213, The current branch in collapse.py treats an empty logprobs_content as zero entropy by returning (0.0, 0.0); change this so missing top_logprobs returns None instead to avoid false collapse alerts: inside the block that reads choice = resp.json()["choices"][0] and computes logprobs_content, if logprobs_content is falsy, emit a warning (use the module logger or warnings.warn) mentioning the response_text/choice context and return (None, None) rather than (0.0, 0.0) so callers can detect unavailable entropy.
🧹 Nitpick comments (4)
claas/eval/runner.py (1)
231-231: Consider a more descriptive assertion message.

The bare assert config.proxy_url is not None will produce an unhelpful AssertionError if triggered. A message would aid debugging.

♻️ Suggested improvement

```diff
-    assert config.proxy_url is not None
+    assert config.proxy_url is not None, "_generate_and_collect requires proxy_url"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/runner.py` at line 231, Replace the bare assertion with a descriptive message (or explicit check) so failures are informative: change the line with assert config.proxy_url is not None to include a message like assert config.proxy_url is not None, "config.proxy_url must be set (proxy_url missing from config)" or alternatively perform an explicit check and raise a ValueError/RuntimeError with that message to locate and diagnose the missing proxy_url when functions referencing config.proxy_url (e.g., in runner.py where config.proxy_url is used) fail.

claas/eval/dashboard.py (1)

26-26: Template path may be inconsistent with claas/api.py.

This file defines EVAL_DASHBOARD_TEMPLATE as Path(__file__).resolve().parent.parent / "eval_dashboard.html" (i.e., claas/eval_dashboard.html), while claas/api.py line 81 defines it as Path(__file__).resolve().parent / "eval_dashboard.html" (i.e., also claas/eval_dashboard.html). The paths resolve to the same location, but having two separate constants is redundant.

♻️ Consider importing from a single source

Either import the constant from claas/api.py or define it in a shared location to avoid duplication:

```diff
-EVAL_DASHBOARD_TEMPLATE = Path(__file__).resolve().parent.parent / "eval_dashboard.html"
+from claas.api import EVAL_DASHBOARD_TEMPLATE
```

Or move both template constants to a shared claas/templates.py module.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/dashboard.py` at line 26, The EVAL_DASHBOARD_TEMPLATE constant is duplicated between claas/eval/dashboard.py (EVAL_DASHBOARD_TEMPLATE) and claas/api.py; refactor to a single source of truth by removing the duplicate and importing the constant instead of redefining it, or move the constant to a new shared module (e.g., claas/templates.py) and update both places to import that symbol; update references in functions/classes that use EVAL_DASHBOARD_TEMPLATE to import the shared constant so only one definition exists.

tests/integration/test_tinker_stack_integration.py (1)

21-22: Consider reusing the shared strip_thinking helper instead of duplicating logic.

The _THINK_RE pattern and orphaned </think> stripping logic are duplicated from claas/proxy/tinker_inference_proxy.py and claas/eval/verifiers.py. This creates maintenance risk if the stripping behavior needs to change.

♻️ Suggested refactor

```diff
+from claas.eval.verifiers import strip_thinking
+
-_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
-
 ...
 def _fetch_raw_completion(...):
-    visible = _THINK_RE.sub("", response_content)
-    idx = visible.find("</think>")
-    if idx >= 0:
-        visible = visible[idx + len("</think>"):]
-    visible = visible.strip()
+    visible = strip_thinking(response_content)
     content_hash = hashlib.sha256(visible.encode("utf-8")).hexdigest()
```

Also applies to: 43-48

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_tinker_stack_integration.py` around lines 21 - 22, The test defines a duplicate regex _THINK_RE and manual </think> stripping; instead import and use the shared strip_thinking helper (or equivalent) used by the production modules to avoid drift: replace usages of _THINK_RE and any manual think-tag removal in tests/integration/test_tinker_stack_integration.py with a call to strip_thinking (or the function name used in claas/proxy/tinker_inference_proxy.py / claas/eval/verifiers.py), updating the test imports accordingly and removing the orphaned regex and trim logic so all think-tag behavior is centralized.

claas/eval/collapse.py (1)

262-308: Pass system_prompt and prompt_preamble to Tinker mode or make the silent drop explicit.

_entropy_and_logprob_tinker always uses a single user message, silently ignoring any system_prompt or prompt_preamble passed through measure_entropy_and_mean_logprob. This creates inconsistency: the fallback vLLM path (lines 175–187) properly constructs messages from these parameters, but Tinker mode discards them. Thread a prebuilt messages list through, matching the pattern used in the fallback path, or add an assertion/log warning when these parameters are provided to surface the divergence.

♻️ Proposed refactor to pass messages

```diff
-    if openclaw_url and proxy_url:
+    if openclaw_url and proxy_url:
+        tinker_messages: list[dict[str, str]] = list(prompt_preamble or [])
+        if system_prompt and not any(
+            m.get("role") == "system" and m.get("content") == system_prompt
+            for m in tinker_messages
+        ):
+            tinker_messages.append({"role": "system", "content": system_prompt})
+        tinker_messages.append({"role": "user", "content": prompt})
         return await _entropy_and_logprob_tinker(
             openclaw_url=openclaw_url,
             openclaw_api_key=openclaw_api_key,
             proxy_url=proxy_url,
             prompt=prompt,
             timeout_s=timeout_s,
             rollout_log=rollout_log,
+            messages=tinker_messages,
         )
 @@
 async def _entropy_and_logprob_tinker(
     openclaw_url: str,
     openclaw_api_key: str,
     proxy_url: str,
     prompt: str = COLLAPSE_PROBE,
     timeout_s: float = 60.0,
     rollout_log: list[EvalRollout] | None = None,
+    messages: list[dict[str, str]] | None = None,
 ) -> tuple[None, float]:
 @@
     # Generate through OpenClaw
+    messages = messages or [{"role": "user", "content": prompt}]
     async with httpx.AsyncClient(base_url=openclaw_url, timeout=timeout_s) as client:
         resp = await client.post(
             "/v1/chat/completions",
             json={
                 "model": "openclaw",
-                "messages": [{"role": "user", "content": prompt}],
+                "messages": messages,
                 "temperature": 0,
                 "max_tokens": 2048,
             },
             headers=headers,
         )
 @@
     async with httpx.AsyncClient(base_url=proxy_url, timeout=timeout_s) as client:
         score_resp = await client.post(
             "/v1/score",
             json={
-                "messages": [{"role": "user", "content": prompt}],
+                "messages": messages,
                 "completion": response_text,
             },
         )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@claas/eval/collapse.py` around lines 262 - 308, The tinker path in _entropy_and_logprob_tinker ignores system_prompt and prompt_preamble (unlike the fallback in measure_entropy_and_mean_logprob), so either thread a prebuilt messages list through or surface the mismatch; update _entropy_and_logprob_tinker to accept or build the same messages structure used by measure_entropy_and_mean_logprob (include the system message if system_prompt is set, include a prompt_preamble user message if set, then the main user prompt), use that messages list in the OpenClaw /v1/chat/completions request and in the proxy /v1/score call instead of the single-user message, or if you prefer not to change the signature add a clear warning/assert when system_prompt or prompt_preamble are passed so the silent drop is explicit. Ensure references to functions/parameters: _entropy_and_logprob_tinker, measure_entropy_and_mean_logprob, system_prompt, prompt_preamble, and the /v1/chat/completions and /v1/score request payloads are updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@claas/eval/__main__.py`:
- Around line 159-164: The CLI currently allows non-positive values for
--batch-size which will break the runner; validate args.batch_size >= 1 and
surface a clear error. Add a check after parsing (or in build_config) that
inspects args.batch_size and calls parser.error("--batch-size must be at least
1") or raise a ValueError with that message; locate the validation near
parser.add_argument(...) or inside build_config where args are processed to
ensure the runner never receives a non-positive batch_size.
In `@claas/eval/logprob.py`:
- Around line 138-143: The loop over raw_logprobs[prompt_token_count:] can crash
when an entry is an empty dict because next(iter(entry.values())) raises
StopIteration; update the loop in claas/eval/logprob.py (the loop handling
raw_logprobs[prompt_token_count:]) to guard against empty or falsy entries by
skipping when entry is None or not entry (or when entry.values() is empty)
before calling next(iter(...)), so it safely continues on malformed API
responses and only sums top["logprob"] for valid entries.
In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 63-68: The call to tokenizer.apply_chat_template may return
unexpected types causing KeyError/TypeError when accessing result["input_ids"];
update the logic around tokenizer.apply_chat_template (in
tinker_inference_proxy) to defensively handle results by: 1) wrapping the call
in a try/except to catch exceptions from the tokenizer, 2) if result is a list
return it, 3) if result is a dict-like object check for the "input_ids" key and
return it, and 4) for any other type or missing key raise a clear ValueError (or
log and return a safe default) describing the unexpected return type and include
the actual result for debugging. Ensure references to
tokenizer.apply_chat_template and the result variable are used in the error
message for easier tracing.
---
Outside diff comments:
In `@claas/eval/collapse.py`:
- Around line 205-213: The current branch in collapse.py treats an empty
logprobs_content as zero entropy by returning (0.0, 0.0); change this so missing
top_logprobs returns None instead to avoid false collapse alerts: inside the
block that reads choice = resp.json()["choices"][0] and computes
logprobs_content, if logprobs_content is falsy, emit a warning (use the module
logger or warnings.warn) mentioning the response_text/choice context and return
(None, None) rather than (0.0, 0.0) so callers can detect unavailable entropy.
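A sketch of the suggested behavior, assuming an OpenAI-style `choices[0]` payload; the function name and exact response shape are assumptions, not the module's actual code:

```python
import logging
import math

logger = logging.getLogger(__name__)


def entropy_and_mean_logprob(choice: dict) -> tuple[float | None, float | None]:
    """Return (mean token entropy, mean logprob), or (None, None) when logprobs are missing."""
    logprobs_content = (choice.get("logprobs") or {}).get("content") or []
    if not logprobs_content:
        logger.warning("Response carried no top_logprobs; entropy unavailable")
        return None, None
    entropies, logprobs = [], []
    for token in logprobs_content:
        logprobs.append(token["logprob"])
        probs = [math.exp(t["logprob"]) for t in token.get("top_logprobs") or []]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies), sum(logprobs) / len(logprobs)
```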
---
Duplicate comments:
In `@claas/api.py`:
- Around line 844-857: The eval_dashboard handler accepts a user-controlled
results_dir which can enable path traversal; before calling eval_dashboard_html,
resolve and validate the requested path against an application base directory
(e.g., a configured EVAL_BASE_DIR) using pathlib.Path.resolve(), disallow
absolute paths or any path whose resolved form is not a subpath of the base
(check by comparing parents or using Path.relative_to in a try/except), and
return an error (400/403) if validation fails; update the async def
eval_dashboard(...) function to perform this sanitization on results_dir and
pass only the validated path to eval_dashboard_html.
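One way to sanitize the parameter, sketched with a hypothetical `EVAL_BASE_DIR`; the FastAPI error type is assumed from the project's API framework:

```python
from pathlib import Path

from fastapi import HTTPException

EVAL_BASE_DIR = Path("./data/evals").resolve()  # assumed base directory; adjust to the app config


def validate_results_dir(results_dir: str) -> Path:
    """Resolve the requested directory and reject anything outside EVAL_BASE_DIR."""
    candidate = (EVAL_BASE_DIR / results_dir).resolve()
    try:
        candidate.relative_to(EVAL_BASE_DIR)
    except ValueError:
        raise HTTPException(status_code=400, detail="results_dir must stay under the eval base directory")
    return candidate
```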
In `@claas/eval/dashboard.py`:
- Around line 33-53: Wrap the JSON parsing in _load_json and _load_steps with
try/except to catch json.JSONDecodeError and handle it gracefully: in _load_json
(function _load_json) surround json.load(f) with a try/except
json.JSONDecodeError and return {} (and optionally log a warning) instead of
letting the exception propagate; in _load_steps (function _load_steps) surround
json.loads(line) / step_result_from_dict(...) with try/except
json.JSONDecodeError (and/or ValueError from step_result_from_dict) to skip
malformed lines (log a warning including the filename/run_dir and the offending
line) and continue processing the rest of the file so a single bad record
doesn't cause a 500. Ensure logging uses the module logger (e.g.,
logging.getLogger(__name__)) if not already present.
- Around line 106-131: The _embed_plots function currently opens and encodes PNG
files without handling read errors which can crash the dashboard; wrap the file
open/read/base64.b64encode section (the block that reads fpath and builds b64
and appends to images) in a try/except that catches IO and decoding exceptions,
log or warn the failure (using the module logger or logging.warning) with the
filename and exception, and continue so corrupted/missing files are skipped;
keep the existing with open(...) context and ensure only successfully encoded
files are appended to images so img_tags generation remains safe.
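A sketch of the tolerant plot embedding; the function name matches the comment, while the glob pattern and HTML shape are assumptions:

```python
import base64
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def _embed_plots(run_dir: Path) -> list[str]:
    """Base64-embed PNG plots under run_dir, skipping any file that cannot be read."""
    images: list[str] = []
    for fpath in sorted(run_dir.glob("*.png")):
        try:
            with open(fpath, "rb") as f:
                b64 = base64.b64encode(f.read()).decode("ascii")
        except OSError as exc:
            logger.warning("Skipping unreadable plot %s: %s", fpath, exc)
            continue
        images.append(f'<img src="data:image/png;base64,{b64}" alt="{fpath.name}"/>')
    return images
```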
In `@claas/eval/README.md`:
- Around line 26-28: Add language identifiers to the fenced code blocks in
claas/eval/README.md: change the HTTP request fence that currently wraps "GET
/v1/eval?results_dir=./eval_results" to use ```http and change the
filesystem/tree fence around the eval_results example to use ```text; apply the
same fix to the other fenced block mentioned (lines 61-68) so all raw fences
include appropriate language identifiers for syntax highlighting.
In `@claas/eval/runner.py`:
- Around line 610-611: The fallback response_text ("I'd be happy to help you
with that.") must not be passed into _measure_eval_metrics because it skews
metrics; instead detect when response_text is still None after generation
attempts and either return early or record the sample as a failure (increment
the failure counter / call the existing failure-tracking helper) without calling
_measure_eval_metrics. Update the logic around the response_text assignment and
the call site of _measure_eval_metrics (the variable response_text and function
_measure_eval_metrics in runner.py) so real generated outputs only are measured
and fallback cases are flagged/returned as failures.
- Around line 188-214: The cache is currently keyed only by
SHA-256(stripped_visible_content) in _fetch_cached_completion, causing
collisions when different prompts produce the same visible content; update the
function to include the original prompt (or other distinguishing inputs like
model/params) in the cache key and/or validate the returned entry before using
it. Concretely: change _fetch_cached_completion signature to accept the original
prompt (or relevant request metadata), compute a composite cache key (e.g.,
hash(strip_thinking(visible_content) + original_prompt or include model/params),
use that key for the proxy lookup instead of content_hash alone, and after
fetching verify data["prompt"] matches the expected original prompt (or
reject/ignore the cached entry when it does not) so wrong generations are not
returned. Ensure you update all call sites to pass the new parameter and keep
strip_thinking usage for defensive normalization.
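A sketch of the composite key; `strip_thinking` is stood in for by a local regex since only its intent is known here, and the separator choice is arbitrary:

```python
import hashlib
import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)  # stand-in for the shared strip_thinking helper


def completion_cache_key(original_prompt: str, visible_content: str, model: str) -> str:
    """Key proxy-cache lookups on model + prompt + stripped visible content, not content alone."""
    normalized = _THINK_RE.sub("", visible_content).strip()
    payload = "\x1f".join((model, original_prompt, normalized))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```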
In `@claas/proxy/tinker_inference_proxy.py`:
- Line 498: The dict comprehension building request messages uses naive coercion
("m.content or \"\"") which can leave non-string payloads and break downstream
chat template rendering; update the comprehension in tinker_inference_proxy.py
(where dicts is created) to call the existing _coerce_content helper (i.e.,
{"role": m.role, "content": _coerce_content(m.content)}) so all message contents
are normalized to strings, and ensure _coerce_content is accessible in that
scope (import or move it if necessary).
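A minimal sketch of what `_coerce_content` could look like and how the comprehension would call it; the real helper in the proxy may differ:

```python
def _coerce_content(content: object) -> str:
    """Flatten OpenAI-style message content (str or list of text parts) into a plain string."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return "".join(part.get("text", "") for part in content if isinstance(part, dict))
    return "" if content is None else str(content)


# Call site, assuming each `m` has .role and .content attributes:
# dicts = [{"role": m.role, "content": _coerce_content(m.content)} for m in request.messages]
```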
---
Nitpick comments:
In `@claas/eval/collapse.py`:
- Around line 262-308: The tinker path in _entropy_and_logprob_tinker ignores
system_prompt and prompt_preamble (unlike the fallback in
measure_entropy_and_mean_logprob), so either thread a prebuilt messages list
through or surface the mismatch; update _entropy_and_logprob_tinker to accept or
build the same messages structure used by measure_entropy_and_mean_logprob
(include the system message if system_prompt is set, include a prompt_preamble
user message if set, then the main user prompt), use that messages list in the
OpenClaw /v1/chat/completions request and in the proxy /v1/score call instead of
the single-user message, or if you prefer not to change the signature add a
clear warning/assert when system_prompt or prompt_preamble are passed so the
silent drop is explicit. Ensure references to functions/parameters:
_entropy_and_logprob_tinker, measure_entropy_and_mean_logprob, system_prompt,
prompt_preamble, and the /v1/chat/completions and /v1/score request payloads are
updated consistently.
In `@claas/eval/dashboard.py`:
- Line 26: The EVAL_DASHBOARD_TEMPLATE constant is duplicated between
claas/eval/dashboard.py (EVAL_DASHBOARD_TEMPLATE) and claas/api.py; refactor to a
single source of truth by removing the duplicate and importing the constant
instead of redefining it, or move the constant to a new shared module (e.g.,
claas/templates.py) and update both places to import that symbol; update references
in functions/classes that use EVAL_DASHBOARD_TEMPLATE to import the shared
constant so only one definition exists.
In `@claas/eval/runner.py`:
- Line 231: Replace the bare assertion with a descriptive message (or explicit
check) so failures are informative: change the line with assert config.proxy_url
is not None to include a message like assert config.proxy_url is not None,
"config.proxy_url must be set (proxy_url missing from config)" or alternatively
perform an explicit check and raise a ValueError/RuntimeError with that message
to locate and diagnose the missing proxy_url when functions referencing
config.proxy_url (e.g., in runner.py where config.proxy_url is used) fail.
In `@tests/integration/test_tinker_stack_integration.py`:
- Around line 21-22: The test defines a duplicate regex _THINK_RE and manual
</think> stripping; instead import and use the shared strip_thinking helper (or
equivalent) used by the production modules to avoid drift: replace usages of
_THINK_RE and any manual think-tag removal in
tests/integration/test_tinker_stack_integration.py with a call to strip_thinking
(or the function name used in claas/proxy/tinker_inference_proxy.py /
claas/eval/verifiers.py), updating the test imports accordingly and removing the
orphaned regex and trim logic so all think-tag behavior is centralized.
```python
for entry in raw_logprobs[prompt_token_count:]:
    if entry is None:
        continue
    top = next(iter(entry.values()))
    logprob_sum += top["logprob"]
```
Guard against empty logprob entries.
If entry is an empty dict (malformed API response), next(iter(entry.values())) will raise StopIteration. Consider adding a guard.
🛡️ Suggested fix
```diff
 for entry in raw_logprobs[prompt_token_count:]:
     if entry is None:
         continue
+    if not entry:
+        continue
     top = next(iter(entry.values()))
     logprob_sum += top["logprob"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@claas/eval/logprob.py` around lines 138 - 143, The loop over
raw_logprobs[prompt_token_count:] can crash when an entry is an empty dict
because next(iter(entry.values())) raises StopIteration; update the loop in
claas/eval/logprob.py (the loop handling raw_logprobs[prompt_token_count:]) to
guard against empty or falsy entries by skipping when entry is None or not entry
(or when entry.values() is empty) before calling next(iter(...)), so it safely
continues on malformed API responses and only sums top["logprob"] for valid
entries.
claas/eval/__main__.py
```python
    default=None,
    help=(
        "Output directory for results "
        "(default: auto ./data/evals/<UTC timestamp>)"
```
Why do we have default as a comment but the default value is None? This is ridiculous
Fixed in 5ff74ce. Simplified the help text to "Output directory for results (auto-generated if omitted)".
Comment generated by Claude Code
This missed the point of my message. If we have a default value for output_dir, why don't we set it as this arg's default? Why have None that confusingly maps to a default?
Fixed in 9bb7ed2. --output-dir now defaults to ./data/evals (not None). A timestamped subdir is always appended in build_config.
Comment generated by Claude Code
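A sketch of how the resolved behavior could look in `build_config`; the helper name and timestamp format are illustrative, only the `./data/evals` default comes from the thread:

```python
from datetime import datetime, timezone
from pathlib import Path


def resolve_output_dir(output_dir: str = "./data/evals") -> Path:
    """Always nest results in a UTC-timestamped subdirectory under the default base."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    run_dir = Path(output_dir) / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```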
claas/eval/collapse.py
```python
    via proxy /v1/score to get mean_logprob for drift detection.
    - **Direct vLLM** (fallback, no openclaw_url): Legacy direct-to-vLLM path.
    """
    if openclaw_url and proxy_url:
```
This is super confusing. We have different logic for local vs tinker, but we try to infer it from what was passed in. It makes NO sense. Why not just have an explicit tinker or local mode and enforce up front that all the right config is passed in?
Addressed in 85170af. Removed the engine-specific Tinker branch — the function no longer infers mode from parameters. The OpenClaw path handles both local and Tinker routing uniformly.
Comment generated by Claude Code
You did not address my larger point: why do we not have a way to configure execution mode in eval the way we do in training?
Fixed in 9bb7ed2. Added explicit --mode local|tinker to HarnessConfig and CLI. Extracted openclaw_chat_params() and direct_vllm_chat_params() as two clear routing functions in types.py. All callers (runner, collapse) now use these instead of inline if/else branching on URL presence. The runner checks config.mode for tinker-vs-local decisions (LoRA loading, proxy cache).
Comment generated by Claude Code
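A sketch of what an explicit mode on the config could look like; the field names echo the PR description, but the validation details are assumptions rather than the actual `HarnessConfig`:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class HarnessConfig:
    mode: Literal["local", "tinker"] = "local"
    proxy_url: str | None = None
    base_model: str = "Qwen/Qwen3-8B"
    batch_size: int = 1

    def __post_init__(self) -> None:
        # Force the right config up front instead of inferring the mode from URL presence.
        if self.mode == "tinker" and not self.proxy_url:
            raise ValueError("tinker mode requires --proxy-url")
        if self.batch_size < 1:
            raise ValueError("batch_size must be >= 1")
```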
data/feedback/20260217T184944-4741c07f6d8f4e72a9cc9d09b1e6c8df.json
Engines should emit canonical metric keys directly. Remove the normalize_sdpo_metrics helper and its callers in api.py and runner.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete the Tinker-specific collapse implementation and remove proxy_url routing from measure_entropy_and_mean_logprob and measure_collapse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…branch The score endpoint only needs raw prompt text + completion. Remove the messages/chat-template codepath that added unnecessary complexity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove ge/le validators so the API doesn't reject unusual learning rates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Files remain on disk but are no longer tracked (.gitignore updated in prior commit). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use DistillBatchItem for feedback samples instead of dict[str, object]
- Use FeedbackBatchRequest/DistillRequest to build the API payload
- Add ExperimentSummary dataclass for _write_summary entries
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --mode local|tinker to HarnessConfig and CLI, replacing inference from URL presence. Tinker mode validates --openclaw-url is provided.
- Set --output-dir default to ./data/evals (not None), always append a timestamped subdir.
- Replace config.proxy_url checks in runner.py with config.mode checks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ChatRequestParams dataclass and two builder functions in types.py for clear separation of OpenClaw vs direct-vLLM communication
- Replace inline if/else branching in runner._generate_response, collapse.measure_entropy_and_mean_logprob, and collapse.measure_self_rouge_l with the shared helpers
- Fix --output-dir to default to ./data/evals (not None), always appends a timestamped subdir
- Remove tinker --openclaw-url requirement (not needed)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval harness was showing null for all SDPO metrics in Tinker mode because the Tinker engine returns different metric names (adv_mean, kl_mean, etc.) than the local engine (distill_loss, kl_reg, etc.). Instead of converting between types, support both metric shapes natively:
- Rename SDPOMetrics -> LocalDistillMetrics
- Add TinkerDistillMetrics dataclass with native Tinker field names
- Branch on metadata keys in runner, plotting, and dashboard
- Detect type via "adv_mean" key presence during deserialization
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move DEFAULT_SYSTEM_PROMPT ("You are a helpful assistant.") into
core/config.py so both eval types and teacher_helpers share one source.
Change teacher default from "expert coding assistant" to match, and
update feedback framing from "unsuccessful earlier attempt" to
"positive or negative feedback" since feedback can be either.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The config now defaults feedback_log_dir to ./data/feedback, so the explicit FEEDBACK_LOG_DIR env var in Docker Compose and the local stack script is no longer needed. Update volume mounts to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ae09066fa7
```python
prompt = "\n".join(
    f"{m.get('role', 'user')}: {m.get('content', '')}" for m in messages
)
```
Use chat-template inputs for proxy logprob scoring
When proxy_url is set, the scorer converts structured chat messages into a plain newline string ("role: content") before calling /v1/score, so Tinker-mode margins are computed on a different prompt format than generation and direct-vLLM scoring. This can materially skew or even invert margin deltas for chat models (especially with system prompts), which undermines the core evaluation metric; pass chat-structured inputs (or a properly rendered chat template) to the proxy instead of flattening roles into raw text.
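A hedged sketch of scoring with the rendered chat template instead of a flattened "role: content" string; the `/v1/score` field names and the use of the base model's tokenizer are assumptions:

```python
import httpx
from transformers import AutoTokenizer


async def score_with_chat_template(
    proxy_url: str,
    base_model: str,
    messages: list[dict[str, str]],
    completion: str,
    timeout_s: float = 60.0,
) -> dict:
    """Render the same chat template used at generation time before calling /v1/score."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    async with httpx.AsyncClient(base_url=proxy_url, timeout=timeout_s) as client:
        resp = await client.post("/v1/score", json={"prompt": prompt, "completion": completion})
        resp.raise_for_status()
        return resp.json()
```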
```python
# Gemini feedback override (optional)
if needs_generation and gemini_user:
    try:
        gemini_result = await gemini_user.evaluate_response(response_text, prompt)
```
Pair Gemini evaluation with the matching batch prompt
In batched mode, response_text is captured from the first successful sample, but prompt here is the loop variable after batching (typically the last sample), so Gemini can evaluate a response against the wrong prompt whenever batch_size > 1. If Gemini feedback is enabled, this mismatch can produce irrelevant feedback and then overwrite all samples in the batch with that incorrect feedback.
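A sketch of keeping each response tied to its own prompt so Gemini never scores a mismatched pair; `evaluate_response` is taken from the snippet above, while the batching structure is assumed:

```python
from dataclasses import dataclass


@dataclass
class BatchSample:
    prompt: str
    response_text: str
    feedback: str | None = None


async def apply_gemini_feedback(samples: list[BatchSample], gemini_user) -> None:
    """Evaluate every sample against its own prompt instead of the last loop variable."""
    for sample in samples:
        # Per-sample feedback; never overwrite the whole batch with one result.
        sample.feedback = await gemini_user.evaluate_response(sample.response_text, sample.prompt)
```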
kl_mean from TinkerDistillMetrics is a signed log-prob delta, not a true KL divergence. Stop plotting it and hide the right panel for Tinker runs. Also differentiate axis labels between Tinker (advantage) and local (GJS loss) engines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Rewrite `_submit_feedback` to use the `FeedbackBatchRequest` format with `requests[]`, `rollout_logprobs`, and `orchestration` (broken since the Batched Feedback PR, #12)
- Add `POST /v1/score` endpoint for logprob scoring, proxy-based rollout collection from cache, skip vLLM LoRA ops in Tinker mode
- Add `--batch-size` flag (1=progressive, 4=batched), `--base-model` flag (replaces hardcoded `Qwen/Qwen3-8B`), `--proxy-url` flag to enable Tinker mode
Files Changed
- `claas/proxy/tinker_inference_proxy.py`: `ScoreRequest` model + `POST /v1/score` endpoint
- `scripts/eval/types.py`: `proxy_url`, `base_model`, `batch_size` on `HarnessConfig`; `proxy_url` on `MetricContext`
- `scripts/eval/__main__.py`: `--proxy-url`, `--base-model`, `--batch-size` CLI flags
- `scripts/eval/logprob.py`: `fetch_response_logprob_sum_via_proxy()`; proxy dispatch in `measure_logprob_margin()`
- `scripts/eval/metrics.py`: `proxy_url` from `MetricContext` to logprob scoring
- `scripts/eval/runner.py`
- `tests/test_tinker_proxy.py`: `/v1/score` endpoint
Test plan
- `uv run ruff check claas/ tests/ scripts/ --fix` — clean
- `uv run ty check` — only pre-existing unresolved-import errors for GPU deps
- `uv run pytest tests/ -v -m "not integration"` — 95 passed
- `python -m scripts.eval --proxy-url http://localhost:8000 --base-model Qwen/Qwen3-30B-A3B --preferences identity --num-steps 5 --metrics logprob --plots`
- `python -m scripts.eval --proxy-url http://localhost:8000 --base-model Qwen/Qwen3-30B-A3B --preferences identity --num-steps 5 --batch-size 4 --metrics logprob --plots`
- `steps.jsonl` shows positive margin deltas across steps
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
- `/v1/score` endpoint to compute per-token log-probabilities for prompt-completion pairs.
- `/v1/eval` for viewing and analyzing evaluation results.
- `eval` CLI command for running model evaluations.
Bug Fixes
Documentation
Tests