
Eval harness: Tinker proxy scoring, batched feedback, configurable base model#14

Merged
kfallah merged 33 commits into main from
eval-harness-tinker-batch
Feb 18, 2026

Conversation

@kfallah
Owner

@kfallah kfallah commented Feb 16, 2026

Summary

  • Fix broken feedback payload: Rewrite _submit_feedback to use the FeedbackBatchRequest format with requests[], rollout_logprobs, and orchestration (broken since the Batched Feedback PR #12); a sketch of the payload shape follows this list
  • Add Tinker proxy support: New POST /v1/score endpoint for logprob scoring, proxy-based rollout collection from cache, skip vLLM LoRA ops in Tinker mode
  • Add batch mode and configurable base model: --batch-size flag (1=progressive, 4=batched), --base-model flag (replaces hardcoded Qwen/Qwen3-8B), --proxy-url flag to enable Tinker mode
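
For orientation, a rough sketch of the batched feedback call described in the first bullet is shown below. The endpoint path, the example values, and every field name other than requests, rollout_logprobs, and orchestration are assumptions for illustration, not the actual CLaaS FeedbackBatchRequest schema.

```python
# Hedged sketch of a batched feedback submission. Only requests[],
# rollout_logprobs, and orchestration are named in this PR; the URL and the
# remaining keys are placeholders.
import httpx

payload = {
    "requests": [
        {
            "prompt": "Who are you?",                          # placeholder probe prompt
            "response": "I'm the assistant you configured.",   # placeholder rollout text
            "feedback": "Answer in the configured persona.",   # placeholder teacher feedback
            "rollout_logprobs": [-0.12, -0.034, -1.7],         # per-token logprobs of the rollout
        },
        # ... one entry per sample in the batch (batch_size entries)
    ],
    "orchestration": "sdpo",  # assumed value; names the training orchestration
}

resp = httpx.post("http://localhost:8080/v1/feedback/batch", json=payload, timeout=120.0)
resp.raise_for_status()
```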

Files Changed

| File | Key Changes |
| --- | --- |
| claas/proxy/tinker_inference_proxy.py | ScoreRequest model + POST /v1/score endpoint |
| scripts/eval/types.py | proxy_url, base_model, batch_size on HarnessConfig; proxy_url on MetricContext |
| scripts/eval/__main__.py | --proxy-url, --base-model, --batch-size CLI flags |
| scripts/eval/logprob.py | fetch_response_logprob_sum_via_proxy(); proxy dispatch in measure_logprob_margin() |
| scripts/eval/metrics.py | Pass proxy_url from MetricContext to logprob scoring |
| scripts/eval/runner.py | Batched feedback, rollout collection, vLLM logprob scoring, Tinker mode, configurable base_model |
| tests/test_tinker_proxy.py | Unit tests for /v1/score endpoint |

Test plan

  • uv run ruff check claas/ tests/ scripts/ --fix — clean
  • uv run ty check — only pre-existing unresolved-import errors for GPU deps
  • uv run pytest tests/ -v -m "not integration" — 95 passed
  • Run progressive test: python -m scripts.eval --proxy-url http://localhost:8000 --base-model Qwen/Qwen3-30B-A3B --preferences identity --num-steps 5 --metrics logprob --plots
  • Run batched test: python -m scripts.eval --proxy-url http://localhost:8000 --base-model Qwen/Qwen3-30B-A3B --preferences identity --num-steps 5 --batch-size 4 --metrics logprob --plots
  • Verify steps.jsonl shows positive margin deltas across steps
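
A quick way to check the last item is to read steps.jsonl directly. This is a sketch under assumptions: the results path follows the layout documented later in this PR (eval_results/<preference>/steps.jsonl) and the margin field is assumed to be called logprob_margin; substitute whatever key the harness actually writes.

```python
# Sketch: confirm the logprob margin trends upward across feedback steps.
# The file path and the "logprob_margin" key are assumptions for illustration.
import json
from pathlib import Path

steps_path = Path("eval_results/identity/steps.jsonl")
steps = [json.loads(line) for line in steps_path.read_text().splitlines() if line.strip()]
margins = [step["logprob_margin"] for step in steps if "logprob_margin" in step]
deltas = [later - earlier for earlier, later in zip(margins, margins[1:])]

print("margins:", margins)
print("all deltas positive:", all(d > 0 for d in deltas))
```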

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added /v1/score endpoint to compute per-token log-probabilities for prompt-completion pairs.
    • Added evaluation dashboard accessible at /v1/eval for viewing and analyzing evaluation results.
    • Added eval CLI command for running model evaluations.
  • Bug Fixes

    • Fixed handling of orphaned thinking block closing tags in response content.
  • Documentation

    • Added evaluation harness documentation.
  • Tests

    • Added comprehensive tests for the scoring endpoint.
    • Added cache integrity tests with thinking block handling.

…se model

Fix eval harness for the batched feedback API (broken since PR #12) and add
Tinker proxy support so evals can run against the Tinker stack without vLLM.

- Add POST /v1/score endpoint to Tinker proxy for logprob scoring
- Rewrite _submit_feedback to use FeedbackBatchRequest with rollout_logprobs
- Add --proxy-url, --base-model, --batch-size CLI flags
- Add proxy-based logprob scoring path (fetch_response_logprob_sum_via_proxy)
- Add rollout collection from proxy cache (_generate_and_collect)
- Skip vLLM LoRA load/reload when in Tinker mode
- Use config.base_model instead of hardcoded Qwen/Qwen3-8B in _init_lora
- Add unit tests for /v1/score endpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Feb 16, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


A new /v1/score endpoint is introduced to compute per-token logprobs for prompt-completion pairs. The evaluation harness is extended with proxy-based generation support via optional CLI flags, enabling batch processing and alternative data retrieval through a proxy cache instead of direct vLLM API calls.

Changes

• Proxy Scoring Endpoint (claas/proxy/tinker_inference_proxy.py, tests/test_tinker_proxy.py):
  Added new ScoreRequest data model and POST /v1/score endpoint to compute per-token logprobs for prompt-completion pairs. Tests verify logprob computation and edge cases (None values).
• Configuration & Type System (scripts/eval/__main__.py, scripts/eval/types.py):
  Extended HarnessConfig with three fields: proxy_url (optional), base_model (default "Qwen/Qwen3-8B"), and batch_size (default 1). Added CLI option parsing for these fields and proxy_url to MetricContext.
• Logprob Measurement with Proxy (scripts/eval/logprob.py, scripts/eval/metrics.py):
  Added fetch_response_logprob_sum_via_proxy helper to fetch logprobs via the proxy endpoint. Extended measure_logprob_margin to conditionally use proxy or direct vLLM calls based on the proxy_url parameter. The metrics layer now propagates proxy_url to logprob measurement.
• Evaluation Runner Batching & Proxy Mode (scripts/eval/runner.py):
  Replaced single-sample feedback submission with batched submission. Added a proxy-based generation flow (_generate_and_collect, _fetch_rollout_from_proxy_cache, _fetch_rollout_logprobs_vllm) to support Tinker mode. Extended the evaluation run to skip LoRA loading in proxy mode and batch-step across multiple probes. Updated Gemini feedback propagation for batch context.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant EvalRunner as Eval Runner
    participant Proxy
    participant Cache as Proxy Cache
    participant vLLM
    
    rect rgba(100, 150, 200, 0.5)
    Note over Client,vLLM: Proxy Mode (proxy_url configured)
    Client->>EvalRunner: run_preference_experiment(config with proxy_url)
    EvalRunner->>Proxy: POST /v1/completions (prompt)
    Proxy->>vLLM: Generate completion
    vLLM-->>Proxy: completion_text
    Proxy->>Cache: Store completion & logprobs
    Proxy-->>EvalRunner: response
    EvalRunner->>Cache: fetch_rollout_from_proxy_cache(response)
    Cache-->>EvalRunner: completion, logprobs
    EvalRunner->>EvalRunner: measure_logprob_margin(proxy_url=...)
    EvalRunner->>Proxy: POST /v1/score (prompt+completion)
    Proxy-->>EvalRunner: logprobs, logprob_sum
    EvalRunner->>EvalRunner: Collect sample (prompt, response, logprobs, feedback)
    EvalRunner->>EvalRunner: Batch & submit_feedback (multiple samples)
    end
    
    rect rgba(200, 150, 100, 0.5)
    Note over Client,vLLM: Direct vLLM Mode (no proxy_url)
    Client->>EvalRunner: run_preference_experiment(config without proxy_url)
    EvalRunner->>vLLM: Generate completion
    vLLM-->>EvalRunner: completion
    EvalRunner->>vLLM: Get per-token logprobs
    vLLM-->>EvalRunner: logprobs
    EvalRunner->>EvalRunner: measure_logprob_margin(proxy_url=None)
    EvalRunner->>EvalRunner: Collect sample
    EvalRunner->>EvalRunner: Batch & submit_feedback (multiple samples)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 Hops through the code with glee,
A proxy for scoring, oh what we see!
Batches of feedback in one swift leap,
Logprobs now cached, our secrets to keep!
Tinker mode dancing, both vLLM and more,
New endpoints await at each proxy door! 🎯

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

• Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
• Merge Conflict Detection: ⚠️ Warning. ❌ Merge conflicts detected (8 files):

⚔️ README.md (content)
⚔️ claas/proxy/tinker_inference_proxy.py (content)
⚔️ scripts/eval/__main__.py (content)
⚔️ scripts/eval/logprob.py (content)
⚔️ scripts/eval/metrics.py (content)
⚔️ scripts/eval/runner.py (content)
⚔️ scripts/eval/types.py (content)
⚔️ tests/test_tinker_proxy.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
• Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
• Title check: ✅ Passed. The title accurately summarizes the three main changes: Tinker proxy scoring support, batched feedback implementation, and configurable base model option.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@scripts/eval/logprob.py`:
- Around line 121-134: The proxy call in fetch_response_logprob_sum_via_proxy
currently posts prompt and completion separately which causes tokenization
boundary mismatches versus joint encoding; instead send a single concatenated
string so the proxy encodes the full prompt+completion as one unit (or update
the proxy to provide a joint-encoding mode) — update the JSON payload in
fetch_response_logprob_sum_via_proxy (or adjust the /v1/score handler) to ensure
the tokenizer sees chatml_prefix + response_text together, so returned
"logprob_sum" uses the same tokenization as the vLLM path.

In `@scripts/eval/runner.py`:
- Around line 256-311: The code in _fetch_rollout_logprobs_vllm builds a
hardcoded ChatML string (chatml_prefix using <|im_start|> / <|im_end|> and
constants _ASSISTANT_PREFIX/_IM_END) which will mis-tokenize non-ChatML base
models; change the implementation to either (1) call the vLLM /tokenize and
/v1/completions endpoints using the messages structure (pass the messages list
instead of a ChatML string) if the server supports it, or (2) detect the base
model/template and construct the prefix using the correct template markers (or
expose a template strategy) before calling tokenize and completions; update the
code paths that build chatml_prefix and the /tokenize call in
_fetch_rollout_logprobs_vllm (and the constants _ASSISTANT_PREFIX/_IM_END) to
use the chosen approach so token counts and prompt_logprobs are correct for
non-ChatML models.
- Around line 193-220: The code in _fetch_rollout_from_proxy_cache reads
data["logprobs"] which can be None (per _CompletionCacheEntry), causing
downstream failures; update the function to guard against None by coalescing
data["logprobs"] to an empty list (or validating and raising a clear error)
before assigning to logprobs, e.g., replace the direct assignment to ensure
logprobs is a list[float] (and keep the function return signature unchanged),
and add a brief comment referencing the proxy cache nullable contract so future
readers see why the default is used.
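
The three prompts above circle the same alignment issue: the scoring path must tokenize the prompt and completion the same way the rollout was produced, without hand-built ChatML and without None logprobs leaking downstream. The sketch below illustrates one way to do that with the tokenizer's own chat template; the function names and the slicing convention are illustrative assumptions, not the code in this PR.

```python
# Illustrative sketch (not the PR's implementation): render the prompt with the
# model's chat template instead of hardcoded <|im_start|>/<|im_end|> markers,
# append the completion's tokens, and slice the completion logprobs by count.
from transformers import AutoTokenizer

def build_score_inputs(base_model: str, messages: list[dict], completion: str) -> tuple[list[int], int]:
    tok = AutoTokenizer.from_pretrained(base_model)
    # The chat template handles ChatML and non-ChatML models alike.
    prompt_ids = list(tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True))
    completion_ids = tok.encode(completion, add_special_tokens=False)
    return prompt_ids + completion_ids, len(prompt_ids)

def completion_logprob_sum(logprobs_full: list, prompt_len: int, completion_len: int) -> float:
    # Coalesce None defensively; the first scored token often has no logprob.
    tail = logprobs_full[prompt_len : prompt_len + completion_len]
    return sum(lp if lp is not None else 0.0 for lp in tail)
```
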
🧹 Nitpick comments (1)
scripts/eval/runner.py (1)

671-686: Single fixed-delay retry may be insufficient for transient failures.

The retry logic uses a single 5-second sleep before retrying. For production-grade resilience, consider exponential backoff or at least a configurable retry count. However, for an eval harness this is pragmatic enough.
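
As a point of comparison, an exponential-backoff wrapper is only a few lines. This is a generic sketch, not code from the harness; the attempt count and base delay are arbitrary.

```python
# Generic exponential-backoff retry sketch (2s, 4s, 8s, ...). Not the harness's
# actual retry helper; attempts/base_delay are illustrative defaults.
import asyncio

async def with_retries(coro_factory, attempts: int = 4, base_delay: float = 2.0):
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2**attempt)
```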

The cached response text and logprobs come from the same seq.tokens /
seq.logprobs — they are already aligned. Stripping wrapper tokens from
the text without stripping corresponding logprobs would cause a length
mismatch in the training engine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c29a9d0d2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

kfallah and others added 6 commits February 16, 2026 11:10
…n tests

Centralize ChatML construction in messages_to_chatml() (teacher_helpers)
and delegate to tokenizer.apply_chat_template() via the proxy's /v1/score
endpoint. This eliminates 4 independent <|im_start|>/<|im_end|> construction
sites that would break for non-ChatML models.

- Add optional `messages` field to /v1/score ScoreRequest with model_validator
- Replace LogprobPair.prompt_chatml with prompt_messages (list[ChatMessage])
- Update logprob.py to pass messages instead of pre-built ChatML strings
- Replace manual ChatML loop in runner.py with messages_to_chatml()
- Add \n separator to messages_to_chatml() for canonical ChatML format
- Remove template stripping from integration test _fetch_raw_completion()
- Add tests for /v1/score with messages and validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Tinker SDK's compute_logprobs() returns a Future, not the logprobs
directly. Add .result() to match the pattern used in _sample_async().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ashboard

- Move all 12 eval files from scripts/eval/ to claas/eval/
- Extract add_eval_arguments() and build_config() for reuse
- Add `claas eval` CLI subcommand in claas/cli.py
- Add /v1/eval endpoint serving HTML dashboard from eval results
- New claas/eval/dashboard.py reads summary.json + steps.jsonl and
  renders summary table with per-preference expandable step details
- New claas/eval_dashboard.html template (same style as index.html)
- Add claas/eval/README.md with usage, metrics, and output format docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap apply_chat_template and encode results in list() so the
prompt_tokens variable is always list[int], resolving invalid-assignment
and invalid-argument-type errors from ty.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The return type of apply_chat_template is a complex union that ty
cannot narrow through isinstance checks. Annotating the result as Any
resolves invalid-assignment and invalid-argument-type errors on both
Python 3.11 (CI) and 3.14 (local).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…plots

- Remove CriteriaResult/PreferenceSummary types and all pass/fail/marginal
  verdict logic from runner and dashboard — raw metrics only
- Dashboard now discovers all run subdirectories under results_dir and
  renders each as a collapsible section (newest-first)
- Embed plot PNGs inline as base64 <img> tags in a responsive grid
- Remove threshold lines/bands from all plot functions
- Update HTML template: drop verdict CSS, add run-section and plot-grid styles

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kfallah
Owner Author

kfallah commented Feb 17, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3fc822f208


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
claas/eval/collapse.py (1)

205-212: ⚠️ Potential issue | 🟡 Minor

Treat missing logprobs as unavailable entropy.

If the backend omits logprobs, returning 0.0 entropy can trigger false collapse signals. Consider returning None for entropy in this case to align with the optional type.

🔧 Safer fallback
     logprobs_content = choice.get("logprobs", {}).get("content", [])
 
     if not logprobs_content:
-        return (0.0, 0.0)
+        return (None, 0.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/collapse.py` around lines 205 - 212, The code treats missing
logprobs as zero entropy by returning (0.0, 0.0); instead return None for
unavailable entropy so callers can distinguish "unavailable" from a numeric
entropy. In the block that inspects choice = resp.json()["choices"][0] and
logprobs_content = choice.get("logprobs", {}).get("content", []), change the
early return for missing logprobs_content to return (None, None) (and update any
related type hints or callers to accept Optional[float]) so entropy is
represented as unavailable rather than zero.
claas/eval/__main__.py (1)

160-176: ⚠️ Potential issue | 🟡 Minor

Guard against batch_size <= 0.

The CLI accepts any integer; a value of 0 (or negative) will skip sampling and still record steps, producing misleading results. Add a simple validation in build_config (or in argparse) to enforce >= 1.

✅ Minimal validation
 def build_config(args: argparse.Namespace) -> HarnessConfig:
     """Build a HarnessConfig from parsed CLI arguments."""
     # Parse --metrics comma string
     metrics_list = [m.strip() for m in args.metrics.split(",") if m.strip()]
@@
     if args.collapse_steps is not None:
         collapse_steps = {int(s.strip()) for s in args.collapse_steps.split(",") if s.strip()}
+
+    if args.batch_size < 1:
+        raise ValueError("--batch-size must be >= 1")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/__main__.py` around lines 160 - 176, Add validation to ensure
args.batch_size is >= 1 inside build_config: check args.batch_size (from the
CLI) and if it's less than 1 raise a clear exception (e.g., ValueError with
message "batch_size must be >= 1") or otherwise terminate, so sampling can't be
skipped; modify build_config to perform this guard before using args.batch_size
to construct the HarnessConfig and reference the args.batch_size symbol and the
build_config function in the change.
🧹 Nitpick comments (4)
claas/cli.py (1)

201-206: Consider deferring the import to avoid module load overhead.

The from .eval.__main__ import add_eval_arguments runs at module parse time (when setting up subparsers), which loads the eval module even when other commands are invoked. This is inconsistent with the lazy import pattern used in cmd_eval (lines 88-89).

♻️ Defer import to match lazy loading pattern
     # eval command
     eval_parser = subparsers.add_parser("eval", help="Run the evaluation harness")
-    from .eval.__main__ import add_eval_arguments
-
-    add_eval_arguments(eval_parser)
+    # Arguments are added lazily when the eval command is actually invoked
+    # to avoid loading the eval module for other commands
     eval_parser.set_defaults(func=cmd_eval)

Then in cmd_eval, add the arguments dynamically:

def cmd_eval(args: argparse.Namespace) -> int:
    """Run the evaluation harness."""
    from .eval.__main__ import add_eval_arguments, build_config
    from .eval.runner import run_harness
    
    # Re-parse with eval-specific arguments if needed
    config = build_config(args)
    asyncio.run(run_harness(config))
    return 0

However, this would require architectural changes to argparse setup. The current approach is acceptable if the eval module import is lightweight.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/cli.py` around lines 201 - 206, The import of add_eval_arguments runs
at module-import time and should be deferred to avoid loading the eval module
for other commands; remove the top-level "from .eval.__main__ import
add_eval_arguments" and instead import add_eval_arguments lazily inside cmd_eval
(the command handler) or right before calling add_eval_arguments when
constructing eval_parser (preferred: move the add_eval_arguments call into
cmd_eval so eval-specific args are added/parsed only when cmd_eval runs); update
references to add_eval_arguments and ensure
eval_parser.set_defaults(func=cmd_eval) remains so argparse still dispatches to
the lazily-importing cmd_eval.
claas/eval/README.md (2)

26-28: Add language specifier to code block.

Static analysis flagged this code block as missing a language specifier.

📝 Proposed fix
-```
+```text
 GET /v1/eval?results_dir=./eval_results

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @claas/eval/README.md around lines 26 - 28, The code block showing "GET
/v1/eval?results_dir=./eval_results" lacks a language specifier; update the
triple-backtick delimiter for that block to include a language (e.g., change totext) so the fenced code block becomes text ... ; locate the README
section containing the GET /v1/eval?results_dir=./eval_results snippet and
modify the opening fence accordingly.



---

61-68: Add language specifier to directory structure block.

Static analysis flagged this code block as missing a language specifier.


📝 Proposed fix

-```
+```text
 eval_results/
 ├── summary.json              # Per-preference pass/fail verdicts
 └── <preference>/
     ├── metadata.json          # Run config + LoRA ID
     ├── baseline.json          # Pre-training metric snapshot
     └── steps.jsonl            # One JSON object per feedback step
 ```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @claas/eval/README.md around lines 61 - 68, Update the fenced code block in
README.md under the eval_results directory structure to include a language
specifier (use "text") so the block is properly highlighted; locate the block
that currently starts with ``` and replace it with ```text to produce the same
directory listing, ensuring the rest of the block content (summary.json,
<preference>/metadata.json, baseline.json, steps.jsonl) remains unchanged.


claas/proxy/tinker_inference_proxy.py (1)

494-496: Consider logging when logprobs are None.

Replacing `None` logprobs with `0.0` silently masks potential issues. A logprob of `0.0` means probability 1.0, which is likely incorrect if the actual value was `None` due to an error or edge case.


♻️ Proposed improvement

     raw = logprobs_full[prompt_len : prompt_len + completion_len]
-    logprobs = [lp if lp is not None else 0.0 for lp in raw]
+    if any(lp is None for lp in raw):
+        logger.warning(
+            "Some completion logprobs were None (got %d/%d), defaulting to 0.0",
+            sum(1 for lp in raw if lp is None),
+            len(raw),
+        )
+    logprobs = [lp if lp is not None else 0.0 for lp in raw]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@claas/proxy/tinker_inference_proxy.py` around lines 494 - 496, The current
slice that builds logprobs (using logprobs_full[prompt_len : prompt_len +
completion_len] into raw and mapping None to 0.0 into logprobs) silently masks
missing values; update the logic in the function containing this block (the code
around variables logprobs_full, prompt_len, completion_len, raw, logprobs) to
detect when any lp is None, emit a warning/error-level log including contextual
info (index or token position, prompt_len/completion_len, maybe a sample of
logprobs_full), and only then decide on a fallback value or propagate the None;
ensure the logger call includes enough context to debug the source of None
values rather than silently substituting 0.0.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @claas/api.py:

  • Around line 844-857: The eval_dashboard endpoint takes results_dir from the
    query and passes it to eval_dashboard_html allowing arbitrary filesystem reads;
    fix by restricting and validating the incoming path: define a known base
    directory (e.g., DEFAULT_RESULTS_BASE) and in eval_dashboard resolve the
    requested path with pathlib.Path(results_dir).resolve(), join/resolve against
    the base (base_dir.joinpath(requested).resolve()) and reject the request (raise
    HTTPException 400/403) if the resolved path is not a child of the base_dir or if
    results_dir is absolute or contains upward traversal; only pass the
    validated/resolved child path into eval_dashboard_html and log the rejection
    reasons.

In @claas/eval/dashboard.py:

  • Around line 33-53: The current loaders _load_json and _load_steps will raise
    on malformed JSON and crash the dashboard; update them to catch JSONDecodeError
    (and optionally ValueError/UnicodeDecodeError) and handle failures gracefully:
    in _load_json (function _load_json) wrap json.load in try/except and return {}
    while logging or ignoring the error if decoding fails, and in _load_steps
    (function _load_steps) wrap json.loads per-line in try/except and skip lines
    that fail to decode (do not append) while optionally logging a warning; keep
    using step_result_from_dict when json.loads succeeds so valid lines still
    produce StepResult objects.

In @claas/eval/runner.py:

  • Around line 188-214: The cache lookup in _fetch_cached_completion currently
    keys only on SHA-256(strip_thinking(visible_content)) which can return a cached
    {prompt,response,logprobs} from a different generation; change the logic to
    include the generation prompt (or its tokenized hash or request_id) in the cache
    key or, if you cannot change the proxy key, fetch the cached entry and validate
    that data["prompt"] equals the expected prompt/context before returning; if the
    prompt does not match, fallback to fetching fresh scores (e.g., call /v1/score
    or re-run scoring) and do not return the mismatched entry.

In @claas/proxy/tinker_inference_proxy.py:

  • Around line 470-478: The code assumes tokenizer.apply_chat_template returns
    either a list or a dict and narrows types unsafely; fix by normalizing the
    result: call tokenizer.apply_chat_template into result, then if
    isinstance(result, list) set prompt_tokens = result, else if it's a dict extract
    prompt_tokens = result.get("input_ids") (and validate it's not None), otherwise
    raise/handle error; reference symbols: req.messages,
    tokenizer.apply_chat_template, result, prompt_tokens.

Outside diff comments:
In @claas/eval/__main__.py:

  • Around line 160-176: Add validation to ensure args.batch_size is >= 1 inside
    build_config: check args.batch_size (from the CLI) and if it's less than 1 raise
    a clear exception (e.g., ValueError with message "batch_size must be >= 1") or
    otherwise terminate, so sampling can't be skipped; modify build_config to
    perform this guard before using args.batch_size to construct the HarnessConfig
    and reference the args.batch_size symbol and the build_config function in the
    change.

In @claas/eval/collapse.py:

  • Around line 205-212: The code treats missing logprobs as zero entropy by
    returning (0.0, 0.0); instead return None for unavailable entropy so callers can
    distinguish "unavailable" from a numeric entropy. In the block that inspects
    choice = resp.json()["choices"][0] and logprobs_content = choice.get("logprobs",
    {}).get("content", []), change the early return for missing logprobs_content to
    return (None, None) (and update any related type hints or callers to accept
    Optional[float]) so entropy is represented as unavailable rather than zero.

Nitpick comments:
In @claas/cli.py:

  • Around line 201-206: The import of add_eval_arguments runs at module-import
    time and should be deferred to avoid loading the eval module for other commands;
remove the top-level "from .eval.__main__ import add_eval_arguments" and instead
    import add_eval_arguments lazily inside cmd_eval (the command handler) or right
    before calling add_eval_arguments when constructing eval_parser (preferred: move
    the add_eval_arguments call into cmd_eval so eval-specific args are added/parsed
    only when cmd_eval runs); update references to add_eval_arguments and ensure
    eval_parser.set_defaults(func=cmd_eval) remains so argparse still dispatches to
    the lazily-importing cmd_eval.

In @claas/eval/README.md:

  • Around line 26-28: The code block showing "GET
    /v1/eval?results_dir=./eval_results" lacks a language specifier; update the
triple-backtick delimiter for that block to include a language (e.g., change ``` to ```text) so the fenced code block becomes ```text ... ```; locate the README
    section containing the GET /v1/eval?results_dir=./eval_results snippet and
    modify the opening fence accordingly.
  • Around line 61-68: Update the fenced code block in README.md under the
    eval_results directory structure to include a language specifier (use "text") so
    the block is properly highlighted; locate the block that currently starts with ``` and
    replace it with ```text to produce the same directory listing, ensuring the rest of the
    block content (summary.json, <preference>/metadata.json, baseline.json, steps.jsonl)
    remains unchanged.

In @claas/proxy/tinker_inference_proxy.py:

  • Around line 494-496: The current slice that builds logprobs (using
    logprobs_full[prompt_len : prompt_len + completion_len] into raw and mapping
    None to 0.0 into logprobs) silently masks missing values; update the logic in
    the function containing this block (the code around variables logprobs_full,
    prompt_len, completion_len, raw, logprobs) to detect when any lp is None, emit a
    warning/error-level log including contextual info (index or token position,
    prompt_len/completion_len, maybe a sample of logprobs_full), and only then
    decide on a fallback value or propagate the None; ensure the logger call
    includes enough context to debug the source of None values rather than silently
    substituting 0.0.


Comment on lines 188 to 214
async def _fetch_cached_completion(
    proxy_url: str,
    visible_content: str,
) -> tuple[str, str, list[float]]:
    """Fetch the cached raw completion from the proxy by content hash.

    The proxy caches ``{prompt, response, logprobs}`` keyed by
    ``SHA-256(stripped_content)``. Since the proxy now strips thinking
    before returning content, ``visible_content`` already matches the
    cache key. We apply ``strip_thinking`` defensively in case the caller
    passes un-stripped text.

    Returns ``(real_prompt, raw_response, rollout_logprobs)``.
    """
    content_hash = hashlib.sha256(
        strip_thinking(visible_content).encode("utf-8"),
    ).hexdigest()
    async with httpx.AsyncClient(base_url=proxy_url, timeout=60.0) as client:
        resp = await client.get(
            "/v1/completions/raw",
            params={"content_hash": content_hash},
        )
        resp.raise_for_status()

    data = resp.json()
    return data["prompt"], data["response"], data["logprobs"]

@coderabbitai coderabbitai bot Feb 17, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for the tinker_inference_proxy file
find . -type f -name "*tinker_inference_proxy*" -o -name "*proxy*" | grep -E "\.(py|js|ts)$" | head -20

Repository: kfallah/CLaaS

Length of output: 125


🏁 Script executed:

#!/bin/bash
# Also search for /v1/completions/raw endpoint definition
rg -r "completions/raw" --type py -l

Repository: kfallah/CLaaS

Length of output: 453


🏁 Script executed:

#!/bin/bash
# Search for cache-related code and content_hash usage
rg -n "content_hash|cache" --type py | head -50

Repository: kfallah/CLaaS

Length of output: 4392


🏁 Script executed:

#!/bin/bash
cat -n claas/proxy/tinker_inference_proxy.py | head -100

Repository: kfallah/CLaaS

Length of output: 4120


🏁 Script executed:

#!/bin/bash
# Find the /v1/completions/raw endpoint handler
rg -n "completions/raw" claas/proxy/tinker_inference_proxy.py -A 20

Repository: kfallah/CLaaS

Length of output: 880


🏁 Script executed:

#!/bin/bash
# Search for cache key and content_hash logic
rg -n "content_hash|cache\[" claas/proxy/tinker_inference_proxy.py

Repository: kfallah/CLaaS

Length of output: 755


🏁 Script executed:

#!/bin/bash
# Find where cache entries are stored (put method calls)
rg -n "_completion_cache.put\|\.put\(" claas/proxy/tinker_inference_proxy.py -B 5 -A 2

Repository: kfallah/CLaaS

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Look at the _CompletionCacheEntry definition
rg -n "class _CompletionCacheEntry" claas/proxy/tinker_inference_proxy.py -A 10

Repository: kfallah/CLaaS

Length of output: 396


🏁 Script executed:

#!/bin/bash
# Check how runner.py uses the returned cached data
rg -n "await _fetch_cached_completion" claas/eval/runner.py -B 3 -A 10

Repository: kfallah/CLaaS

Length of output: 535


🏁 Script executed:

#!/bin/bash
# Search for cache.put or _completion_cache.put with broader context
rg -n "put\|_completion_cache\[" claas/proxy/tinker_inference_proxy.py -B 3 -A 3

Repository: kfallah/CLaaS

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Look for the full context around line 339-341 where cache is populated
sed -n '330,360p' claas/proxy/tinker_inference_proxy.py

Repository: kfallah/CLaaS

Length of output: 1245


🏁 Script executed:

#!/bin/bash
# Check what _generate_response returns and how it flows to cache
rg -n "_generate_response\|_completion_cache" claas/proxy/tinker_inference_proxy.py | head -30

Repository: kfallah/CLaaS

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check the full _fetch_cached_completion function and surrounding context
sed -n '188,240p' claas/eval/runner.py

Repository: kfallah/CLaaS

Length of output: 2058


🏁 Script executed:

#!/bin/bash
# Check if there's any validation of real_prompt vs actual prompt used
rg -n "real_prompt" claas/eval/runner.py -B 2 -A 2

Repository: kfallah/CLaaS

Length of output: 1249


🏁 Script executed:

#!/bin/bash
# Check test cases to understand the expected behavior
cat -n tests/test_tinker_proxy.py | head -150

Repository: kfallah/CLaaS

Length of output: 6690


🏁 Script executed:

#!/bin/bash
# Check if there's any verification that returned prompt matches what was expected
rg -n "prompt.*=.*_generate_response\|prompt.*expected\|assert.*prompt" claas/eval/runner.py -B 2 -A 2

Repository: kfallah/CLaaS

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check if the scenario of different prompts generating same content is possible
# Look at how OpenClaw templating works
rg -n "openclaw\|OpenClaw" claas/eval/runner.py | head -20

Repository: kfallah/CLaaS

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Let's look at the actual flow - see _generate_response and how content flows
sed -n '150,240p' claas/eval/runner.py

Repository: kfallah/CLaaS

Length of output: 3260


🏁 Script executed:

#!/bin/bash
# Understand if two different prompts could generate identical visible_content
# Check strip_thinking and visible content handling
rg -n "strip_thinking|visible_content" claas/eval/runner.py -B 2 -A 2 | head -50

Repository: kfallah/CLaaS

Length of output: 1286


🏁 Script executed:

#!/bin/bash
# Check the scenario more carefully - can two different OpenClaw-templated prompts 
# realistically produce identical visible content?
rg -n "OpenClaw\|_generate_response\|visible" claas/eval/runner.py | head -40

Repository: kfallah/CLaaS

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check if there's any request ID or correlation between generation and cache retrieval
rg -n "completion_id|request_id|uuid" claas/proxy/tinker_inference_proxy.py | head -20

Repository: kfallah/CLaaS

Length of output: 630


🏁 Script executed:

#!/bin/bash
# Check the flow: when a request comes in, is the cache entry guaranteed to be 
# for the same request that's retrieving it?
# Look at tinker_inference_proxy.py POST handler
sed -n '250,350p' claas/proxy/tinker_inference_proxy.py

Repository: kfallah/CLaaS

Length of output: 3473


🏁 Script executed:

#!/bin/bash
# Final check: see if there are any concurrent access patterns that could cause
# mismatches - e.g., multiple requests processing simultaneously
rg -n "threading|async|concurrent" claas/proxy/tinker_inference_proxy.py | head -15

Repository: kfallah/CLaaS

Length of output: 910


🏁 Script executed:

#!/bin/bash
# Let me verify the cache behavior when content_hash collides
# Look at the _CompletionCache.put method again
sed -n '200,230p' claas/proxy/tinker_inference_proxy.py

Repository: kfallah/CLaaS

Length of output: 1213


🏁 Script executed:

#!/bin/bash
# Check if there's any deduplication or collision detection in cache
rg -n "class _CompletionCache" claas/proxy/tinker_inference_proxy.py -A 40

Repository: kfallah/CLaaS

Length of output: 2315


🏁 Script executed:

#!/bin/bash
# Verify if there's any mitigation - check if real_prompt is validated after retrieval
sed -n '560,590p' claas/eval/runner.py

Repository: kfallah/CLaaS

Length of output: 1391


Cache key based solely on visible content can mix prompts.

The proxy cache lookup is keyed only on the SHA-256 hash of stripped response text. If two different prompts yield identical visible content, the cached {prompt, response, logprobs} entry will be shared or overwritten, causing the retrieved real_prompt to mismatch the actual generation request. This corrupts rollout logprobs and training signals.

Extend the cache key to include the tokenized prompt hash (or request ID), or validate that the cached entry's prompt matches the generation context before using it. Alternatively, add a fallback to /v1/score when the cached prompt is inconsistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/runner.py` around lines 188 - 214, The cache lookup in
_fetch_cached_completion currently keys only on
SHA-256(strip_thinking(visible_content)) which can return a cached
{prompt,response,logprobs} from a different generation; change the logic to
include the generation prompt (or its tokenized hash or request_id) in the cache
key or, if you cannot change the proxy key, fetch the cached entry and validate
that data["prompt"] equals the expected prompt/context before returning; if the
prompt does not match, fallback to fetching fresh scores (e.g., call /v1/score
or re-run scoring) and do not return the mismatched entry.

Owner Author

The cache is designed for single-request lookups immediately after generation — the eval runner generates a response, then immediately retrieves the cached entry by content hash. In practice, different prompts producing byte-identical visible responses within the cache's TTL window is extremely unlikely. Adding prompt hashing would complicate the cache key and require the caller to know the tokenized prompt at lookup time, which defeats the purpose of the content-addressed design.

Comment generated by Claude Code


@kfallah, thank you for the clarification. Your explanation makes sense — given the sequential generate-then-immediately-retrieve pattern in _generate_and_collect, the risk of a cache collision within that tight window is indeed negligible. The content-addressed design keeps the lookup simple and avoids the need to thread prompt context through the cache layer.

I'll consider this resolved. If collision edge cases ever become a concern in production, a lightweight validation (e.g., logging a warning if data["prompt"] doesn't match expectations) could be added as a safeguard, but it's not necessary given the current usage pattern.
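
The lightweight safeguard mentioned above could look roughly like the sketch below. The expected_context parameter and the log wording are hypothetical; nothing like this exists in the current code.

```python
# Hypothetical safeguard: warn (without failing) if the cached entry's prompt
# does not contain the context we generated against.
import logging

logger = logging.getLogger(__name__)

def warn_on_prompt_mismatch(data: dict, expected_context: str) -> None:
    cached_prompt = data.get("prompt", "")
    if expected_context and expected_context not in cached_prompt:
        logger.warning(
            "Proxy cache entry prompt does not contain the expected context; "
            "possible content-hash collision (cached prompt length=%d)",
            len(cached_prompt),
        )
```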

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
claas/eval_dashboard.html (1)

113-121: Expose expand/collapse state to assistive tech.

Consider updating aria-expanded (and aria-controls) when toggling details for better accessibility.

♿ Suggested tweak
 function toggleDetails(rowId, button) {
   const detailRow = document.getElementById(rowId);
+  button.setAttribute('aria-controls', rowId);
   if (detailRow.style.display === 'table-row') {
     detailRow.style.display = 'none';
     button.textContent = 'Expand';
+    button.setAttribute('aria-expanded', 'false');
     return;
   }
   detailRow.style.display = 'table-row';
   button.textContent = 'Collapse';
+  button.setAttribute('aria-expanded', 'true');
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval_dashboard.html` around lines 113 - 121, The toggleDetails function
should update ARIA attributes so assistive technology knows the expanded state:
when you get the detail row by id and toggle its display, also set the
triggering button's aria-expanded to "true" or "false" accordingly and ensure
the button has aria-controls set to the detail row id (add aria-controls if
missing); update the button.textContent as before and keep aria state in sync
with the visible state in the toggleDetails function.
claas/eval/__main__.py (1)

159-164: Consider validating --batch-size to prevent invalid values.

The --batch-size argument accepts any integer, including 0 or negative values, which would cause issues in the runner loop. Consider adding validation.

✨ Add validation
     parser.add_argument(
         "--batch-size",
         type=int,
         default=1,
+        choices=range(1, 17),
+        metavar="N",
         help="Samples per feedback step (default: 1 = progressive, 4 = batched)",
     )

Or validate in build_config:

if args.batch_size < 1:
    raise ValueError("--batch-size must be >= 1")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/__main__.py` around lines 159 - 164, Validate the --batch-size
argument to ensure it's >= 1 to prevent zero/negative values causing failures;
either add an argparse type/check by updating the
parser.add_argument("--batch-size", ...) to include a custom validator or after
parsing, assert/check args.batch_size and raise ValueError if args.batch_size <
1 (or normalize to 1). Locate the batch-size handling near parser.add_argument
and/or inside build_config (refer to args.batch_size and the build_config
function) and enforce the check with a clear error message like "--batch-size
must be >= 1".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@claas/eval/dashboard.py`:
- Around line 106-131: The _embed_plots function currently opens and
base64-encodes each PNG and will crash the dashboard if a file is unreadable;
wrap the per-file read/encode/append logic (the block that opens fpath, reads,
base64.b64encode, and images.append) in a try/except that catches
OSError/IOError and ValueError (and optionally Exception) to skip problematic
files and continue; on exception, emit a warning (use the module logger or
print) including the fname/fpath and the exception, and do not add that image to
images so img_tags generation remains safe.

In `@claas/eval/README.md`:
- Around line 26-28: Summary: The README fenced code blocks lack language
identifiers; add them to fix MD040 and improve rendering. Edit
claas/eval/README.md: change the block containing "GET
/v1/eval?results_dir=./eval_results" to start with ```http and change the
directory listing block beginning with "eval_results/" to start with ```text;
apply the same change for the other occurrence referenced (lines ~61-68) so all
relevant triple-backtick blocks have appropriate language identifiers.

In `@claas/eval/runner.py`:
- Around line 610-611: The fallback literal "I'd be happy to help you with
that." can contaminate metrics when all generations in a batch fail; change the
logic around the response_text fallback so that if response_text is None and
samples is empty you log a warning and either skip calling _measure_eval_metrics
for that step or pass None/empty to it (avoid the dummy string). Update the
block that sets response_text and the subsequent call site of
_measure_eval_metrics (and any callers that rely on response_text) to handle a
generation-failure case without injecting the placeholder.

In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 470-472: The incoming req.messages can contain
ChatCompletionMessage.content that is not a string, which will break
tokenizer.apply_chat_template; in the block building dicts (where dicts =
[{"role": m.role, "content": m.content or ""} for m in req.messages]) coerce
each m.content to a string (e.g., use str(m.content) or json.dumps for dict/list
payloads) so tokenizer.apply_chat_template always receives a string content;
update the dicts construction in tinker_inference_proxy.py accordingly and
ensure the coercion handles None (fallback to "").

---

Duplicate comments:
In `@claas/api.py`:
- Around line 844-857: The eval_dashboard endpoint passes the untrusted
results_dir directly to eval_dashboard_html, creating a path traversal risk; fix
by validating and normalizing the path before calling eval_dashboard_html:
resolve results_dir using pathlib.Path(...).resolve(), ensure it is a subpath of
a configured allowed base directory (e.g., EVAL_RESULTS_BASE or the app’s data
dir) and reject absolute or upward-traversing inputs (return HTTPException
400/403), and also verify that the expected summary.json exists inside the
resolved directory before calling eval_dashboard_html; update the eval_dashboard
function to perform these checks (use symbols eval_dashboard, results_dir,
eval_dashboard_html, summary.json) and only pass validated paths through.

In `@claas/eval/dashboard.py`:
- Around line 33-53: Wrap the JSON parsing in both _load_json and _load_steps
with try/except that catches json.JSONDecodeError (and optionally OSError) so
malformed files don't raise and crash the dashboard; in _load_json catch errors
around json.load(f), log the error and return {}; in _load_steps catch errors
around json.loads(line) per-line, log the offending line/path and continue (skip
that line) so you return the valid StepResult list created via
step_result_from_dict without raising.

In `@claas/eval/runner.py`:
- Around line 188-214: The cache currently keys only on
SHA-256(stripped_content) in _fetch_cached_completion which can collide when
different prompts produce the same visible content; change the function
signature of _fetch_cached_completion to accept the original prompt, compute a
combined hash (e.g., SHA-256 of prompt + a separator +
strip_thinking(visible_content)) instead of content_hash alone, and send that
combined hash in the request params (replace params={"content_hash":
content_hash} accordingly); update all callers of _fetch_cached_completion to
pass the prompt so the cache key uniquely includes both prompt and visible
content.

In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 65-74: The apply_chat_template return value can be several shapes
(list[int], BatchEncoding/dict with "input_ids", or nested lists); update
build_generation_prompt to normalize result into a flat list[int] before calling
T.ModelInput.from_ints: call self._tokenizer.apply_chat_template as before, then
if result is a list ensure its elements are ints (and not a list of lists), if
result is a mapping or BatchEncoding extract result["input_ids"], validate it's
a list[int], and otherwise raise a TypeError with a clear message; keep the rest
of the function using token_ids and T.ModelInput.from_ints(token_ids). Also
apply the same normalization pattern to the other call site that invokes
_tokenizer.apply_chat_template (the second similar prompt builder).

---

Nitpick comments:
In `@claas/eval_dashboard.html`:
- Around line 113-121: The toggleDetails function should update ARIA attributes
so assistive technology knows the expanded state: when you get the detail row by
id and toggle its display, also set the triggering button's aria-expanded to
"true" or "false" accordingly and ensure the button has aria-controls set to the
detail row id (add aria-controls if missing); update the button.textContent as
before and keep aria state in sync with the visible state in the toggleDetails
function.

In `@claas/eval/__main__.py`:
- Around line 159-164: Validate the --batch-size argument to ensure it's >= 1 to
prevent zero/negative values causing failures; either add an argparse type/check
by updating the parser.add_argument("--batch-size", ...) to include a custom
validator or after parsing, assert/check args.batch_size and raise ValueError if
args.batch_size < 1 (or normalize to 1). Locate the batch-size handling near
parser.add_argument and/or inside build_config (refer to args.batch_size and the
build_config function) and enforce the check with a clear error message like
"--batch-size must be >= 1".

Comment on lines +610 to +611
if response_text is None:
    response_text = "I'd be happy to help you with that."
@coderabbitai coderabbitai bot Feb 17, 2026

⚠️ Potential issue | 🟡 Minor

Fallback response may lead to training on dummy data.

When all samples in a batch fail to generate, response_text defaults to "I'd be happy to help you with that.". While samples will be empty (preventing feedback submission), the dummy text still flows to _measure_eval_metrics via the response_text parameter, potentially skewing metrics that rely on it.

Consider logging a warning or skipping metrics for steps with generation failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/runner.py` around lines 610 - 611, The fallback literal "I'd be
happy to help you with that." can contaminate metrics when all generations in a
batch fail; change the logic around the response_text fallback so that if
response_text is None and samples is empty you log a warning and either skip
calling _measure_eval_metrics for that step or pass None/empty to it (avoid the
dummy string). Update the block that sets response_text and the subsequent call
site of _measure_eval_metrics (and any callers that rely on response_text) to
handle a generation-failure case without injecting the placeholder.

Owner Author

The fallback only triggers when ALL samples in a batch fail to generate. In that case samples is empty, so no feedback is submitted and the SDPO training path is skipped entirely. The response_text fallback only flows to _measure_eval_metrics for dashboard display of a failed step — it doesn't affect training data.

Comment generated by Claude Code


@kfallah, thank you for the clarification! You're absolutely right—I see now that the if samples: guard at line 627 ensures no feedback is submitted when all generations fail, so the training path is completely protected. The fallback text only serves as a display placeholder for _measure_eval_metrics, which is a reasonable design for handling failed steps in the dashboard.

I appreciate you taking the time to explain the intended behavior!

┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  ((\                    ┃
┃  ( -.-)   Got it!       ┃
┃  o_(")(") Thanks for    ┃
┃           explaining!   ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━┛


📝 Learnings were identified and not saved (knowledge base disabled).

@kfallah kfallah force-pushed the eval-harness-tinker-batch branch from 4b9bf99 to 6a49e98 on February 17, 2026 05:40
Extract _apply_chat_template_ids() helper with Any-typed tokenizer
param so ty cannot narrow the complex union return type. Fix score
endpoint tests to wrap compute_logprobs return in a mock Future
(matching the .result() call in the implementation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
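
The Future-wrapping test pattern described in this commit can be mocked with the standard library alone. A sketch, where everything except compute_logprobs() and .result() is an illustrative name:

```python
# Sketch of mocking a Future-returning compute_logprobs() in a unit test.
from unittest.mock import MagicMock

fake_future = MagicMock()
fake_future.result.return_value = [-0.10, -0.52, -0.03]  # fake per-token logprobs

mock_client = MagicMock()
mock_client.compute_logprobs.return_value = fake_future

# The code under test calls compute_logprobs(...).result() and gets the list back.
assert mock_client.compute_logprobs("anything").result() == [-0.10, -0.52, -0.03]
```
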
@kfallah kfallah force-pushed the eval-harness-tinker-batch branch from 6a49e98 to 33235b6 on February 17, 2026 05:42
kfallah and others added 5 commits February 16, 2026 21:46
Matches the coercion already applied in /v1/chat/completions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Catch JSONDecodeError in _load_json and _load_steps, and OSError in
_embed_plots so that corrupted or partially-written files don't crash
the dashboard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validate that the resolved path is within the base eval_results
directory to prevent path traversal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Coalesce data["logprobs"] to [] since _CompletionCacheEntry.logprobs
is Optional[list[float]].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
claas/eval/collapse.py (1)

205-213: ⚠️ Potential issue | 🟡 Minor

Avoid treating missing logprobs as zero entropy.
If logprobs_content is empty (e.g., upstream omits top_logprobs), returning (0.0, 0.0) can trigger false collapse alerts. Since entropy is now optional, return None (and warn) when logprobs are unavailable.

🛠️ Suggested fix
-    if not logprobs_content:
-        return (0.0, 0.0)
+    if not logprobs_content:
+        logger.warning("No logprobs returned; entropy unavailable.")
+        return (None, 0.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/collapse.py` around lines 205 - 213, The current branch in
collapse.py treats an empty logprobs_content as zero entropy by returning (0.0,
0.0); change this so missing top_logprobs returns None instead to avoid false
collapse alerts: inside the block that reads choice = resp.json()["choices"][0]
and computes logprobs_content, if logprobs_content is falsy, emit a warning (use
the module logger or warnings.warn) mentioning the response_text/choice context
and return (None, None) rather than (0.0, 0.0) so callers can detect unavailable
entropy.
🧹 Nitpick comments (4)
claas/eval/runner.py (1)

231-231: Consider a more descriptive assertion message.

The bare assert config.proxy_url is not None will produce an unhelpful AssertionError if triggered. A message would aid debugging.

♻️ Suggested improvement
-    assert config.proxy_url is not None
+    assert config.proxy_url is not None, "_generate_and_collect requires proxy_url"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/runner.py` at line 231, Replace the bare assertion with a
descriptive message (or explicit check) so failures are informative: change the
line with assert config.proxy_url is not None to include a message like assert
config.proxy_url is not None, "config.proxy_url must be set (proxy_url missing
from config)" or alternatively perform an explicit check and raise a
ValueError/RuntimeError with that message to locate and diagnose the missing
proxy_url when functions referencing config.proxy_url (e.g., in runner.py where
config.proxy_url is used) fail.
claas/eval/dashboard.py (1)

26-26: Template path may be inconsistent with claas/api.py.

This file defines EVAL_DASHBOARD_TEMPLATE as Path(__file__).resolve().parent.parent / "eval_dashboard.html" (i.e., claas/eval_dashboard.html), while claas/api.py line 81 defines it as Path(__file__).resolve().parent / "eval_dashboard.html" (i.e., also claas/eval_dashboard.html). The paths resolve to the same location, but having two separate constants is redundant.

♻️ Consider importing from a single source

Either import the constant from claas/api.py or define it in a shared location to avoid duplication:

-EVAL_DASHBOARD_TEMPLATE = Path(__file__).resolve().parent.parent / "eval_dashboard.html"
+from claas.api import EVAL_DASHBOARD_TEMPLATE

Or move both template constants to a shared claas/templates.py module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/dashboard.py` at line 26, The EVAL_DASHBOARD_TEMPLATE constant is
duplicated between claas/eval/dashboard.py (EVAL_DASHBOARD_TEMPLATE) and
claas/api.py; refactor to a single source of truth by removing the duplicate and
importing the constant instead of redefining it, or move the constant to a new
shared module (e.g., claas/templates.py) and update both places to import that
symbol; update references in functions/classes that use EVAL_DASHBOARD_TEMPLATE
to import the shared constant so only one definition exists.
tests/integration/test_tinker_stack_integration.py (1)

21-22: Consider reusing the shared strip_thinking helper instead of duplicating logic.

The _THINK_RE pattern and orphaned </think> stripping logic are duplicated from claas/proxy/tinker_inference_proxy.py and claas/eval/verifiers.py. This creates maintenance risk if the stripping behavior needs to change.

♻️ Suggested refactor
+from claas.eval.verifiers import strip_thinking
+
-_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
-
 ...
 
 def _fetch_raw_completion(...):
-    visible = _THINK_RE.sub("", response_content)
-    idx = visible.find("</think>")
-    if idx >= 0:
-        visible = visible[idx + len("</think>"):]
-    visible = visible.strip()
+    visible = strip_thinking(response_content)
     content_hash = hashlib.sha256(visible.encode("utf-8")).hexdigest()

Also applies to: 43-48

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/test_tinker_stack_integration.py` around lines 21 - 22, The
test defines a duplicate regex _THINK_RE and manual </think> stripping; instead
import and use the shared strip_thinking helper (or equivalent) used by the
production modules to avoid drift: replace usages of _THINK_RE and any manual
think-tag removal in tests/integration/test_tinker_stack_integration.py with a
call to strip_thinking (or the function name used in
claas/proxy/tinker_inference_proxy.py / claas/eval/verifiers.py), updating the
test imports accordingly and removing the orphaned regex and trim logic so all
think-tag behavior is centralized.
claas/eval/collapse.py (1)

262-308: Pass system_prompt and prompt_preamble to Tinker mode or make the silent drop explicit.

_entropy_and_logprob_tinker always uses a single user message, silently ignoring any system_prompt or prompt_preamble passed through measure_entropy_and_mean_logprob. This creates inconsistency: the fallback vLLM path (lines 175–187) properly constructs messages from these parameters, but Tinker mode discards them. Thread a prebuilt messages list through, matching the pattern used in the fallback path, or add an assertion/log warning when these parameters are provided to surface the divergence.

♻️ Proposed refactor to pass messages
-    if openclaw_url and proxy_url:
+    if openclaw_url and proxy_url:
+        tinker_messages: list[dict[str, str]] = list(prompt_preamble or [])
+        if system_prompt and not any(
+            m.get("role") == "system" and m.get("content") == system_prompt
+            for m in tinker_messages
+        ):
+            tinker_messages.append({"role": "system", "content": system_prompt})
+        tinker_messages.append({"role": "user", "content": prompt})
         return await _entropy_and_logprob_tinker(
             openclaw_url=openclaw_url,
             openclaw_api_key=openclaw_api_key,
             proxy_url=proxy_url,
             prompt=prompt,
             timeout_s=timeout_s,
             rollout_log=rollout_log,
+            messages=tinker_messages,
         )
@@
 async def _entropy_and_logprob_tinker(
     openclaw_url: str,
     openclaw_api_key: str,
     proxy_url: str,
     prompt: str = COLLAPSE_PROBE,
     timeout_s: float = 60.0,
     rollout_log: list[EvalRollout] | None = None,
+    messages: list[dict[str, str]] | None = None,
 ) -> tuple[None, float]:
@@
-    # Generate through OpenClaw
+    # Generate through OpenClaw
+    messages = messages or [{"role": "user", "content": prompt}]
     async with httpx.AsyncClient(base_url=openclaw_url, timeout=timeout_s) as client:
         resp = await client.post(
             "/v1/chat/completions",
             json={
                 "model": "openclaw",
-                "messages": [{"role": "user", "content": prompt}],
+                "messages": messages,
                 "temperature": 0,
                 "max_tokens": 2048,
             },
             headers=headers,
         )
@@
     async with httpx.AsyncClient(base_url=proxy_url, timeout=timeout_s) as client:
         score_resp = await client.post(
             "/v1/score",
             json={
-                "messages": [{"role": "user", "content": prompt}],
+                "messages": messages,
                 "completion": response_text,
             },
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/collapse.py` around lines 262 - 308, The tinker path in
_entropy_and_logprob_tinker ignores system_prompt and prompt_preamble (unlike
the fallback in measure_entropy_and_mean_logprob), so either thread a prebuilt
messages list through or surface the mismatch; update
_entropy_and_logprob_tinker to accept or build the same messages structure used
by measure_entropy_and_mean_logprob (include the system message if system_prompt
is set, include a prompt_preamble user message if set, then the main user
prompt), use that messages list in the OpenClaw /v1/chat/completions request and
in the proxy /v1/score call instead of the single-user message, or if you prefer
not to change the signature add a clear warning/assert when system_prompt or
prompt_preamble are passed so the silent drop is explicit. Ensure references to
functions/parameters: _entropy_and_logprob_tinker,
measure_entropy_and_mean_logprob, system_prompt, prompt_preamble, and the
/v1/chat/completions and /v1/score request payloads are updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@claas/eval/__main__.py`:
- Around line 159-164: The CLI currently allows non-positive values for
--batch-size which will break the runner; validate args.batch_size >= 1 and
surface a clear error. Add a check after parsing (or in build_config) that
inspects args.batch_size and calls parser.error("--batch-size must be at least
1") or raise a ValueError with that message; locate the validation near
parser.add_argument(...) or inside build_config where args are processed to
ensure the runner never receives a non-positive batch_size.

In `@claas/eval/logprob.py`:
- Around line 138-143: The loop over raw_logprobs[prompt_token_count:] can crash
when an entry is an empty dict because next(iter(entry.values())) raises
StopIteration; update the loop in claas/eval/logprob.py (the loop handling
raw_logprobs[prompt_token_count:]) to guard against empty or falsy entries by
skipping when entry is None or not entry (or when entry.values() is empty)
before calling next(iter(...)), so it safely continues on malformed API
responses and only sums top["logprob"] for valid entries.

In `@claas/proxy/tinker_inference_proxy.py`:
- Around line 63-68: The call to tokenizer.apply_chat_template may return
unexpected types causing KeyError/TypeError when accessing result["input_ids"];
update the logic around tokenizer.apply_chat_template (in
tinker_inference_proxy) to defensively handle results by: 1) wrapping the call
in a try/except to catch exceptions from the tokenizer, 2) if result is a list
return it, 3) if result is a dict-like object check for the "input_ids" key and
return it, and 4) for any other type or missing key raise a clear ValueError (or
log and return a safe default) describing the unexpected return type and include
the actual result for debugging. Ensure references to
tokenizer.apply_chat_template and the result variable are used in the error
message for easier tracing.
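
A hedged sketch of the defensive helper described in the item above. The name _apply_chat_template_ids comes from a commit message earlier in this thread and the Any-typed tokenizer parameter matches that description, but the exact error handling below is an assumption, not the repository's implementation.

from typing import Any

def _apply_chat_template_ids(tokenizer: Any, messages: list[dict[str, str]]) -> list[int]:
    """Apply the chat template and normalize the tokenizer's varied return types to token ids."""
    try:
        result = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    except Exception as exc:
        raise ValueError(f"apply_chat_template failed for {len(messages)} messages: {exc!r}") from exc
    if isinstance(result, list):
        return result
    if hasattr(result, "__getitem__"):
        try:
            return list(result["input_ids"])  # dict-like result (e.g. BatchEncoding)
        except (KeyError, TypeError, IndexError):
            pass
    raise ValueError(
        f"Unexpected apply_chat_template return type {type(result).__name__}: {result!r}"
    )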

---


Duplicate comments:
In `@claas/api.py`:
- Around line 844-857: The eval_dashboard handler accepts a user-controlled
results_dir which can enable path traversal; before calling eval_dashboard_html,
resolve and validate the requested path against an application base directory
(e.g., a configured EVAL_BASE_DIR) using pathlib.Path.resolve(), disallow
absolute paths or any path whose resolved form is not a subpath of the base
(check by comparing parents or using Path.relative_to in a try/except), and
return an error (400/403) if validation fails; update the async def
eval_dashboard(...) function to perform this sanitization on results_dir and
pass only the validated path to eval_dashboard_html (see the sketch after this list).

In `@claas/eval/dashboard.py`:
- Around line 33-53: Wrap the JSON parsing in _load_json and _load_steps with
try/except to catch json.JSONDecodeError and handle it gracefully: in _load_json
(function _load_json) surround json.load(f) with a try/except
json.JSONDecodeError and return {} (and optionally log a warning) instead of
letting the exception propagate; in _load_steps (function _load_steps) surround
json.loads(line) / step_result_from_dict(...) with try/except
json.JSONDecodeError (and/or ValueError from step_result_from_dict) to skip
malformed lines (log a warning including the filename/run_dir and the offending
line) and continue processing the rest of the file so a single bad record
doesn't cause a 500. Ensure logging uses the module logger (e.g.,
logging.getLogger(__name__)) if not already present.
- Around line 106-131: The _embed_plots function currently opens and encodes PNG
files without handling read errors which can crash the dashboard; wrap the file
open/read/base64.b64encode section (the block that reads fpath and builds b64
and appends to images) in a try/except that catches IO and decoding exceptions,
log or warn the failure (using the module logger or logging.warning) with the
filename and exception, and continue so corrupted/missing files are skipped;
keep the existing with open(...) context and ensure only successfully encoded
files are appended to images so img_tags generation remains safe.

In `@claas/eval/README.md`:
- Around line 26-28: Add language identifiers to the fenced code blocks in
claas/eval/README.md: change the HTTP request fence that currently wraps "GET
/v1/eval?results_dir=./eval_results" to use ```http and change the
filesystem/tree fence around the eval_results example to use ```text; apply the
same fix to the other fenced block mentioned (lines 61-68) so all raw fences
include appropriate language identifiers for syntax highlighting.

In `@claas/eval/runner.py`:
- Around line 610-611: The fallback response_text ("I'd be happy to help you
with that.") must not be passed into _measure_eval_metrics because it skews
metrics; instead detect when response_text is still None after generation
attempts and either return early or record the sample as a failure (increment
the failure counter / call the existing failure-tracking helper) without calling
_measure_eval_metrics. Update the logic around the response_text assignment and
the call site of _measure_eval_metrics (the variable response_text and function
_measure_eval_metrics in runner.py) so real generated outputs only are measured
and fallback cases are flagged/returned as failures.
- Around line 188-214: The cache is currently keyed only by
SHA-256(stripped_visible_content) in _fetch_cached_completion, causing
collisions when different prompts produce the same visible content; update the
function to include the original prompt (or other distinguishing inputs like
model/params) in the cache key and/or validate the returned entry before using
it. Concretely: change _fetch_cached_completion signature to accept the original
prompt (or relevant request metadata), compute a composite cache key (e.g.,
hash(strip_thinking(visible_content) + original_prompt or include model/params),
use that key for the proxy lookup instead of content_hash alone, and after
fetching verify data["prompt"] matches the expected original prompt (or
reject/ignore the cached entry when it does not) so wrong generations are not
returned. Ensure you update all call sites to pass the new parameter and keep
strip_thinking usage for defensive normalization (a sketch also follows this list).

In `@claas/proxy/tinker_inference_proxy.py`:
- Line 498: The dict comprehension building request messages uses naive coercion
("m.content or \"\"") which can leave non-string payloads and break downstream
chat template rendering; update the comprehension in tinker_inference_proxy.py
(where dicts is created) to call the existing _coerce_content helper (i.e.,
{"role": m.role, "content": _coerce_content(m.content)}) so all message contents
are normalized to strings, and ensure _coerce_content is accessible in that
scope (import or move it if necessary).
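
A minimal sketch of the path-traversal check suggested for the eval_dashboard handler in the claas/api.py item above; EVAL_BASE_DIR, the FastAPI HTTPException, and the helper name are assumptions rather than the repository's actual code.

from pathlib import Path

from fastapi import HTTPException

EVAL_BASE_DIR = Path("./eval_results").resolve()  # assumed base directory for eval outputs

def validate_results_dir(results_dir: str) -> Path:
    """Resolve a user-supplied results_dir and reject anything outside EVAL_BASE_DIR."""
    candidate = (EVAL_BASE_DIR / results_dir).resolve()
    try:
        candidate.relative_to(EVAL_BASE_DIR)
    except ValueError:
        # The resolved path escaped the base directory (e.g. via ../ or an absolute path).
        raise HTTPException(status_code=400, detail="results_dir must be inside the eval results directory")
    return candidate

And a hypothetical sketch of the composite cache key suggested for _fetch_cached_completion; the entry's "prompt" field and the exact key inputs are assumptions about the cache shape, not the actual implementation.

import hashlib

from claas.eval.verifiers import strip_thinking  # shared helper referenced in this review

def completion_cache_key(visible_content: str, original_prompt: str, model: str) -> str:
    """Key the cache on model + prompt + stripped visible content so identical
    responses to different prompts cannot collide."""
    payload = "\x1f".join((model, original_prompt, strip_thinking(visible_content)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def validate_cached_entry(entry: dict, original_prompt: str) -> dict | None:
    # Defensive check: reject entries whose stored prompt does not match the request.
    return entry if entry.get("prompt") == original_prompt else None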


Comment on lines +138 to +143
for entry in raw_logprobs[prompt_token_count:]:
    if entry is None:
        continue
    top = next(iter(entry.values()))
    logprob_sum += top["logprob"]


⚠️ Potential issue | 🟡 Minor

Guard against empty logprob entries.

If entry is an empty dict (malformed API response), next(iter(entry.values())) will raise StopIteration. Consider adding a guard.

🛡️ Suggested fix
     for entry in raw_logprobs[prompt_token_count:]:
         if entry is None:
             continue
+        if not entry:
+            continue
         top = next(iter(entry.values()))
         logprob_sum += top["logprob"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@claas/eval/logprob.py` around lines 138 - 143, The loop over
raw_logprobs[prompt_token_count:] can crash when an entry is an empty dict
because next(iter(entry.values())) raises StopIteration; update the loop in
claas/eval/logprob.py (the loop handling raw_logprobs[prompt_token_count:]) to
guard against empty or falsy entries by skipping when entry is None or not entry
(or when entry.values() is empty) before calling next(iter(...)), so it safely
continues on malformed API responses and only sums top["logprob"] for valid
entries.

    default=None,
    help=(
        "Output directory for results "
        "(default: auto ./data/evals/<UTC timestamp>)"
kfallah (Owner Author) commented:

Why do we have the default documented in the help text while the actual default value is None? This is ridiculous.

kfallah (Owner Author) commented:

Fixed in 5ff74ce. Simplified the help text to "Output directory for results (auto-generated if omitted)".

Comment generated by Claude Code

kfallah (Owner Author) commented:

This missed the point of my message. If we have a default value for the output_dir, why don't we set it as the default for this arg? Why have None that confusingly maps to a default?

kfallah (Owner Author) commented:

Fixed in 9bb7ed2. --output-dir now defaults to ./data/evals (not None). A timestamped subdir is always appended in build_config.

Comment generated by Claude Code
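
A minimal sketch of the resolved behavior, assuming argparse: --output-dir defaults to ./data/evals and a UTC-timestamped subdirectory is always appended. The resolve_output_dir helper name is hypothetical, used only to illustrate the build_config step described above.

import argparse
from datetime import datetime, timezone
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--output-dir",
    type=Path,
    default=Path("./data/evals"),
    help="Base output directory; a UTC-timestamped subdirectory is always appended.",
)

def resolve_output_dir(base: Path) -> Path:
    # Always append a timestamp so repeated runs never collide or overwrite results.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return base / stamp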

      via proxy /v1/score to get mean_logprob for drift detection.
    - **Direct vLLM** (fallback, no openclaw_url): Legacy direct-to-vLLM path.
    """
    if openclaw_url and proxy_url:
kfallah (Owner Author) commented:

This is super confusing. We have different logic for local vs. Tinker, but we try to infer it from what was passed in. It makes NO sense. Why not just have an explicit tinker or local mode and enforce at the start that all the right config is passed in?

kfallah (Owner Author) commented:

Addressed in 85170af. Removed the engine-specific Tinker branch — the function no longer infers mode from parameters. The OpenClaw path handles both local and Tinker routing uniformly.

Comment generated by Claude Code

kfallah (Owner Author) commented:

You did not address my larger point: why do we not have a way to configure execution mode in eval the way we do in training?

kfallah (Owner Author) commented:

Fixed in 9bb7ed2. Added explicit --mode local|tinker to HarnessConfig and CLI. Extracted openclaw_chat_params() and direct_vllm_chat_params() as two clear routing functions in types.py. All callers (runner, collapse) now use these instead of inline if/else branching on URL presence. The runner checks config.mode for tinker-vs-local decisions (LoRA loading, proxy cache).

Comment generated by Claude Code
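
A hedged sketch of the explicit mode flag and routing helpers named in the fix above; ChatRequestParams, openclaw_chat_params, and direct_vllm_chat_params come from the resolution, while their fields and the bearer-token header are assumptions.

import argparse
from dataclasses import dataclass, field

@dataclass
class ChatRequestParams:
    base_url: str
    model: str
    headers: dict[str, str] = field(default_factory=dict)

def openclaw_chat_params(openclaw_url: str, openclaw_api_key: str) -> ChatRequestParams:
    # Both local and tinker runs route chat traffic through OpenClaw.
    return ChatRequestParams(
        base_url=openclaw_url,
        model="openclaw",
        headers={"Authorization": f"Bearer {openclaw_api_key}"},
    )

def direct_vllm_chat_params(vllm_url: str, base_model: str) -> ChatRequestParams:
    # Legacy fallback: call vLLM's OpenAI-compatible server directly.
    return ChatRequestParams(base_url=vllm_url, model=base_model)

parser = argparse.ArgumentParser()
parser.add_argument("--mode", choices=["local", "tinker"], default="local")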

kfallah and others added 14 commits February 18, 2026 01:51
Engines should emit canonical metric keys directly. Remove the
normalize_sdpo_metrics helper and its callers in api.py and runner.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete the Tinker-specific collapse implementation and remove proxy_url
routing from measure_entropy_and_mean_logprob and measure_collapse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…branch

The score endpoint only needs raw prompt text + completion. Remove the
messages/chat-template codepath that added unnecessary complexity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove ge/le validators so the API doesn't reject unusual learning rates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Files remain on disk but are no longer tracked (.gitignore updated in prior commit).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use DistillBatchItem for feedback samples instead of dict[str, object]
- Use FeedbackBatchRequest/DistillRequest to build the API payload
- Add ExperimentSummary dataclass for _write_summary entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --mode local|tinker to HarnessConfig and CLI, replacing inference
  from URL presence. Tinker mode validates --openclaw-url is provided.
- Set --output-dir default to ./data/evals (not None), always append
  a timestamped subdir.
- Replace config.proxy_url checks in runner.py with config.mode checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ChatRequestParams dataclass and two builder functions in types.py
  for clear separation of OpenClaw vs direct-vLLM communication
- Replace inline if/else branching in runner._generate_response,
  collapse.measure_entropy_and_mean_logprob, and
  collapse.measure_self_rouge_l with the shared helpers
- Fix --output-dir to default to ./data/evals (not None), always
  appends a timestamped subdir
- Remove tinker --openclaw-url requirement (not needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval harness was showing null for all SDPO metrics in Tinker mode
because the Tinker engine returns different metric names (adv_mean,
kl_mean, etc.) than the local engine (distill_loss, kl_reg, etc.).

Instead of converting between types, support both metric shapes natively:
- Rename SDPOMetrics -> LocalDistillMetrics
- Add TinkerDistillMetrics dataclass with native Tinker field names
- Branch on metadata keys in runner, plotting, and dashboard
- Detect type via "adv_mean" key presence during deserialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
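
A minimal sketch of the dual metric shapes described in this commit; only the class names and the adv_mean/kl_mean/distill_loss/kl_reg keys are taken from the message above, while the remaining field lists and the deserialization helper are illustrative.

from dataclasses import dataclass

@dataclass
class LocalDistillMetrics:
    distill_loss: float
    kl_reg: float

@dataclass
class TinkerDistillMetrics:
    adv_mean: float
    kl_mean: float

def distill_metrics_from_dict(metadata: dict) -> LocalDistillMetrics | TinkerDistillMetrics:
    # "adv_mean" only appears in Tinker engine output, so its presence selects the shape.
    if "adv_mean" in metadata:
        return TinkerDistillMetrics(adv_mean=metadata["adv_mean"], kl_mean=metadata["kl_mean"])
    return LocalDistillMetrics(distill_loss=metadata["distill_loss"], kl_reg=metadata["kl_reg"])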
Move DEFAULT_SYSTEM_PROMPT ("You are a helpful assistant.") into
core/config.py so both eval types and teacher_helpers share one source.
Change teacher default from "expert coding assistant" to match, and
update feedback framing from "unsuccessful earlier attempt" to
"positive or negative feedback" since feedback can be either.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The config now defaults feedback_log_dir to ./data/feedback, so the
explicit FEEDBACK_LOG_DIR env var in Docker Compose and the local
stack script is no longer needed. Update volume mounts to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kfallah
Owner Author

kfallah commented Feb 18, 2026

@codex review

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae09066fa7


Comment on lines +113 to +115
prompt = "\n".join(
    f"{m.get('role', 'user')}: {m.get('content', '')}" for m in messages
)


P1: Use chat-template inputs for proxy logprob scoring

When proxy_url is set, the scorer converts structured chat messages into a plain newline string ("role: content") before calling /v1/score, so Tinker-mode margins are computed on a different prompt format than generation and direct-vLLM scoring. This can materially skew or even invert margin deltas for chat models (especially with system prompts), which undermines the core evaluation metric; pass chat-structured inputs (or a properly rendered chat template) to the proxy instead of flattening roles into raw text.

Useful? React with 👍 / 👎.
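
One possible shape of the fix, sketched under the assumption that the scorer can load the base model's Hugging Face tokenizer: render the same chat template used at generation time and send that string to /v1/score, instead of flattening messages into "role: content" lines.

from transformers import AutoTokenizer

def render_prompt_for_scoring(messages: list[dict[str, str]], base_model: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # tokenize=False returns the rendered template string, so the scored prompt
    # matches what the model saw during generation (including the assistant prefix).
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )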

# Gemini feedback override (optional)
if needs_generation and gemini_user:
    try:
        gemini_result = await gemini_user.evaluate_response(response_text, prompt)


P2: Pair Gemini evaluation with the matching batch prompt

In batched mode, response_text is captured from the first successful sample, but prompt here is the loop variable after batching (typically the last sample), so Gemini can evaluate a response against the wrong prompt whenever batch_size > 1. If Gemini feedback is enabled, this mismatch can produce irrelevant feedback and then overwrite all samples in the batch with that incorrect feedback.

Useful? React with 👍 / 👎.
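
A hypothetical sketch of the pairing fix suggested above: keep each prompt next to the response it produced so batched Gemini evaluation never scores a response against another sample's prompt. Only the gemini_user.evaluate_response(response_text, prompt) call is taken from the excerpt above; the rest is illustrative.

async def evaluate_batch(gemini_user, samples: list[tuple[str, str | None]]) -> list:
    """samples holds (prompt, response_text) pairs collected per batch item."""
    results = []
    for sample_prompt, sample_response in samples:
        if sample_response is None:
            continue  # failed generation; nothing to evaluate
        results.append(await gemini_user.evaluate_response(sample_response, sample_prompt))
    return results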

kl_mean from TinkerDistillMetrics is a signed log-prob delta, not a
true KL divergence. Stop plotting it and hide the right panel for
Tinker runs. Also differentiate axis labels between Tinker (advantage)
and local (GJS loss) engines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kfallah kfallah merged commit 60f0b03 into main Feb 18, 2026
3 checks passed
@kfallah kfallah deleted the eval-harness-tinker-batch branch February 18, 2026 23:45