@knfreemLD knfreemLD commented Feb 5, 2026

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Related issues

See https://docs.google.com/document/d/1lzYwQqCcTzN_2zkxJZDfJtgUcEJ4jbpx0KSsJ2bRENw/edit?tab=t.0#heading=h.5d8l30brvyuw for context

For other SDK implementations, see:

Describe the solution you've provided

Extends the Go SDK to support AI Config evaluations, including custom evaluators.

The implementation is intended to be consistent with the Python and Node SDK implementations. Changes were verified with a local test app; the resulting data can be observed in the evaluator metrics for this AI Config.


Note

Medium Risk
Adds new evaluation and metric-tracking paths (including dynamic metric keys and new event payload fields), which could affect analytics correctness and runtime behavior if misconfigured. Changes are well-covered by tests but touch core SDK tracking surfaces.

Overview
Adds judge-mode support to AI Configs by extending the config data model and builder with mode, evaluationMetricKey/evaluationMetricKeys, and judgeConfiguration (with defensive copying to keep configs immutable).

Introduces Client.JudgeConfig to fetch judge configs while preserving {{message_history}} / {{response_to_evaluate}} placeholders for a second Mustache interpolation pass during evaluation, and adds a new ldai/judge package that samples, interpolates, invokes a structured provider, and parses judge responses.
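The placeholder-preservation idea described above can be illustrated with a minimal sketch. It assumes a simplified stand-in renderer (the `render` function below handles only plain `{{name}}` tags and is not a real Mustache implementation), and the helper name `withPreservedPlaceholders` and the `audience` variable are hypothetical. Mapping each judge placeholder to its own literal text lets it survive the first pass so that the second pass can fill it in.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches simple {{name}} tags. It is a stand-in for a real
// Mustache renderer and supports only plain variable tags.
var placeholder = regexp.MustCompile(`\{\{\s*(\w+)\s*\}\}`)

// render replaces each {{name}} tag with vars[name]. Unknown names become
// the empty string. Replacement text is not re-scanned within the same pass.
func render(template string, vars map[string]string) string {
	return placeholder.ReplaceAllStringFunc(template, func(tag string) string {
		name := placeholder.FindStringSubmatch(tag)[1]
		return vars[name]
	})
}

// withPreservedPlaceholders (hypothetical name) copies vars and maps the two
// judge placeholders to their own literal text, so the first pass leaves
// them intact for a later pass.
func withPreservedPlaceholders(vars map[string]string) map[string]string {
	out := make(map[string]string, len(vars)+2)
	for k, v := range vars {
		out[k] = v
	}
	out["message_history"] = "{{message_history}}"
	out["response_to_evaluate"] = "{{response_to_evaluate}}"
	return out
}

func main() {
	template := "Audience: {{audience}}\nHistory: {{message_history}}\nResponse: {{response_to_evaluate}}"

	// First pass: caller variables are filled in, judge placeholders survive.
	firstPass := render(template, withPreservedPlaceholders(map[string]string{
		"audience": "support",
	}))
	fmt.Println(firstPass)

	// Second pass: the evaluation step supplies the actual values.
	secondPass := render(firstPass, map[string]string{
		"message_history":      "user: hi",
		"response_to_evaluate": "hello there",
	})
	fmt.Println(secondPass)
}
```

Without the self-mapping step, the first pass would render both judge placeholders as empty strings and the evaluation pass would have nothing left to fill.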

Extends Tracker with TrackJudgeResponse to emit evaluation scores as metrics (including optional judgeConfigKey in event data), and adds comprehensive tests covering parsing, placeholder preservation, schema generation, sampling, and response validation.
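As a rough illustration of an optional judgeConfigKey in event data, the sketch below builds a payload that includes the field only when a value is present. The helper name `buildJudgeEventData` and the `configKey` field name are assumptions made for this example; the actual payload shape used by TrackJudgeResponse is not shown in this conversation.

```go
package main

import "fmt"

// buildJudgeEventData is a hypothetical helper, not the SDK's API. It
// returns event data that always carries the config key and adds
// judgeConfigKey only when one was supplied.
func buildJudgeEventData(configKey, judgeConfigKey string) map[string]string {
	data := map[string]string{
		"configKey": configKey, // assumed field name
	}
	if judgeConfigKey != "" {
		data["judgeConfigKey"] = judgeConfigKey
	}
	return data
}

func main() {
	fmt.Println(buildJudgeEventData("my-ai-config", ""))
	fmt.Println(buildJudgeEventData("my-ai-config", "my-judge-config"))
}
```

Omitting the field entirely, rather than sending an empty string, keeps existing consumers of the event unaffected when no judge config is involved.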

Written by Cursor Bugbot for commit 41141b9. This will update automatically on new commits.

@knfreemLD knfreemLD requested a review from jsonbailey February 5, 2026 18:04
@knfreemLD knfreemLD requested a review from a team as a code owner February 5, 2026 18:04
knfreemLD and others added 3 commits February 5, 2026 13:20
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

extendedVariables["response_to_evaluate"] = "{{response_to_evaluate}}"

return c.Config(key, context, defaultValue, extendedVariables)
}

JudgeConfig double-tracks config and judge metric events

Medium Severity

JudgeConfig emits $ld:ai:judge:function:single at line 196, then delegates to c.Config which independently emits $ld:ai:config:function:single at line 73. Every judge evaluation is therefore double-counted — once as a judge function call and once as a regular config function call. This inflates the config function metric on the monitoring dashboard, making it appear there are more regular config evaluations than actually occurred.
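One possible way to avoid the double count, sketched under assumptions: move the shared evaluation logic into an internal method that tracks nothing, and have each public entry point emit only its own usage event. The `client` type, its `events` counter, and the `evaluate` method are hypothetical stand-ins, not the SDK's real types; only the two event names come from the report above.

```go
package main

import "fmt"

// client is a hypothetical stand-in that counts emitted usage events.
type client struct {
	events map[string]int
}

func newClient() *client {
	return &client{events: make(map[string]int)}
}

func (c *client) track(event string) {
	c.events[event]++
}

// evaluate stands in for the shared evaluation and interpolation logic.
// It deliberately emits no usage event.
func (c *client) evaluate(key string, vars map[string]string) string {
	return "evaluated:" + key
}

// Config tracks the regular config usage event, then evaluates.
func (c *client) Config(key string, vars map[string]string) string {
	c.track("$ld:ai:config:function:single")
	return c.evaluate(key, vars)
}

// JudgeConfig tracks only the judge usage event and calls the untracked
// internal method instead of Config.
func (c *client) JudgeConfig(key string, vars map[string]string) string {
	c.track("$ld:ai:judge:function:single")
	return c.evaluate(key, vars)
}

func main() {
	c := newClient()
	c.JudgeConfig("my-judge", nil)
	fmt.Println("judge events:", c.events["$ld:ai:judge:function:single"])
	fmt.Println("config events:", c.events["$ld:ai:config:function:single"])
}
```

With this split, a judge evaluation increments only the judge counter, so the regular config metric reflects regular config evaluations alone.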

}
}

return "", fmt.Errorf("missing evaluationMetricKey")

Unused logger and configKey parameters in getMetricKey

Low Severity

getMetricKey accepts logger and configKey parameters but uses neither in the function body. These were likely intended for deprecation warnings (when falling back to the deprecated evaluationMetricKeys array) and for providing context in error messages, but appear to be remnants from a previous version where the logging loop was removed based on PR review feedback.
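A trimmed signature might look like the sketch below, which takes only the values the body reads. This is an assumption about the function's shape based on the description above (prefer evaluationMetricKey, fall back to the deprecated evaluationMetricKeys array); the parameter names are hypothetical and the real function body is only partially visible in the diff.

```go
package main

import "fmt"

// getMetricKey returns the preferred metric key, falling back to the first
// non-empty entry of the deprecated keys slice. This is a sketch with the
// unused logger and configKey parameters removed, not the SDK's code.
func getMetricKey(metricKey string, deprecatedKeys []string) (string, error) {
	if metricKey != "" {
		return metricKey, nil
	}
	for _, key := range deprecatedKeys {
		if key != "" {
			return key, nil
		}
	}
	return "", fmt.Errorf("missing evaluationMetricKey")
}

func main() {
	key, err := getMetricKey("", []string{"", "legacy-metric"})
	fmt.Println(key, err)

	_, err = getMetricKey("", nil)
	fmt.Println(err)
}
```

If the config key is wanted in error messages, the alternative is to keep that parameter and include it in the returned error instead of dropping it.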
