Add per-input trial_count support to Eval() #1342
Open
dflynn15 wants to merge 1 commit into braintrustdata:main from
Conversation
Allow data items to specify their own trial_count, overriding the global evaluator setting. This enables targeted debugging of flaky test cases and mixed-determinism scenarios without multiplying the entire suite.

- Add an optional `trial_count` field to the `EvalCase` dataclass and TypedDict
- Per-item trial_count takes precedence over the global trial_count
- Items without a trial_count use the global value (or 1 if unset); see the sketch after this list
- Works with both `EvalCase` objects and dict data
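Conceptually, the resolution behaves like the minimal sketch below; the helper name and types are illustrative, not the SDK's actual internals.

```python
# Minimal sketch of the intended precedence; `resolve_trial_count` is a
# hypothetical helper name, not part of the braintrust SDK.
from typing import Any, Optional


def resolve_trial_count(item: dict[str, Any], global_trial_count: Optional[int]) -> int:
    """Per-item trial_count wins; otherwise use the global value, else 1."""
    per_item = item.get("trial_count")
    if per_item is not None:
        return per_item
    return global_trial_count or 1


# Examples of the precedence rules described above.
assert resolve_trial_count({"input": "x", "trial_count": 10}, 3) == 10  # per-item wins
assert resolve_trial_count({"input": "x"}, 3) == 3                      # falls back to global
assert resolve_trial_count({"input": "x"}, None) == 1                   # defaults to 1
```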
Why?
Braintrust's Eval() function supports a trialCount parameter that runs each input multiple times to measure variance in non-deterministic LLM outputs. However, this setting applies globally to all inputs, which creates some friction in certain evaluation workflows. For example:

- **Targeted Debugging is Expensive**: When investigating a single flaky test case, you want to run it 10-20 times to understand the variance pattern. With a global trialCount, this means running your entire suite 10-20 times, multiplying costs and wait time unnecessarily.
- **Mixed Determinism is Common**: Real evaluation suites contain a mix of deterministic scenarios (math problems, factual lookups) and non-deterministic ones (creative writing, open-ended reasoning). Forcing the same trial count on both wastes resources.
- **Cost Scales Linearly**: Every additional trial means another LLM API call. A global trialCount: 5 on a 100-item dataset means 500 API calls, even if only 10 items actually need variance analysis.

To address this, we've created a custom solution that I want to propose as a contribution. Specifically, it allows each data item to specify its own trialCount, overriding the global default. This gives users fine-grained control over where to invest their evaluation budget.

What?
There is a corresponding JS PR open to match this one: #1341
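For illustration, here is a hedged usage sketch of the proposed per-item trial_count in the Python SDK; the project name, task, and scorer are placeholders invented for the example, not part of this PR.

```python
# Usage sketch under the proposed API; the project name, task, and scorer
# below are illustrative placeholders.
from braintrust import Eval


def task(input):
    # Stand-in for a real LLM call.
    return str(input)


def exact_match(input, output, expected):
    return 1.0 if expected is not None and output == expected else 0.0


Eval(
    "per-item-trial-count-demo",
    data=[
        # Deterministic case: inherits the global trial_count below.
        {"input": "2 + 2", "expected": "4"},
        # Flaky case under investigation: run 10 times without rerunning the suite.
        {"input": "Write a haiku about rain", "expected": None, "trial_count": 10},
    ],
    task=task,
    scores=[exact_match],
    trial_count=1,  # global default for items that do not set their own
)
```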