feat: use histogram samples for t-test analysis #155
Open
jdmarshall wants to merge 2 commits into RafaelGSS:main from
Conversation
…s too small. This will help me sort out inconclusive tests without missing misconfigured ones. This is necessitated by the changes in the previous commit that allow for failure instead of forcing success.
RafaelGSS reviewed on Jan 20, 2026
  const suite = new Suite({
-   ttest: true, // Automatically sets repeatSuite=30
+   ttest: true,
+   minSamples: 30, // minSamples x repeatSuite must be > 30

- Enable t-test mode with `ttest: true`. This automatically sets `repeatSuite=30` to collect enough
- independent samples for reliable statistical analysis (per the Central Limit Theorem):
+ Enable t-test mode with `ttest: true`. Requires 30 independent samples for reliable statistical analysis (per the
Owner
This shouldn't be the case. It should be 30 full suites (regardless of samples).
Collaborator
Author
You have asserted several times now that suites are samples and samples are not samples without expanding on why, aside from restating your assertion in essentially the same words.
Why are samples not samples?
I've been over that code and the way in which count and time are calculated is consistent with the notion of 'sample' as I've seen it described in the literature. Where is my error?
By taking a suite as a sample taken over several seconds, you're taking an average of an average, which makes the results of the t-test less accurate.
Justification:
Each sample in the histogram represents a durationPerOp value, calculated by dividing the cumulative time of a certain number of iterations of the function under test by that number of iterations. Each sample is therefore the average execution time per call. opsSec and opsSecPerRun are then averages of the samples, which are themselves averages.
Therefore, using opsSecPerRun as the t-test input inaccurately applies the calculation to an average of averages, when it is meant to be applied to a set of averages totalling a minimum of 30 samples, with 40 preferable.
In other words, it is the histogram entry that represents a valid t-test sample.
When repeatSuite > 1, the additional samples accumulate in the histogram.
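To make the argument concrete, here is a minimal sketch of a t-test run directly on pooled per-sample durations. It is not bench-node's actual code: `durationPerOp`, `poolSamples`, and `welchTTest` are hypothetical helpers, and Welch's variant is assumed only because it does not require equal variances; the PR does not state which variant the library uses.

```javascript
// Hypothetical helpers for illustration; not bench-node's implementation.

// One histogram sample: cumulative batch time divided by iteration count.
function durationPerOp(totalTimeNs, iterations) {
  return totalTimeNs / iterations;
}

// With repeatSuite > 1, samples from every repetition land in one pool.
function poolSamples(runs) {
  return runs.flat();
}

function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t statistic with Welch-Satterthwaite degrees of freedom.
function welchTTest(a, b) {
  const va = sampleVariance(a) / a.length;
  const vb = sampleVariance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```

The inputs to `welchTTest` would be the pooled durationPerOp values of the two benchmarks being compared, rather than one opsSecPerRun value per suite repetition.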
This change also replaces the forced override of the input options with a warning when the sample size is too small. Let the users pick whether they want minSamples: 30 or repeatSuite: 3. The code already had support for omitting the significance data if the low bar is not met.
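A rough sketch of the warn-instead-of-override behavior might look like the following. The function name, the `MIN_TTEST_SAMPLES` constant, and the return shape are all assumptions made for illustration. The README comment in the diff says the product must be "> 30" while the description speaks of a minimum of 30; this sketch treats exactly 30 as sufficient.

```javascript
// Hypothetical check; names and threshold handling are assumptions.
const MIN_TTEST_SAMPLES = 30;

function checkTTestSampleSize({ minSamples, repeatSuite }) {
  // minSamples is a lower bound per repetition, so this is the
  // smallest number of histogram samples the run can produce.
  const total = minSamples * repeatSuite;
  if (total < MIN_TTEST_SAMPLES) {
    return {
      ok: false,
      total,
      warning:
        `t-test needs at least ${MIN_TTEST_SAMPLES} samples, ` +
        `but minSamples x repeatSuite = ${total}`,
    };
  }
  return { ok: true, total, warning: null };
}

// The caller's options are left untouched; only a warning is emitted.
const result = checkTTestSampleSize({ minSamples: 10, repeatSuite: 2 });
if (!result.ok) console.warn(result.warning);
```

When the check fails, the significance data would simply be omitted from the report, which the description says the code already supports.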