Skip to content

Conversation

@hauserx
Copy link
Contributor

@hauserx hauserx commented Dec 20, 2025

Previously set{Diff,Inter,Member} functions were walking through first set and checking whether item exist in second using existsInSet. In total it takes O(n * log(n)) time. Taking advantage of the fact that both arrays are already ordered, in all functions walking through both arrays simultaneously.

Added benchmarks and tested with: ./mill bench.runRegressions bench/resources/sjsonnet_suite/*.jsonnet

Before:

[113] RegressionBenchmark.main   bench/resources/sjsonnet_suite/setDiff.jsonnet  avgt       1.237          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setInter.jsonnet  avgt       0.870          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setUnion.jsonnet  avgt       5.926          ms/op

After:

[113] RegressionBenchmark.main   bench/resources/sjsonnet_suite/setDiff.jsonnet  avgt       0.895          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setInter.jsonnet  avgt       0.722          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setUnion.jsonnet  avgt       1.471          ms/op

Previously set{Diff,Inter,Member} were walking through first set and
checking whether item exist in second using existsInSet, in total taking
`O(n * log(n))` time. Taking advantage of the fact that both arrays are
already ordered, in all functions walking through both arrays
simultaneously.

Tested with: `./mill bench.runRegressions bench/resources/sjsonnet_suite/*.jsonnet`

Before:
```
[113] RegressionBenchmark.main   bench/resources/sjsonnet_suite/setDiff.jsonnet  avgt       1.237          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setInter.jsonnet  avgt       0.870          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setUnion.jsonnet  avgt       5.926          ms/op
```

After:
```
[113] RegressionBenchmark.main   bench/resources/sjsonnet_suite/setDiff.jsonnet  avgt       0.897          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setInter.jsonnet  avgt       0.701          ms/op
[113] RegressionBenchmark.main  bench/resources/sjsonnet_suite/setUnion.jsonnet  avgt       1.520          ms/op
```
@hauserx
Copy link
Contributor Author

hauserx commented Dec 20, 2025

Ah, I see that the go-jsonnet tests contain sets for which keyF creates duplicates. Let me re-check how they are handling it in their implementation.
https://github.com/google/go-jsonnet/blame/10aef6a96ca825c97c87df137a837e39f5df174c/testdata/stdlib_smoke_test.jsonnet#L150

JoshRosen pushed a commit to JoshRosen/sjsonnet that referenced this pull request Dec 21, 2025
This replaces the previous O(n log n) and O(n * log m) implementations
of setUnion, setInter, and setDiff with efficient O(n+m) merge-walk
algorithms that exploit the fact that sets are already sorted.

Changes:
- setUnion: Changed from concat + sort + dedup to merge-walk
  (was O((n+m) log(n+m)), now O(n+m))
- setInter: Changed from iterate + binary search to merge-walk
  (was O(min(n,m) * log(max(n,m))), now O(n+m))
- setDiff: Changed from iterate + binary search to merge-walk
  (was O(n * log m), now O(n+m))
- Added applyKeyF helper function for cleaner key function handling
- Added validateSet calls to setUnion for consistency with other ops
- Pre-sized ArrayBuffer allocations for better memory efficiency

The setMember function still uses binary search which is optimal for
single-element lookups.

These optimizations address the same performance issue as upstream
PR databricks#574, but avoid the O(n²) bug in that PR's uniqArr implementation
(which calls ArrayBuilder.result() on each iteration).

All changes respect the throwErrorForInvalidSets setting.
},
builtinWithDefaults("setUnion", "a" -> null, "b" -> null, "keyF" -> Val.False(dummyPos)) {
(args, pos, ev) =>
val keyF = args(2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only ask I have is that you put this new behavior in setUnion behind a flag.
I'm concerned with code relying on the fact that this function will currently create a set from a non-set.
We will have to look at all our users of std.setUnion - validateSet might save us here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants