Skip to content
This repository was archived by the owner on Oct 28, 2022. It is now read-only.

Conversation

@kocolosk
Copy link
Member

Following the pattern from b64url

* update an incomplete type spec for khash:del/2

* use assertEqual instead of assertMatch to silence a spurious warning

* khash:keys/1 was never implemented; use khash:fold/3 in the test
  suite instead and enable the missing test
@nickva
Copy link
Contributor

nickva commented Nov 26, 2021

A thought I had is that we had created khash before maps were available. With optimized maps available since OTP 18, does khash still perform better than maps? Perhaps the easiest here might be to switch to maps and have one less C dependency.

@kocolosk
Copy link
Member Author

Oh, good idea! Better to not need the extra code at all.

@kocolosk
Copy link
Member Author

I wrote a short Escript to benchmark khash against Erlang maps and attached it to this PR. It's not my finest demonstration of scientific rigor, but it does provide some guidance. General observations:

  • Maps are nearly always faster than khash for lookups, sometimes significantly so
  • Maps are typically faster at inserts when the total dataset is ~100 MB or smaller

Larger datasets exhibit a fair amount of noise; I expect this is largely due to garbage collection. Using maps instead of khash instances inside e.g. the ddoc_cache_lru gen_server might introduce some undesirable GC pauses for environments where the cache size is quite large. I'm not sure how a map-of-maps would behave from a GC perspective.

On the topic of LRUs, I noticed that we have different implements for the authentication cache (ets_lru) and the ddoc cache (which replaced ets_lru with the current implementation in apache/couchdb@939761b7183). Meanwhile, the Erlang Solutions folks seem to have started a new project to create a high-performance caching library at esl/segmented_cache that plays tricks with new OTP features like atomics. It seems like it'd be nice to arrive at a common solution over time there.

@kocolosk
Copy link
Member Author

I was thinking about this a bit more in between various college football games today and remembered the obvious answer for large maps: just use ETS.

I also went back and reviewed the two places where we are using khash in the CouchDB codebase today. The first is the couch_event application, which is responsible for notifying processes about updates that happen to DBs of interest. It mantains Listener -> Db and Db -> Listener mappings in separate khash instances. This would seem to be an ideal case to replace with maps: I believe the churn rate for listeners is relatively low, while lookups can be quite frequent.

The other current use case is in the ddoc_cache application. This one is a bit more subtle. There are two mappings:

  1. Pids: khash(Cache Entry Manager Pid -> Cache Key)
  2. Dbs: khash(DbName -> khash(DDocId -> khash(Cache Key -> Cache Entry Manager Pid)))

That second mapping showcases a nice property of the khash NIF, namely that nesting the hashes allows one to update a "leaf" hash without needing to update the parent hash. Nested maps would have rather different runtime characteristics. Given that I wouldn't feel confident about refactoring ddoc_cache to use maps internally today.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants