Review usage of MapDB in HashDictionary #35

@pstutz

This is the class that's being described here: https://github.com/uzh/triplerush/blob/master/src/main/scala/com/signalcollect/triplerush/dictionary/HashDictionary.scala

The purpose of this class is to offer a mapping of RDF IRIs/literals (think strings with lots of shared prefixes, usually between 20 and 80 characters long) to integers, and from integers back to those same strings. For now we only add to this dictionary and never delete an existing mapping.

We use MapDB to store an in-memory Int -> String mapping. For the String -> Int direction we hash the UTF-8 encoded byte array and use that hash as the integer ID; an explicit String -> Int entry is only stored (in a much smaller, separate MapDB B-tree) when the hash collides with that of an already-stored string.
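
Roughly, the scheme looks like this (a minimal sketch with hypothetical names; plain concurrent maps stand in for the MapDB trees, and MurmurHash3 is just a stand-in for the actual hash function used in HashDictionary):

import java.nio.charset.StandardCharsets
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap
import scala.util.hashing.MurmurHash3

class SketchDictionary {
  // Stands in for the MapDB Int -> String B-tree.
  private val id2String = TrieMap.empty[Int, Array[Byte]]
  // Explicit String -> Int entries, only created on hash collisions.
  private val collisions = TrieMap.empty[String, Int]
  private val nextCollisionId = new AtomicInteger(0)

  private def hash(bytes: Array[Byte]): Int = MurmurHash3.bytesHash(bytes, 0)

  /** Returns the ID for `s`, assigning one if it is not mapped yet. */
  def apply(s: String): Int = synchronized {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    val h = hash(bytes)
    id2String.get(h) match {
      case None =>
        id2String.put(h, bytes); h                     // common case: the hash is the ID
      case Some(stored) if java.util.Arrays.equals(stored, bytes) =>
        h                                              // already stored under its hash
      case Some(_) =>                                   // different string with the same hash
        collisions.getOrElseUpdate(s, {
          var id = nextCollisionId.incrementAndGet()   // find a free ID (sketch only)
          while (id2String.contains(id)) id = nextCollisionId.incrementAndGet()
          id2String.put(id, bytes)
          id
        })
    }
  }

  /** Reverse lookup, ID -> string. */
  def get(id: Int): Option[String] =
    id2String.get(id).map(new String(_, StandardCharsets.UTF_8))
}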

For this reason the Int -> String mapping is the performance-critical part of our usage. We have configured its MapDB B-tree like this:

import java.util.concurrent.Executors
import org.mapdb.DBMaker

val db = DBMaker
  .memoryUnsafeDB                // off-heap in-memory store
  .closeOnJvmShutdown
  .transactionDisable            // no write-ahead log
  .asyncWriteEnable              // writes are queued and flushed by a background thread
  .asyncWriteQueueSize(32768)
  .storeExecutorEnable(Executors.newScheduledThreadPool(math.min(16, Runtime.getRuntime.availableProcessors)))
  .compressionEnable             // compress stored records
  .make

import org.mapdb.{BTreeKeySerializer, Serializer}

val int2String = db.treeMapCreate("int2String")
  .keySerializer(BTreeKeySerializer.INTEGER)  // compact serializer for integer keys
  .valueSerializer(Serializer.BYTE_ARRAY)     // values are the UTF-8 encoded strings
  .nodeSize(32)                               // maximum number of entries per B-tree node
  .makeOrGet[Int, Array[Byte]]()
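
For reference, entries are then added and read through the regular java.util.Map interface (the IRI and ID below are made up):

int2String.put(42, "http://example.org/resource/1".getBytes("UTF-8"))
val bytes = int2String.get(42)           // null if the ID is not mapped
val iri = new String(bytes, "UTF-8")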

Compression is enabled so that the strings, which usually share long prefixes, take up less space. A node size of 32 was chosen as a trade-off between read/write performance and memory usage; it also kept GC churn lower than larger values did.

We are considering switching to the G1 GC and I wonder if it is a good fit for this kind of use case; it should result in less fragmentation than the other parallel collectors. I noticed that when adding entries in parallel at full speed, it can be hard for the GC to keep up.

Is there anything else we can do to reduce the GC pressure created by the intermediate objects?

I noticed that async writes and a pretty large queue size increased the write performance by a lot. What's the trade-off and what are the optimal values?

Do we need to compact the tree to keep the direct memory usage low over time?
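
If compaction does turn out to be necessary, would something along these lines be the intended way to run it periodically? (Sketch only; db is the handle from above and the interval is an arbitrary placeholder.)

import java.util.concurrent.{Executors, TimeUnit}

val compactionExecutor = Executors.newSingleThreadScheduledExecutor()
compactionExecutor.scheduleWithFixedDelay(
  new Runnable { def run(): Unit = db.compact() },  // reclaim space freed inside the store
  10, 10, TimeUnit.MINUTES)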

Are there any other potential improvements?

Existing ideas for improvements:

  • Use Huffman coding for string compression
  • New string/byte[] comparator that compares from the end of the array towards the front, since many of our strings have long shared prefixes (see the sketch after this list)
  • We might be able to speed up loading with DataPump, but it would require significant preprocessing of our data
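
Regarding the comparator idea, here is a rough sketch of what we have in mind (hypothetical code, not yet tied to MapDB's comparator/serializer interfaces): it orders byte arrays by comparing from the last byte towards the first, so keys that share a long prefix diverge after only a few comparisons. Since we only use the tree for exact lookups, the changed sort order should not matter.

import java.util.Comparator

object ReverseBytesComparator extends Comparator[Array[Byte]] {
  override def compare(a: Array[Byte], b: Array[Byte]): Int = {
    var i = a.length - 1
    var j = b.length - 1
    while (i >= 0 && j >= 0) {
      val cmp = (a(i) & 0xff) - (b(j) & 0xff)  // unsigned byte comparison, last bytes first
      if (cmp != 0) return cmp
      i -= 1
      j -= 1
    }
    a.length - b.length                        // the shorter array sorts first on a shared suffix
  }
}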
