Now that Keva is ready, we can begin to replace the various bespoke disk operations in

Hi! I've had a quick look at Keva and I think I get the idea (the concept of a bucket is mainly a performance concern, am I right?). I'm now reading through the simian code to understand what the

Yup, the buckets are similar to the idea of paging: reduce overheads by storing a bunch of things together instead of individually. The main point is that you can store lots and lots of objects by unique key and not have to worry about very long paths, hash collisions, or the like. To do this, you normally define a bucket/page size, and then either have a set number of buckets up front, or use some sort of allocator to manage what goes in each bucket/page. Keva's solution is to split buckets once they get full, using the filesystem and a special key format to recursively find buckets in a way that is unaffected by the splitting.

Simian was (and currently still is) using one directory per node, one file per fingerprint, etc. However, filesystems become inefficient if you have lots and lots of very sparse directories and small files (e.g. at the very least, a file takes up a whole block regardless of how much data is in it). Because we want to support indexing very large numbers of images, this poses a problem. Keva solves that and abstracts it away.

The one exception is thumbnails: Keva doesn't have an efficient way to handle large-ish binary objects, so for now the best we can do is store them separately from the nodes. (This was done in the last commit.) This is still an improvement because then we never have to move them if the nodes change (e.g. during a split operation), but in future the idea is that we don't store thumbnails in the index at all (and instead the user just stores a reference to the original image as metadata when indexing an image).

We currently rely on the thumbnails to generate fingerprints of different sizes; we can get rid of this requirement once we move to DCT-based fingerprints (or we can write a custom resampling algorithm to resize the fingerprints directly, but that seems wasteful when DCT fingerprints are a better solution).
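
For anyone following along, here is a rough in-memory sketch of the split-when-full idea described above. It is not Keva's actual API or key format: `SplittingStore`, `Bucket`, `Branch`, `BUCKET_CAPACITY` and the use of a SHA-256 hex digest as the key are all assumptions made for illustration. A `Bucket` stands in for one file on disk and a `Branch` stands in for a directory.

```python
import hashlib

BUCKET_CAPACITY = 4  # hypothetical; a real store would use a much larger bucket size


class Bucket:
    """Leaf holding a handful of items (stands in for one file on disk)."""

    def __init__(self):
        self.items = {}


class Branch:
    """Internal node (stands in for a directory): one child per hex digit."""

    def __init__(self):
        self.children = {}


def hashed(key):
    # Assumption: keys are normalised to a fixed-length hex digest.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class SplittingStore:
    def __init__(self):
        self.root = Bucket()

    def _find(self, digest):
        """Follow the digest one hex digit per level until a bucket is reached.

        Creates an empty bucket if the branch has no child for that digit yet.
        Returns (parent, digit, bucket, depth).
        """
        parent, digit, node, depth = None, None, self.root, 0
        while isinstance(node, Branch):
            parent, digit = node, digest[depth]
            node = parent.children.setdefault(digit, Bucket())
            depth += 1
        return parent, digit, node, depth

    def put(self, key, value):
        digest = hashed(key)
        parent, digit, bucket, depth = self._find(digest)
        bucket.items[digest] = value
        if len(bucket.items) > BUCKET_CAPACITY:
            self._split(parent, digit, bucket, depth)

    def get(self, key):
        digest = hashed(key)
        _, _, bucket, _ = self._find(digest)
        return bucket.items[digest]  # raises KeyError if the key is absent

    def _split(self, parent, digit, bucket, depth):
        """Replace a full bucket with a branch, redistributing by the next digit."""
        branch = Branch()
        for item_digest, value in bucket.items.items():
            child = branch.children.setdefault(item_digest[depth], Bucket())
            child.items[item_digest] = value
        # If everything landed in the same child, keep splitting one level down.
        for child_digit, child in list(branch.children.items()):
            if len(child.items) > BUCKET_CAPACITY:
                self._split(branch, child_digit, child, depth + 1)
        if parent is None:
            self.root = branch
        else:
            parent.children[digit] = branch


store = SplittingStore()
for i in range(200):
    store.put(f"image-{i}", i)
```

The lookup never needs to know whether or when a split happened: it just follows the digits of the key until it hits a bucket, which is the sense in which finding a bucket is unaffected by splitting.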
This is a temporary step; in future we don’t want to store thumbnails in the index at all, but that will require a new fingerprint format.
We can do everything by fingerprints now and no longer need paths.
Now that the serialisation has been separated from IndexNode, we can begin addressing its various inadequacies, which include:

This work is partly being done as the Keva key-value store library.