Compact Index Format #2

mandykoh · 2017-06-09T09:04:08Z

Now that the serialisation has been separated from IndexNode, we can begin addressing its various inadequacies, which include:

Inefficient use of space (2.4GB of actual data took up 38GB of space in one use case).
Creating large numbers of filesystem objects.
Difficult to extend with caching, thread safety, etc.
Non-handling of hash collisions.
Requiring very long filesystem paths.

This work is partly being done as the Keva key-value store library.

mandykoh · 2017-06-10T12:22:12Z

Now that Keva is ready, we can begin to replace the various bespoke disk operations in DiskIndexStore with Keva operations instead. It probably makes sense to have one Keva store for nodes (and their entries).

gonzalo-bulnes · 2017-06-11T02:09:26Z

Hi! I've had a quick look at Keva and I think I get the idea (the concept of a bucket is mainly a performance concern, am I right?).

I'm now reading through the simian code to understand what the DiskIndexStore is about. I'm taking notes so I can ask you questions, but for now I'm very much in discovery mode. (It's entertaining so far! I'm trying to remember your presentation as I progress.)

mandykoh · 2017-06-11T02:39:45Z

Yup, the buckets are similar to the idea of paging: reduce overheads by storing a bunch of things together instead of individually. The main point is that you can store lots and lots of objects by unique key and not have to worry about very long paths, hash collisions, and the like. To do this, you normally define a bucket/page size, and then either have a set number of buckets up front, or use some sort of allocator to manage what goes in each bucket/page. Keva’s solution is to split buckets once they get full, using the filesystem and a special key format to recursively find buckets in a way that is unaffected by the splitting.

Simian was (and is still currently) using one directory per node, one file per fingerprint, etc. However, filesystems become inefficient if you have lots and lots of very sparse directories and small files (e.g. at the very least, a file takes up a whole block regardless of how much data is in it). Because we want to support indexing very large numbers of images, this poses a problem. Keva solves that and abstracts it away.
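As a rough illustration of the block overhead, the Go sketch below compares storing many small records as one file each against packing them together. The file count, record size, and 4096-byte block size are made-up numbers for the example, not measurements from Simian, and the calculation ignores directory and inode overhead.

```go
package main

import "fmt"

// allocatedBytes returns the space taken by dataBytes of content when the
// filesystem hands out whole blocks of blockBytes each (rounding up).
func allocatedBytes(dataBytes, blockBytes int64) int64 {
	blocks := (dataBytes + blockBytes - 1) / blockBytes
	return blocks * blockBytes
}

func main() {
	// Hypothetical workload: all three values are assumptions.
	const (
		fileCount   int64 = 1000000
		recordBytes int64 = 200
		blockBytes  int64 = 4096
	)

	// One file per record: every record is rounded up to a full block.
	perFile := fileCount * allocatedBytes(recordBytes, blockBytes)

	// Records packed together: only the final block is partly empty.
	packed := allocatedBytes(fileCount*recordBytes, blockBytes)

	fmt.Println("actual data bytes:   ", fileCount*recordBytes)
	fmt.Println("one file per record: ", perFile)
	fmt.Println("packed into buckets: ", packed)
}
```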

The DiskIndexStore is about to change a lot because we’ll remove the custom storage operations from it and use Keva instead.

The one exception is thumbnails: Keva doesn’t have an efficient way to handle large-ish binary objects, so for now the best we can do is store them separately from the nodes. (This was done in the last commit.) This is still an improvement because we then never have to move them if the nodes change (e.g. during a split operation). In future, though, the idea is that we don’t store thumbnails in the index at all; instead, the user just stores a reference to the original image as metadata when indexing an image.

We currently rely on the thumbnails to generate different size fingerprints; we can get rid of this requirement once we move to DCT-based fingerprints (or we can write a custom resampling algorithm to resize the fingerprints directly, but that seems wasteful when DCT fingerprints are a better solution).


mandykoh requested a review from gonzalo-bulnes June 10, 2017 05:01

mandykoh added the wip label Jun 10, 2017

mandykoh force-pushed the compact-index-format branch from 5375902 to 8081d28 June 10, 2017 11:37

mandykoh closed this Jun 10, 2017

mandykoh force-pushed the compact-index-format branch from 8081d28 to 8d38dbf June 10, 2017 11:38

mandykoh reopened this Jun 10, 2017

mandykoh force-pushed the compact-index-format branch from b428b25 to 665994e June 10, 2017 12:16

mandykoh force-pushed the compact-index-format branch 4 times, most recently from e870d0c to 5452e10 June 16, 2017 11:34

mandykoh force-pushed the compact-index-format branch 2 times, most recently from a75c95d to b57c48a June 27, 2017 13:31

mandykoh added 5 commits August 10, 2017 20:35

Introduce Keva store for nodes.

26e3f47

Scope fingerprint tests using sub-tests.

e2bb619

Save thumbnails to a separate location

488383c

This is a temporary step; in future we don’t want to store thumbnails in the index at all, but that will require a new fingerprint format.

Get rid of IndexNodeHandles

73e454a

We can do everything by fingerprints now and no longer need paths.

Make IndexNodes JSON-serialisable.

78da9fd

mandykoh force-pushed the compact-index-format branch from b57c48a to 78da9fd August 10, 2017 10:35

mandykoh added 2 commits August 10, 2017 22:08

Expose store creation errors.

2ea9ef3

Make IndexEntry JSON serialisable.

cf333d5

mandykoh force-pushed the compact-index-format branch from daa6a8c to 557ae91 August 11, 2017 09:37

Use Keva for storing the index.

da4b461

mandykoh force-pushed the compact-index-format branch from 557ae91 to da4b461 August 11, 2017 11:06

mandykoh added 3 commits August 12, 2017 13:47

Update Keva.

b707778

Stop creating legacy nodes directory.

cbe4ee9

Allow attributes to be stored with images.

04117dc

Add more diagnostic logging.

7f24966

mandykoh removed the wip label Aug 15, 2017

Update Keva.

24ddce1

mandykoh merged commit ce43bbf into master Aug 16, 2017

mandykoh deleted the compact-index-format branch August 16, 2017 08:24
