This repository was archived by the owner on Nov 12, 2019. It is now read-only.

Conversation

@mandykoh
Owner

mandykoh commented Jun 9, 2017

Now that the serialisation has been separated from IndexNode, we can begin addressing its various inadequacies, which include:

  • Inefficient use of space (2.4GB of actual data took up 38GB of space in one use case).
  • Creating large numbers of filesystem objects.
  • Difficult to extend with caching, thread safety, etc.
  • No handling of hash collisions.
  • Requiring very long filesystem paths.

This work is partly being done as the Keva key-value store library.

mandykoh requested a review from gonzalo-bulnes June 10, 2017 05:01
mandykoh added the wip label Jun 10, 2017
mandykoh force-pushed the compact-index-format branch from 5375902 to 8081d28 June 10, 2017 11:37
mandykoh closed this Jun 10, 2017
mandykoh force-pushed the compact-index-format branch from 8081d28 to 8d38dbf June 10, 2017 11:38
mandykoh reopened this Jun 10, 2017
mandykoh force-pushed the compact-index-format branch from b428b25 to 665994e June 10, 2017 12:16
@mandykoh
Owner Author

mandykoh commented Jun 10, 2017

Now that Keva is ready, we can begin replacing the various bespoke disk operations in DiskIndexStore with Keva operations. It probably makes sense to have one Keva store for nodes (and their entries).

@gonzalo-bulnes
Collaborator

Hi! I've had a quick look at Keva and I think I get the idea (the concept of a bucket is mainly a performance concern, am I right?).

I'm now reading through the Simian code to understand what the DiskIndexStore is about. I'm taking notes so I can ask you questions, but for now I'm very much in discovery mode. (It's entertaining so far! I'm trying to remember your presentation as I progress.)

@mandykoh
Owner Author

mandykoh commented Jun 11, 2017

Yup, the buckets are similar to the idea of paging: reduce overheads by storing a bunch of things together instead of individually. The main point is that you can store lots and lots of objects by unique key and not have to worry about very long paths, hash collisions, and the like. To do this, you normally define a bucket/page size, and then either have a set number of buckets up front, or use some sort of allocator to manage what goes in each bucket/page. Keva’s solution is to split buckets once they get full, using the filesystem and a special key format to recursively find buckets in a way that does not depend on how the buckets have been split.

Simian was (and currently still is) using one directory per node, one file per fingerprint, etc. However, filesystems become inefficient if you have lots and lots of very sparse directories and small files (e.g. at the very least, a file takes up a whole block regardless of how little data is in it). Because we want to support indexing very large numbers of images, this poses a problem. Keva solves that and abstracts it away.
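
As a rough illustration of the whole-block overhead, the Go sketch below compares storing many small files individually with packing the same bytes together. The 4096-byte block size, the 100-byte file size, and the file count are assumptions picked for the example, not measurements from Simian, and the calculation ignores inode and directory overhead entirely.

```go
package main

import "fmt"

// allocated returns the space a file of the given size occupies when the
// filesystem allocates whole blocks, rounding up to the next block boundary.
// Inode and directory overhead are ignored.
func allocated(size, blockSize int64) int64 {
	if size == 0 {
		return 0
	}
	blocks := (size + blockSize - 1) / blockSize
	return blocks * blockSize
}

func main() {
	const (
		blockSize = 4096    // assumed filesystem block size in bytes
		fileCount = 1000000 // hypothetical number of small files
		fileSize  = 100     // hypothetical size of each file in bytes
	)

	// One file per object: every file pays for at least one full block.
	separate := fileCount * allocated(fileSize, blockSize)

	// Same data packed together: only the final block is partly wasted.
	packed := allocated(fileCount*fileSize, blockSize)

	fmt.Printf("separate files: %d bytes\n", separate)
	fmt.Printf("packed:         %d bytes\n", packed)
	fmt.Printf("ratio:          %.1fx\n", float64(separate)/float64(packed))
}
```

The smaller the objects are relative to the block size, the larger the gap becomes, which is why grouping many small objects into buckets helps.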

The DiskIndexStore is about to change a lot because we’ll remove the custom storage operations from it and use Keva instead.

The one exception is thumbnails: Keva doesn’t have an efficient way to handle large-ish binary objects, so for now the best we can do is store them separately from the nodes. (This was done in the last commit.) This is still an improvement because we never have to move them when the nodes change (e.g. during a split operation), but in future the idea is that we don’t store thumbnails in the index at all; instead, the user just stores a reference to the original image as metadata when indexing an image.

We currently rely on the thumbnails to generate fingerprints of different sizes; we can drop this requirement once we move to DCT-based fingerprints (or we could write a custom resampling algorithm to resize the fingerprints directly, but that seems wasteful when DCT fingerprints are a better solution).

mandykoh force-pushed the compact-index-format branch 4 times, most recently from e870d0c to 5452e10 June 16, 2017 11:34
mandykoh force-pushed the compact-index-format branch 2 times, most recently from a75c95d to b57c48a June 27, 2017 13:31
This is a temporary step; in future we don’t want to store thumbnails in the index at all, but that will require a new fingerprint format.
We can do everything by fingerprints now and no longer need paths.
mandykoh force-pushed the compact-index-format branch from b57c48a to 78da9fd August 10, 2017 10:35
mandykoh force-pushed the compact-index-format branch from daa6a8c to 557ae91 August 11, 2017 09:37
mandykoh force-pushed the compact-index-format branch from 557ae91 to da4b461 August 11, 2017 11:06
mandykoh removed the wip label Aug 15, 2017
mandykoh merged commit ce43bbf into master Aug 16, 2017
mandykoh deleted the compact-index-format branch August 16, 2017 08:24