Bundle multiple allocations into workspaces #1379
Conversation
Force-pushed: f103112 → 93d235d
@davidchisnall Seeing that this will be relevant to your use case, you may want to have a look. It's understood that you'll need a context parameter as well, but this can be a follow-up PR.
Force-pushed: b2df5cf → 652aa43
I don't like the […]. There is also a tradeoff in consolidating the allocations: fewer calls to the allocator will be better for performance, but larger allocations are more likely to fail. I put the performance / stack usage numbers in this post for the Verilator simulator of the slowest CHERIoT core. The signing and verification steps are 4-5M cycles, so a few hundred extra cycles for heap allocation is in the noise, but failing heap allocation is a problem.
Why would the same mechanism not work for entropy failure? I was hoping that we could similarly set an error code (which is currently an unrefined […]). What would your suggestion be?
Yes, that would work. Currently the get-entropy function has a […]
Does this need any co-operation from the application/library? If not, it seems the current […]
oqs-bot left a comment:
SpacemiT K1 8 (Banana Pi F3) benchmarks
| Benchmark suite | Current: 7d16cd7 | Previous: 71773d9 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 155119 cycles | 155354 cycles | 1.00 |
| ML-KEM-512 encaps | 163151 cycles | 163232 cycles | 1.00 |
| ML-KEM-512 decaps | 206149 cycles | 206517 cycles | 1.00 |
| ML-KEM-768 keypair | 261039 cycles | 261154 cycles | 1.00 |
| ML-KEM-768 encaps | 275418 cycles | 275821 cycles | 1.00 |
| ML-KEM-768 decaps | 338056 cycles | 337878 cycles | 1.00 |
| ML-KEM-1024 keypair | 395001 cycles | 395312 cycles | 1.00 |
| ML-KEM-1024 encaps | 421934 cycles | 423303 cycles | 1.00 |
| ML-KEM-1024 decaps | 505554 cycles | 507323 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
I changed the approach to use an explicit error check in the code rather than expecting […]. Also, this is now based on #1389, which doesn't introduce workspaces.
Force-pushed: a876fb7 → 7bbba13
mkannwischer left a comment:
Thanks @hanno-becker, one small nit on the code, otherwise this looks good to me.
Two points remain to be discussed:
- The previous approach would allow a user to allocate different buffers in different memories, e.g., one could allocate the large matrix A on the heap while all other buffers stay on the stack. This is no longer possible with the workspace approach. This may be particularly relevant for embedded systems.
- As @davidchisnall noted here, allocating everything at once is more likely to fail, which is undesirable.
Any thoughts @davidchisnall, @gilles-peskine-arm, @hanno-becker?
Force-pushed: 2c78165 → 793b43d
Unless @davidchisnall @gilles-peskine-arm have concrete use cases in mind where we know that the monolithic allocation will cause problems (?), I would suggest adding workspaces for now and adjusting things on demand. This is internal and can be changed at any time. Especially for mldsa-native, I am hoping that this will also help us reduce the CBMC overhead from dynamic allocation, which is currently blocking proofs.
I very much prefer the monolithic allocation approach, unless there is an identified use case for fragmented allocation or monolithic allocation is too hard to implement. Let's consider the three typical types of memory that might be used.
There may be a reason to use different types of memory for some data. If a system has a small amount of memory that's better protected against snooping, that memory should be prioritized for key-equivalent secret data over message-specific secret data during signing. But I think if you want that level of control, you should be prepared to maintain your own platform-specific cryptography code, or at least your own patches. It's not a level of control that we intend to provide in TF-PSA-Crypto (whereas we would like to give a choice between stack and a user-provided buffer).

In addition, I would prefer to avoid having low-level functions that can fail: ideally some high-level function would reserve the necessary amount of memory (and generate the random data), and then all low-level functions can't fail. One benefit of that is that it reduces the risk of doing error handling wrong, although having a CBMC proof mostly mitigates that risk. Another benefit is the code size: the cost of all […]
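A minimal sketch of the "reserve up front, then nothing below can fail" pattern described above, under clearly made-up names (mlk_sign_workspace, mlk_sign_prepare, mlk_sign_infallible are illustrative, not the actual mlkem-native or TF-PSA-Crypto API):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical workspace type; the real layout would come from the library. */
typedef struct {
    uint8_t randomness[32];
    uint8_t scratch[4096];
} mlk_sign_workspace;

/* The only function that can fail: it acquires memory (and entropy) up front. */
static int mlk_sign_prepare(mlk_sign_workspace **ws_out)
{
    mlk_sign_workspace *ws = malloc(sizeof(*ws));
    if (ws == NULL)
        return -1;               /* allocation failure is the only error path */
    /* ... fill ws->randomness from the RNG here, reporting failure if it fails ... */
    *ws_out = ws;
    return 0;
}

/* All lower-level work uses the pre-reserved memory and cannot fail. */
static void mlk_sign_infallible(mlk_sign_workspace *ws,
                                const uint8_t *msg, size_t msg_len)
{
    (void)msg; (void)msg_len;
    memset(ws->scratch, 0, sizeof(ws->scratch));  /* placeholder for real work */
}

int mlk_sign(const uint8_t *msg, size_t msg_len)
{
    mlk_sign_workspace *ws;
    if (mlk_sign_prepare(&ws) != 0)
        return -1;
    mlk_sign_infallible(ws, msg, msg_len);
    free(ws);
    return 0;
}
```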
```c
  MLK_ALIGN uint8_t buf[2 * MLKEM_SYMBYTES];
  MLK_ALIGN uint8_t kr[2 * MLKEM_SYMBYTES];
  MLK_ALIGN uint8_t tmp[MLKEM_SYMBYTES + MLKEM_INDCCA_CIPHERTEXTBYTES];
} workspace;
```
Could that struct be made public? In the short term, I think TF-PSA-Crypto will just put the workspace on the heap, or maybe give a choice between heap and stack. But in the longer term, we'd like the option of making it a global buffer, reserved at static link time.
Could that not be achieved with a custom MLK_ALLOC which uses a static declaration?
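To illustrate that suggestion: the sketch below backs the allocation hook with a static buffer. The exact names and signatures of the MLK_ALLOC/MLK_FREE override points are assumptions here (a size-taking alloc macro and a pointer-taking free macro), so treat this as the idea only, not the real hook API.

```c
#include <stddef.h>
#include <stdint.h>

#define MLK_STATIC_WORKSPACE_BYTES (64 * 1024)   /* made-up size */

/* Single static workspace, reserved at link time (GCC/Clang alignment syntax). */
static uint8_t mlk_static_workspace[MLK_STATIC_WORKSPACE_BYTES]
    __attribute__((aligned(16)));

static void *mlk_static_alloc(size_t len)
{
    /* Assumes only one workspace is live at a time. */
    return (len <= sizeof(mlk_static_workspace)) ? mlk_static_workspace : NULL;
}

static void mlk_static_free(void *p, size_t len)
{
    (void)p; (void)len;   /* nothing to release for a static buffer */
}

/* Hypothetical override points: */
#define MLK_ALLOC(len)   mlk_static_alloc(len)
#define MLK_FREE(p, len) mlk_static_free((p), (len))
```

If the buffer must be shared between threads, the alloc/free hooks could also take and release a mutex, as suggested in the next comment.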
Good point, that does work for a global buffer that's reserved for MLKEM, and I think it also allows protecting that buffer with a mutex (alloc = lock and free = unlock).
But I don't think it allows what I was actually thinking of, which is a buffer that can be used for different things, such as using the same buffer for MLKEM and MLDSA (but only one at a time).
(But to be clear, I'd prefer having the option to use a heap buffer quickly, and think about fancier stuff in version 2, than to spend a lot of time now to try to cover more use cases.)
> But I don't think it allows what I was actually thinking of, which is a buffer that can be used for different things, such as using the same buffer for MLKEM and MLDSA (but only one at a time).
Got it. It feels wrong, though, that you would need to know the (internal) workspace structure for that. Would a global bump allocator not be enough for that purpose? This is what's already being tested [without workspaces] in test/test_alloc.c.
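For reference, a generic bump-allocator sketch of that idea. This is not the code in test/test_alloc.c; the arena size, names, and peak-tracking counter are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define BUMP_ARENA_BYTES (32 * 1024)   /* made-up arena size */

static uint8_t bump_arena[BUMP_ARENA_BYTES] __attribute__((aligned(16)));
static size_t bump_used = 0;
static size_t bump_peak = 0;   /* records peak usage, useful for sizing constants */

static void *bump_alloc(size_t len)
{
    /* Round up so subsequent allocations stay 16-byte aligned. */
    size_t aligned = (len + 15) & ~(size_t)15;
    if (aligned > BUMP_ARENA_BYTES - bump_used)
        return NULL;                       /* arena exhausted */
    void *p = &bump_arena[bump_used];
    bump_used += aligned;
    if (bump_used > bump_peak)
        bump_peak = bump_used;
    return p;
}

static void bump_reset(void)
{
    bump_used = 0;   /* release everything at once, e.g. after one key operation */
}
```

The same arena can then be reused for MLKEM and MLDSA operations, as long as only one of them runs at a time and bump_reset() is called in between.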
Something like this #1442?
I think independently we need to move all allocs to the top level, i.e., make sure that there is only a single invocation of MLK_ALLOC, right?
Having multiple workspaces isn't a problem if I just want one global workspace, as long as the function call structure is known. In the simplest case, if all the function calls are nested (and there's no recursion, obviously), I can just add their sizes for each of the entry points that I use, at the integration level just above mlkem-native. For that aspect, I only want help from mlkem-native if there are parts that never overlap: e.g., if entry_point() calls foo() and bar() successively and each of the three has its own workspace, then I'd like mlkem-native to help me figure out that the workspace size for entry_point() is sizeof(entry_point_struct) + max(sizeof(foo_struct), sizeof(bar_struct)), and not the sum of all three.
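To make that sizing rule concrete: assuming entry_point() calls foo() and bar() one after the other (never concurrently), a hypothetical integration could bound its global workspace as below. The struct names and sizes are the illustrative ones from the comment above, not real mlkem-native types.

```c
#include <stddef.h>

/* Illustrative per-function workspace types. */
typedef struct { unsigned char a[1024]; } entry_point_struct;
typedef struct { unsigned char b[4096]; } foo_struct;
typedef struct { unsigned char c[2048]; } bar_struct;

#define MAX(x, y) ((x) > (y) ? (x) : (y))

/* foo() and bar() run sequentially inside entry_point(), so their workspaces
 * can share the same region; only the larger of the two must be reserved. */
#define ENTRY_POINT_WORKSPACE_BYTES \
    (sizeof(entry_point_struct) + MAX(sizeof(foo_struct), sizeof(bar_struct)))

static unsigned char global_workspace[ENTRY_POINT_WORKSPACE_BYTES];
```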
Having a single top-level allocation would make it easy to define an upper bound for the required memory. But we can also simply measure the total memory usage in the bump allocation test (we already do that, actually), set the global constants accordingly, and then size the bump allocator according to those constants, so we notice automatically when they get out of date. That may, for now, be a more pragmatic solution.
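As a sketch of how such a measured constant could be kept from silently going stale: export it from a header and have the bump-allocator test fail when the observed peak exceeds it. The constant and function names are hypothetical, and bump_peak refers to the peak counter from the bump-allocator sketch earlier.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical exported constant, set from a previous measurement run. */
#define MLK_KEYGEN_WORKSPACE_BYTES 16384

/* In the bump-allocator test, after running key generation: */
void check_keygen_memory(void)
{
    extern size_t bump_peak;   /* peak usage recorded by the test allocator */
    if (bump_peak > MLK_KEYGEN_WORKSPACE_BYTES) {
        /* The advertised constant is out of date; fail the test. */
        fprintf(stderr, "keygen used %zu bytes, constant says %d\n",
                bump_peak, MLK_KEYGEN_WORKSPACE_BYTES);
        exit(1);
    }
}
```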
@gilles-peskine-arm Do you have concerns with a 'simple-minded' approach of simply measuring the bump allocator's memory usage and providing defined constants for that?
Seems fine to me.
Thanks @gilles-peskine-arm for this extensive comment. Makes sense to me.
Force-pushed: 793b43d → cce203f
mkannwischer left a comment:
Thanks @hanno-becker. LGTM.
Force-pushed: cce203f → 4cbb5f7
Force-pushed: b0b7a86 → c37d1ea
This commit introduces 'workspace structures', which are function-local structures wrapping all objects and buffers utilized by the function into a single struct. The benefit of the workspace struct is simpler allocation boilerplate: instead of allocating multiple structures and separately handling their success/failure, we now only need to allocate a single object per function. Signed-off-by: Hanno Becker <beckphan@amazon.co.uk>
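To illustrate what the commit message describes, here is a simplified, hypothetical before/after. The function names are invented, plain malloc/free stands in for the library's allocation hooks, and MLKEM_SYMBYTES/MLK_ALIGN are stubbed out for the sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-ins; the real definitions live in mlkem-native headers. */
#define MLKEM_SYMBYTES 32
#define MLK_ALIGN /* alignment attribute elided in this sketch */

/* Before: two allocations, and failure handling must cover partial success. */
static int crypto_op_separate(void)
{
    uint8_t *buf = malloc(2 * MLKEM_SYMBYTES);
    uint8_t *kr  = malloc(2 * MLKEM_SYMBYTES);
    if (buf == NULL || kr == NULL) {
        free(buf);   /* free(NULL) is a no-op, so this covers all combinations */
        free(kr);
        return -1;
    }
    /* ... use buf and kr ... */
    free(buf);
    free(kr);
    return 0;
}

/* After: one workspace struct, one allocation, one failure check. */
static int crypto_op_workspace(void)
{
    struct {
        MLK_ALIGN uint8_t buf[2 * MLKEM_SYMBYTES];
        MLK_ALIGN uint8_t kr[2 * MLKEM_SYMBYTES];
    } *ws = malloc(sizeof(*ws));
    if (ws == NULL)
        return -1;
    /* ... use ws->buf and ws->kr ... */
    free(ws);
    return 0;
}
```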
Force-pushed: 7840809 → 76a56c6
This really isn't the case for embedded targets. Heap sizes of well under 1 MiB are almost universal, and under 128 KiB is common. In a long-running application, there's a massive difference between the likelihood of being able to satisfy a single 60 KiB allocation and that of satisfying two 30 KiB ones. A single 60 KiB allocation is likely to be close to half of the total heap size, which means that you can't allocate it even with a small amount of existing fragmentation.
Thank you @davidchisnall for chiming back in (and happy new year)! I am open to reconsidering this. On mldsa-native, the hope was that fewer allocations would reduce CBMC proof complexity, but so far this hope hasn't materialized (and we have not yet introduced workspaces there). It might be more effective to have a wrapper doing the allocation and passing a myriad of pointers to a function doing the actual work. That's harder to read, but it should contain the proof complexity (the visual clutter added to the CBMC specs might be contained with something like pq-code-package/mldsa-native#817). Also, @gilles-peskine-arm's request of having a public constant indicating the memory usage can be achieved without workspaces (cf. #1379 (comment)).
@mkannwischer @gilles-peskine-arm Please chime in. |
When I was an embedded OS architect, the systems I worked on reserved a static (or startup-time) amount of memory for critical systems such as cryptography, so that we knew in advance that crypto wouldn't fail midway due to lack of resources. This was distinct from the heap, which was used in less critical components where failing midway was acceptable.

Now, as a crypto library maintainer (TF-PSA-Crypto), I have known users who want to reserve a static area for ML-DSA verification, because the system's stack limit is too small and the verifier has no heap. I also have known users who have a heap, including resource-constrained platforms where a few kilobytes can make a difference; but many of these users use the heap implementation that comes with our crypto library because they don't use a heap for anything else, and they would prefer if our crypto library didn't require a heap at all. (Not an option for our RSA code or most of our ECC code, but a long-term goal for us.) I also have known users who have a large heap and are concerned about how many thousands of concurrent connections they can establish, but this kind of high-end application is less of a priority for us.

So, in my experience, the case of a heap that's just big enough if you fragment the allocation is not very important; it's not something that TF-PSA-Crypto intends to cater for. However, it's a vast world, and if there are environments where this is desirable, I certainly won't argue that mldsa-native should not provide this capability. That's not my call to make.
#1389 plus consolidation of multiple allocations into workspaces.