Conversation

@hanno-becker
Contributor

@hanno-becker hanno-becker commented Dec 14, 2025

#1389 plus consolidation of multiple allocations into workspaces.

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 10 times, most recently from f103112 to 93d235d Compare December 15, 2025 05:40
@hanno-becker hanno-becker marked this pull request as ready for review December 15, 2025 05:41
@hanno-becker hanno-becker requested a review from a team as a code owner December 15, 2025 05:41
@hanno-becker hanno-becker marked this pull request as draft December 15, 2025 05:41
@hanno-becker
Contributor Author

@davidchisnall Seeing that this will be relevant to your use case, you may want to have a look. It's understood that you'll need a context parameter as well, but this can be a follow-up PR.

@hanno-becker hanno-becker marked this pull request as ready for review December 15, 2025 05:54
@hanno-becker hanno-becker marked this pull request as draft December 15, 2025 05:57
@hanno-becker
Contributor Author

hanno-becker commented Dec 15, 2025

Still needs work since CBMC can't yet handle the consolidation into workspaces. Done in #1388

Strictly speaking, the introduction of workspace structs could be deferred, since for now we test the custom heap-based allocation with C11 only and it's not part of the main source. Trialed this in #1389

@davidchisnall

I don't like the goto cleanup pattern, because it isn't something I can use for other kinds of error handling. That might be something that would be possible to fix in a later commit but, for example, we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow). We'd also like to be able to use the same mechanism to report that getting entropy failed (currently it's assumed to succeed, which means that the system fails open).

There is also a tradeoff in consolidating the allocations. Fewer calls to the allocator will be better for performance, but larger allocations are more likely to fail. I put the performance / stack usage numbers in this post for the Verilator simulator of the slowest CHERIoT core. The signing and verification steps are 4-5M cycles, so a few hundred extra cycles for heap allocation is in the noise, but failing heap allocation is a problem.

@hanno-becker
Contributor Author

I don't like the goto cleanup pattern, because it isn't something I can use for other kinds of error handling. That might be something that would be possible to fix in a later commit but, for example, we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow). We'd also like to be able to use the same mechanism to report that getting entropy failed (currently it's assumed to succeed, which means that the system fails open).

Why would the same mechanism not work for entropy failure? I was hoping that we could similarly set an error code (which is currently an unrefined -1 but which I hope to split up in v2) and jump to cleanup upon RNG failure.

What would your suggestion be?

@davidchisnall

Why would the same mechanism not work for entropy failure? I was hoping that we could similarly set an error code (which is currently an unrefined -1 but which I hope to split up in v2) and jump to cleanup upon RNG failure.

Yes, that would work. Currently the get-entropy function has a void return and assumes success. If allocation is in the same function as an entropy call, then it could also goto the same cleanup point.
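
For illustration, a minimal sketch of that pattern, with allocation failure and entropy failure sharing one cleanup label. The function and hook names (crypto_op, get_entropy) are placeholders, not the library's actual API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entropy hook; the real library's hook may differ. */
extern int get_entropy(unsigned char *out, size_t len);

int crypto_op(void)
{
  int ret = -1;
  unsigned char *buf = NULL;
  unsigned char seed[32];

  buf = malloc(1024); /* or whatever allocation hook is configured */
  if (buf == NULL)
  {
    goto cleanup; /* allocation failed */
  }

  if (get_entropy(seed, sizeof(seed)) != 0)
  {
    goto cleanup; /* entropy failed: same exit path */
  }

  /* ... actual work using buf and seed ... */
  ret = 0;

cleanup:
  free(buf); /* free(NULL) is a no-op */
  return ret;
}
```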

@hanno-becker
Contributor Author

we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow)

Does this need any co-operation from the application/library?

If not, it seems the current goto cleanup approach might be OK, if we also use it for other errors such as RNG failure.

@hanno-becker hanno-becker changed the title Make allocation of large structures/buffers configurable Make allocation of large structures/buffers configurable (with workspaces) Dec 16, 2025

@oqs-bot oqs-bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks

Benchmark suite       | Current: 7d16cd7 | Previous: 71773d9 | Ratio
ML-KEM-512 keypair    | 155119 cycles    | 155354 cycles     | 1.00
ML-KEM-512 encaps     | 163151 cycles    | 163232 cycles     | 1.00
ML-KEM-512 decaps     | 206149 cycles    | 206517 cycles     | 1.00
ML-KEM-768 keypair    | 261039 cycles    | 261154 cycles     | 1.00
ML-KEM-768 encaps     | 275418 cycles    | 275821 cycles     | 1.00
ML-KEM-768 decaps     | 338056 cycles    | 337878 cycles     | 1.00
ML-KEM-1024 keypair   | 395001 cycles    | 395312 cycles     | 1.00
ML-KEM-1024 encaps    | 421934 cycles    | 423303 cycles     | 1.00
ML-KEM-1024 decaps    | 505554 cycles    | 507323 cycles     | 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker
Contributor Author

hanno-becker commented Dec 16, 2025

I changed the approach to use an explicit error check in the code rather than expecting MLK_CUSTOM_ALLOC to do this. This makes it easier to switch between stack and heap allocation without running into C90 errors because of mixed code and declarations. It also gives us control over the error code, while the user-provided MLK_CUSTOM_ALLOC can follow the standard convention that failure is reported through a NULL return value.

Also, this is now based on #1389 which doesn't introduce workspaces.
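
As a sketch of the resulting split of responsibilities (the macro names MLK_CUSTOM_ALLOC/MLK_CUSTOM_FREE are taken from this discussion, but the signatures and the error handling shown are assumptions, not the PR's code):

```c
#include <stddef.h>

/* User-provided allocator: reports failure via a NULL return, like malloc. */
void *my_heap_alloc(size_t len);
void my_heap_free(void *p, size_t len);

#define MLK_CUSTOM_ALLOC(len) my_heap_alloc((len))
#define MLK_CUSTOM_FREE(p, len) my_heap_free((p), (len))

/* The library code then performs the explicit error check itself:
 *
 *   ws = MLK_CUSTOM_ALLOC(sizeof(*ws));
 *   if (ws == NULL)
 *   {
 *     ret = -1;           // library-chosen error code
 *     goto cleanup;
 *   }
 */
```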

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 3 times, most recently from a876fb7 to 7bbba13 Compare December 21, 2025 19:20
@hanno-becker hanno-becker changed the title Make allocation of large structures/buffers configurable (with workspaces) Bundle multiple allocations into workspaces Dec 21, 2025
Contributor

@mkannwischer mkannwischer left a comment

Thanks @hanno-becker, one small nit on the code, otherwise this looks good to me.

Two points remain to be discussed:

  1. The previous approach would allow a user to allocate different buffers in different memories, e.g., one could allocate the large matrix A on the heap while all other buffers stay on the stack. This is no longer possible with the workspace approach. This may be particularly relevant for embedded systems.
  2. As @davidchisnall noted here, allocating everything at once is more likely to fail, which is undesirable.

Any thoughts @davidchisnall, @gilles-peskine-arm, @hanno-becker?

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 2 times, most recently from 2c78165 to 793b43d Compare December 29, 2025 19:25
@hanno-becker
Contributor Author

hanno-becker commented Dec 30, 2025

Unless @davidchisnall @gilles-peskine-arm have concrete use cases in mind where we know that the monolithic allocation will cause problems, I would suggest adding workspaces for now and adjusting things on demand. This is internal and can be changed at any time. Especially for mldsa-native, I am hoping that this will also help us reduce the CBMC overhead from dynamic allocation which is currently blocking proofs.

@gilles-peskine-arm

I very much prefer the monolithic allocation approach, unless there is an identified use case for fragmented allocation or monolithic allocation is too hard to implement.

Let's consider the three typical types of memory that might be used.

  • Stack: if everything is on the stack, then monolithic or fragmented doesn't matter if the monolithic workspace is implemented optimally, meaning that it doesn't use more space than the stack would. E.g. a monolithic workspace can waste memory if it ends up being struct {a; b; c;} and you know that only two of a, b and c can ever be active at the same time. I would guess that this doesn't happen with crypto that is constant-time, because that tends to happen when you have code paths that depend on the data. But I'm not familiar enough with MLKEM and MLDSA to know if it's a problem.
  • Heap: a monolithic allocation saves a bit of RAM overhead, but can be problematic if your heap is already fragmented. I think that if fragmenting the allocations is the only way your application can run, you're already too tight on memory and even a small change in the order of operations, or just letting your application run a little longer, would cause heap exhaustion even with smaller fragments.
  • Global: if I have to reserve some global area for a costly operation, I'll just designate one area (or maybe one area per CPU core). I can't see any advantage for fragmentation.

There may be a reason to use different types of memory for some data. If a system has a small amount of memory that's better protected against snooping, that memory should be prioritized for key-equivalent secret data over message-specific secret data during signing. But I think if you want that level of control, you should be prepared to maintain your own platform-specific cryptography code, or at least your own patches. It's not a level of control that we intend to provide in TF-PSA-Crypto (whereas we would like to give a choice between stack and a user-provided buffer).

In addition, I would prefer to avoid having low-level functions that can fail: ideally some high-level function would reserve the necessary amount of memory (and generate the random data), and then all low-level functions can't fail. One benefit of that is that it reduces the risk of doing error handling wrong, although having a CBMC proof mostly mitigates that risk. Another benefit is the code size: the cost of all if (failed) goto cleanup adds up.
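
For illustration, a minimal sketch of that shape, with purely illustrative names: the entry point performs the single allocation (and would also draw randomness there), so the inner steps cannot fail:

```c
#include <stdlib.h>

/* Illustrative workspace; the real layout lives inside the library. */
typedef struct
{
  unsigned char buf[64];
  unsigned char kr[64];
} example_workspace;

/* Inner step: takes pre-allocated memory and cannot fail. */
static void inner_step(example_workspace *ws)
{
  (void)ws;
  /* ... deterministic, allocation-free work ... */
}

/* Entry point: the only place that allocates, and hence the only
 * place that can fail. */
int entry_point(void)
{
  example_workspace *ws = malloc(sizeof(*ws));
  if (ws == NULL)
  {
    return -1;
  }
  inner_step(ws);
  free(ws);
  return 0;
}
```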

MLK_ALIGN uint8_t buf[2 * MLKEM_SYMBYTES];
MLK_ALIGN uint8_t kr[2 * MLKEM_SYMBYTES];
MLK_ALIGN uint8_t tmp[MLKEM_SYMBYTES + MLKEM_INDCCA_CIPHERTEXTBYTES];
} workspace;

Could that struct be made public? In the short term, I think TF-PSA-Crypto will just put the workspace on the heap, or maybe give a choice between heap and stack. But in the longer term, we'd like the option of making it a global buffer, reserved at static link time.

Contributor Author

Could that not be achieved with a custom MLK_ALLOC which uses a static declaration?

Good point, that does work for a global buffer that's reserved for MLKEM, and I think it also allows protecting that buffer with a mutex (alloc = lock and free = unlock).

But I don't think it allows what I was actually thinking of, which is a buffer that can be used for different things, such as using the same buffer for MLKEM and MLDSA (but only one at a time).
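
A sketch of the first idea with hypothetical names: a statically reserved buffer handed out under a lock (alloc = lock, free = unlock), assuming at most one outstanding allocation at a time; the two-argument free signature is an assumption, not the library's actual hook:

```c
#include <pthread.h>
#include <stddef.h>

/* One statically reserved area, serialized by a lock. */
static unsigned char mlkem_arena[16384]; /* size for illustration only */
static pthread_mutex_t mlkem_arena_lock = PTHREAD_MUTEX_INITIALIZER;

static void *static_alloc(size_t len)
{
  if (len > sizeof(mlkem_arena))
  {
    return NULL; /* caller reports the error and jumps to cleanup */
  }
  pthread_mutex_lock(&mlkem_arena_lock); /* alloc = lock */
  return mlkem_arena;
}

static void static_free(void *p, size_t len)
{
  (void)p;
  (void)len;
  pthread_mutex_unlock(&mlkem_arena_lock); /* free = unlock */
}

#define MLK_ALLOC(len) static_alloc((len))
#define MLK_FREE(p, len) static_free((p), (len))
```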

(But to be clear, I'd prefer having the option to use a heap buffer quickly, and think about fancier stuff in version 2, than to spend a lot of time now to try to cover more use cases.)

Contributor Author

@hanno-becker hanno-becker Dec 31, 2025

But I don't think it allows what I was actually thinking of, which is a buffer that can be used for different things, such as using the same buffer for MLKEM and MLDSA (but only one at a time).

Got it. It feels wrong, though, that you would need to know the (internal) workspace structure for that. Would a global bump allocator not be enough for that purpose? This is what's already being tested [without workspaces] in test/test_alloc.c.
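
For reference, a generic bump-allocator sketch of the kind being discussed (this is not the code in test/test_alloc.c; names and sizes are made up):

```c
#include <stddef.h>

/* Shared arena, e.g. used by ML-KEM and ML-DSA, but only one at a time. */
static unsigned char arena[32768]; /* size for illustration only */
static size_t arena_top = 0;

static void *bump_alloc(size_t len)
{
  /* Round up to 8-byte alignment (a real implementation would also make
   * sure the arena itself is suitably aligned). */
  size_t need = (len + 7u) & ~(size_t)7u;
  void *p;

  if (need > sizeof(arena) - arena_top)
  {
    return NULL;
  }
  p = &arena[arena_top];
  arena_top += need;
  return p;
}

/* Reset once the whole operation is finished; individual frees are no-ops. */
static void bump_reset(void)
{
  arena_top = 0;
}
```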

Contributor

Something like this #1442?

I think independently we need to move all allocs to the top level, i.e., make sure that there is only a single invocation of MLK_ALLOC, right?

Having multiple workspaces isn't a problem if I just want one global workspace, as long as the function call structure is known. In the simplest case, if all the function calls are nested (and there's no recursion, obviously), I can just add their sizes for each of the entry points that I use, at the integration level just above mlkem-native. For that aspect, I only want help from mlkem-native if there are parts that never overlap, e.g. if entry_point() calls foo() and bar() successively and all three have their own workspace, then I'd like mlkem-native to help me figure out that the workspace size for entry_point() is sizeof(entry_point_struct) + max(sizeof(foo_struct), sizeof(bar_struct)), and not the sum of all three.
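
A sketch of that size arithmetic, with made-up struct names and sizes:

```c
#include <stddef.h>

/* Illustrative per-function workspaces; names and sizes are invented. */
typedef struct { unsigned char a[512]; } foo_ws;
typedef struct { unsigned char b[768]; } bar_ws;
typedef struct { unsigned char c[256]; } entry_point_ws;

#define WS_MAX(x, y) ((x) > (y) ? (x) : (y))

/* foo() and bar() run one after the other, never nested, so their
 * workspaces can occupy the same region below entry_point()'s own. */
#define ENTRY_POINT_WS_BYTES \
  (sizeof(entry_point_ws) + WS_MAX(sizeof(foo_ws), sizeof(bar_ws)))

/* An integrator can then reserve a single static area of that size. */
static unsigned char entry_point_arena[ENTRY_POINT_WS_BYTES];
```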

Contributor Author

@hanno-becker hanno-becker Dec 31, 2025

Having a single top-level allocation would make it easy to define an upper bound for the required memory. But we can also simply measure the total memory usage in the bump allocation test (we already do that, actually), set the global constants accordingly, and then size the bump allocator against those constants, so we notice automatically when they get out of date. That may, for now, be a more pragmatic solution.
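
A sketch of that approach, with a hypothetical constant name and a placeholder value (the real value would come from the measured high-water mark):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical public constant, e.g. in a config/api header. */
#define MLKEM1024_KEYPAIR_WS_BYTES ((size_t)44000) /* placeholder value */

/* Provided by the bump allocator used in the test harness. */
extern size_t bump_high_water_mark(void);

static void check_keypair_memory(void)
{
  /* ... run key generation through the bump allocator ... */
  assert(bump_high_water_mark() <= MLKEM1024_KEYPAIR_WS_BYTES);
}
```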

Contributor Author

@gilles-peskine-arm Do you have concerns with a 'simple-minded' approach of simply measuring the bump allocator's memory usage and providing defined constants for that?

Seems fine to me.

@mkannwischer
Contributor

Thanks @gilles-peskine-arm for this extensive comment. Makes sense to me.
In that case we'll go ahead and merge the workspaces in mlkem-native now and also try to get that to work in mldsa-native.

Contributor

@mkannwischer mkannwischer left a comment

Thanks @hanno-becker. LGTM.

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 2 times, most recently from b0b7a86 to c37d1ea Compare January 1, 2026 17:34
This commit introduces 'workspace structures', which are
function-local structures wrapping all objects and buffers
used by the function into a single struct.

The benefit of the workspace struct is simpler allocation
boilerplate: instead of allocating multiple structures and
handling their success/failure separately, we now only need
to allocate a single object per function.

Signed-off-by: Hanno Becker <beckphan@amazon.co.uk>
@hanno-becker hanno-becker force-pushed the alloc_free_config branch 3 times, most recently from 7840809 to 76a56c6 Compare January 1, 2026 20:18
@hanno-becker hanno-becker merged commit e379df7 into main Jan 2, 2026
779 checks passed
@hanno-becker hanno-becker deleted the alloc_free_config branch January 2, 2026 03:55
@davidchisnall

Heap: a monolithic allocation saves a bit of RAM overhead, but can be problematic if your heap is already fragmented. I think that if fragmenting the allocations is the only way your application can run, you're already too tight on memory and even a small change in the order of operations, or just letting your application run a little longer, would cause heap exhaustion even with smaller fragments.

This really isn't the case for embedded targets. Heap sizes of well under 1 MiB are almost universal, and under 128 KiB is common. There's a massive difference in the likelihood of being able to satisfy a single 60 KiB allocation versus two 30 KiB ones in a long-running application. A single 60 KiB allocation is likely to be close to half of the total heap size, which means that you can't allocate it even with a small amount of existing fragmentation.

@hanno-becker
Contributor Author

hanno-becker commented Jan 5, 2026

Thank you @davidchisnall for chiming back in (and happy new year)!

I am open to reconsidering this. On mldsa-native, the hope was that fewer allocations would reduce CBMC proof complexity, but so far this hope hasn't materialized (and we have not yet introduced workspaces there). It might be more effective to have a wrapper doing the allocation and passing a myriad of pointers to a function doing the actual work. That's harder to read, but it should keep the proof complexity in check (the visual clutter added to the CBMC specs might be contained with something like pq-code-package/mldsa-native#817).

Also, @gilles-peskine-arm's request for a public constant indicating the memory usage can be achieved without workspaces (cf #1379 (comment)).

@hanno-becker
Contributor Author

@mkannwischer @gilles-peskine-arm Please chime in.

@gilles-peskine-arm

This really isn't the case for embedded targets. Heap sizes of well under 1 MiB are almost universal and under 128 KiB is common. There's a massive difference in the likelihood of being able to satisfy a single 60 KiB allocation and two 30 KiB ones in a long-running application. A single 60 KiB allocation is likely to be close to half of the total heap size, which means that you can't allocate it even with a small amount of existing fragmentation.

Speaking as a former embedded OS architect: the systems I worked on reserved a static (or startup-time) amount of memory for critical systems such as cryptography, so that we knew in advance that crypto wouldn't fail midway due to lack of resources. This was distinct from the heap, which was used in less critical components where failing midway was acceptable.

Now, as a crypto library maintainer (TF-PSA-Crypto), I have known users who want to reserve a static area for ML-DSA verification, because the system's stack limit is too small, and the verifier has no heap. I also have known users who have a heap, including resource-constrained platforms where a few kilobytes can make a difference, but many of these users use the heap implementation that comes with our crypto library because they don't use a heap for anything else, and they would prefer if our crypto library didn't require a heap at all. (Not an option for our RSA code or most of our ECC code, but a long-term goal for us.) I also have known users who have a large heap and are concerned about how many thousands of concurrent connections they can establish, but this kind of high-end application is less of a priority for us.

So, in my experience, the case of a heap that's just big enough if you fragment the allocation is not very important; it's not something that TF-PSA-Crypto intends to cater for. However, it's a vast world, and if there are environments where this is desirable, I certainly won't argue that mldsa-native should not provide this capability. That's not my call to make.
