Conversation

@hanno-becker
Contributor

@hanno-becker hanno-becker commented Dec 14, 2025

#1389 plus consolidation of multiple allocations into workspaces.

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 10 times, most recently from f103112 to 93d235d Compare December 15, 2025 05:40
@hanno-becker hanno-becker marked this pull request as ready for review December 15, 2025 05:41
@hanno-becker hanno-becker requested a review from a team as a code owner December 15, 2025 05:41
@hanno-becker hanno-becker marked this pull request as draft December 15, 2025 05:41
@hanno-becker
Contributor Author

@davidchisnall Seeing that this will be relevant to your use case, you may want to have a look. It's understood that you'll need a context parameter as well, but this can be a follow-up PR.

@hanno-becker hanno-becker marked this pull request as ready for review December 15, 2025 05:54
@hanno-becker hanno-becker marked this pull request as draft December 15, 2025 05:57
@hanno-becker
Contributor Author

hanno-becker commented Dec 15, 2025

Still needs work since CBMC can't yet handle the consolidation into workspaces. Done in #1388

Strictly speaking, the introduction of workspace structs could be deferred, since for now we test the custom heap-based allocation with C11 only and it's not part of the main source. Trialed this in #1389

@davidchisnall

I don't like the goto cleanup pattern, because it isn't something I can use for other kinds of error handling. That might be something that would be possible to fix in a later commit but, for example, we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow). We'd also like to be able to use the same mechanism to report that getting entropy failed (currently it's assumed to succeed, which means that the system fails open).

There is also a tradeoff in consolidating the allocations. Fewer calls to the allocator will be better for performance, but larger allocations are more likely to fail. I put the performance / stack usage numbers in this post for the Verilator simulator of the slowest CHERIoT core. The signing and verification steps are 4-5M cycles, so a few hundred extra cycles for heap allocation is in the noise, but failing heap allocation is a problem.

@hanno-becker
Contributor Author

I don't like the goto cleanup pattern, because it isn't something I can use for other kinds of error handling. That might be something that would be possible to fix in a later commit but, for example, we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow). We'd also like to be able to use the same mechanism to report that getting entropy failed (currently it's assumed to succeed, which means that the system fails open).

Why would the same mechanism not work for entropy failure? I was hoping that we could similarly set an error code (which is currently an unrefined -1 but which I hope to split up in v2) and jump to cleanup upon RNG failure.

What would your suggestion be?

@davidchisnall

Why would the same mechanism not work for entropy failure? I was hoping that we could similarly set an error code (which is currently an unrefined -1 but which I hope to split up in v2) and jump to cleanup upon RNG failure.

Yes, that would work. Currently the get-entropy function has a void return and assumes success. If allocation is in the same function as an entropy call, then it could also goto the same cleanup point.
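
For illustration, a minimal sketch of that pattern, with allocation failure and entropy failure sharing one cleanup label. The function and hook names (crypto_op, get_entropy) are placeholders, not the library's actual API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entropy hook; the real library's hook may differ. */
extern int get_entropy(unsigned char *out, size_t len);

int crypto_op(void)
{
  int ret = -1;
  unsigned char *buf = NULL;
  unsigned char seed[32];

  buf = malloc(1024); /* or whatever allocation hook is configured */
  if (buf == NULL)
  {
    goto cleanup; /* allocation failed */
  }

  if (get_entropy(seed, sizeof(seed)) != 0)
  {
    goto cleanup; /* entropy failed: same exit path */
  }

  /* ... actual work using buf and seed ... */
  ret = 0;

cleanup:
  free(buf); /* free(NULL) is a no-op */
  return ret;
}
```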

@hanno-becker
Contributor Author

we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow)

Does this need any co-operation from the application/library?

If not, it seems the current goto cleanup approach might be OK, if we also use it for other errors such as RNG failure.

@hanno-becker hanno-becker changed the title Make allocation of large structures/buffers configurable Make allocation of large structures/buffers configurable (with workspaces) Dec 16, 2025

@oqs-bot oqs-bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks

Benchmark suite       | Current: 7d16cd7 | Previous: 71773d9 | Ratio
ML-KEM-512 keypair    | 155119 cycles    | 155354 cycles     | 1.00
ML-KEM-512 encaps     | 163151 cycles    | 163232 cycles     | 1.00
ML-KEM-512 decaps     | 206149 cycles    | 206517 cycles     | 1.00
ML-KEM-768 keypair    | 261039 cycles    | 261154 cycles     | 1.00
ML-KEM-768 encaps     | 275418 cycles    | 275821 cycles     | 1.00
ML-KEM-768 decaps     | 338056 cycles    | 337878 cycles     | 1.00
ML-KEM-1024 keypair   | 395001 cycles    | 395312 cycles     | 1.00
ML-KEM-1024 encaps    | 421934 cycles    | 423303 cycles     | 1.00
ML-KEM-1024 decaps    | 505554 cycles    | 507323 cycles     | 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker
Contributor Author

hanno-becker commented Dec 16, 2025

I changed the approach to use an explicit error check in the code rather than expecting MLK_CUSTOM_ALLOC to do this. This makes it easier to switch between stack and heap allocation without running into C90 errors because of mixed code and declarations. It also gives us control over the error code, while the user-provided MLK_CUSTOM_ALLOC can follow the standard convention that failure is reported through a NULL return value.

Also, this is now based on #1389 which doesn't introduce workspaces.
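
As a sketch of the resulting split of responsibilities (the macro names MLK_CUSTOM_ALLOC/MLK_CUSTOM_FREE are taken from this discussion, but the signatures and the error handling shown are assumptions, not the PR's code):

```c
#include <stddef.h>

/* User-provided allocator: reports failure via a NULL return, like malloc. */
void *my_heap_alloc(size_t len);
void my_heap_free(void *p, size_t len);

#define MLK_CUSTOM_ALLOC(len) my_heap_alloc((len))
#define MLK_CUSTOM_FREE(p, len) my_heap_free((p), (len))

/* The library code then performs the explicit error check itself:
 *
 *   ws = MLK_CUSTOM_ALLOC(sizeof(*ws));
 *   if (ws == NULL)
 *   {
 *     ret = -1;           // library-chosen error code
 *     goto cleanup;
 *   }
 */
```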

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 3 times, most recently from a876fb7 to 7bbba13 Compare December 21, 2025 19:20
@hanno-becker hanno-becker changed the title Make allocation of large structures/buffers configurable (with workspaces) Bundle multiple allocations into workspaces Dec 21, 2025
Contributor

@mkannwischer mkannwischer left a comment

Thanks @hanno-becker, one small nit on the code, otherwise this looks good to me.

Two points remain to be discussed:

  1. The previous approach would allow a user to allocate different buffers in different memories, e.g., one could allocate the large matrix A on the heap while all other buffers stay on the stack. This is no longer possible with the workspace approach. This may be particularly relevant for embedded systems.
  2. As @davidchisnall noted here, allocating everything at once is more likely to fail, which is undesirable.

Any thoughts @davidchisnall, @gilles-peskine-arm, @hanno-becker?

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 2 times, most recently from 2c78165 to 793b43d Compare December 29, 2025 19:25
@hanno-becker
Contributor Author

hanno-becker commented Dec 30, 2025

Unless @davidchisnall @gilles-peskine-arm have concrete use cases in mind where we know that the monolithic allocation will cause problems, I would suggest adding workspaces for now and adjusting things on demand. This is internal and can be changed at any time. Especially for mldsa-native, I am hoping that this will also help us reduce the CBMC overhead from dynamic allocation which is currently blocking proofs.

@gilles-peskine-arm

I very much prefer the monolithic allocation approach, unless there is an identified use case for fragmented allocation or monolithic allocation is too hard to implement.

Let's consider the three typical types of memory that might be used.

  • Stack: if everything is on the stack, then monolithic or fragmented doesn't matter if the monolithic workspace is implemented optimally, meaning that it doesn't use more space than the stack would. E.g. a monolithic workspace can waste memory if it ends up being struct {a; b; c;} and you know that only two of a, b and c can ever be active at the same time. I would guess that this doesn't happen with crypto that is constant-time, because that tends to happen when you have code paths that depend on the data. But I'm not familiar enough with MLKEM and MLDSA to know if it's a problem.
  • Heap: a monolithic allocation saves a bit of RAM overhead, but can be problematic if your heap is already fragmented. I think that if fragmenting the allocations is the only way your application can run, you're already too tight on memory and even a small change in the order of operations, or just letting your application run a little longer, would cause heap exhaustion even with smaller fragments.
  • Global: if I have to reserve some global area for a costly operation, I'll just designate one area (or maybe one area per CPU core). I can't see any advantage for fragmentation.

There may be a reason to use different types of memory for some data. If a system has a small amount of memory that's better protected against snooping, that memory should be prioritized for key-equivalent secret data over message-specific secret data during signing. But I think if you want that level of control, you should be prepared to maintain your own platform-specific cryptography code, or at least your own patches. It's not a level of control that we intend to provide in TF-PSA-Crypto (whereas we would like to give a choice between stack and a user-provided buffer).

In addition, I would prefer to avoid having low-level functions that can fail: ideally some high-level function would reserve the necessary amount of memory (and generate the random data), and then all low-level functions can't fail. One benefit of that is that it reduces the risk of doing error handling wrong, although having a CBMC proof mostly mitigates that risk. Another benefit is the code size: the cost of all if (failed) goto cleanup adds up.
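
For illustration, a minimal sketch of that shape, with purely illustrative names: the entry point performs the single allocation (and would also draw randomness there), so the inner steps cannot fail:

```c
#include <stdlib.h>

/* Illustrative workspace; the real layout lives inside the library. */
typedef struct
{
  unsigned char buf[64];
  unsigned char kr[64];
} example_workspace;

/* Inner step: takes pre-allocated memory and cannot fail. */
static void inner_step(example_workspace *ws)
{
  (void)ws;
  /* ... deterministic, allocation-free work ... */
}

/* Entry point: the only place that allocates, and hence the only
 * place that can fail. */
int entry_point(void)
{
  example_workspace *ws = malloc(sizeof(*ws));
  if (ws == NULL)
  {
    return -1;
  }
  inner_step(ws);
  free(ws);
  return 0;
}
```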

MLK_ALIGN uint8_t buf[2 * MLKEM_SYMBYTES];
MLK_ALIGN uint8_t kr[2 * MLKEM_SYMBYTES];
MLK_ALIGN uint8_t tmp[MLKEM_SYMBYTES + MLKEM_INDCCA_CIPHERTEXTBYTES];
} workspace;

Could that struct be made public? In the short term, I think TF-PSA-Crypto will just put the workspace on the heap, or maybe give a choice between heap and stack. But in the longer term, we'd like the option of making it a global buffer, reserved at static link time.

Contributor Author

Could that not be achieved with a custom MLK_ALLOC which uses a static declaration?

Good point, that does work for a global buffer that's reserved for MLKEM, and I think it also allows protecting that buffer with a mutex (alloc = lock and free = unlock).

But I don't think it allows what I was actually thinking of, which is a buffer that can be used for different things, such as using the same buffer for MLKEM and MLDSA (but only one at a time).
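
A sketch of the first idea with hypothetical names: a statically reserved buffer handed out under a lock (alloc = lock, free = unlock), assuming at most one outstanding allocation at a time; the two-argument free signature is an assumption, not the library's actual hook:

```c
#include <pthread.h>
#include <stddef.h>

/* One statically reserved area, serialized by a lock. */
static unsigned char mlkem_arena[16384]; /* size for illustration only */
static pthread_mutex_t mlkem_arena_lock = PTHREAD_MUTEX_INITIALIZER;

static void *static_alloc(size_t len)
{
  if (len > sizeof(mlkem_arena))
  {
    return NULL; /* caller reports the error and jumps to cleanup */
  }
  pthread_mutex_lock(&mlkem_arena_lock); /* alloc = lock */
  return mlkem_arena;
}

static void static_free(void *p, size_t len)
{
  (void)p;
  (void)len;
  pthread_mutex_unlock(&mlkem_arena_lock); /* free = unlock */
}

#define MLK_ALLOC(len) static_alloc((len))
#define MLK_FREE(p, len) static_free((p), (len))
```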

(But to be clear, I'd prefer having the option to use a heap buffer quickly, and think about fancier stuff in version 2, than to spend a lot of time now to try to cover more use cases.)

Contributor Author

@hanno-becker hanno-becker Dec 31, 2025

But I don't think it allows what I was actually thinking of, which is a buffer that can be used for different things, such as using the same buffer for MLKEM and MLDSA (but only one at a time).

Got it. It feels wrong, though, that you would need to know the (internal) workspace structure for that. Would a global bump allocator not be enough for that purpose? This is what's already being tested [without workspaces] in test/test_alloc.c.
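
For reference, a generic bump-allocator sketch of the kind being discussed (this is not the code in test/test_alloc.c; names and sizes are made up):

```c
#include <stddef.h>

/* Shared arena, e.g. used by ML-KEM and ML-DSA, but only one at a time. */
static unsigned char arena[32768]; /* size for illustration only */
static size_t arena_top = 0;

static void *bump_alloc(size_t len)
{
  /* Round up to 8-byte alignment (a real implementation would also make
   * sure the arena itself is suitably aligned). */
  size_t need = (len + 7u) & ~(size_t)7u;
  void *p;

  if (need > sizeof(arena) - arena_top)
  {
    return NULL;
  }
  p = &arena[arena_top];
  arena_top += need;
  return p;
}

/* Reset once the whole operation is finished; individual frees are no-ops. */
static void bump_reset(void)
{
  arena_top = 0;
}
```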

Contributor

Something like this #1442?

I think independently we need to move all allocs to the top level, i.e., make sure that there is only a single invocation of MLK_ALLOC, right?

Having multiple workspaces isn't a problem if I just want one global workspace, as long as the function call structure is known. In the simplest case, if all the function calls are nested (and there's no recursion, obviously), I can just add their sizes for each of the entry points that I use, at the integration level just above mlkem-native. For that aspect, I only want help from mlkem-native if there are parts that never overlap, e.g. if entry_point() calls foo() and bar() successively and all three have their own workspace, then I'd like mlkem-native to help me figure out that the workspace size for entry_point() is sizeof(entry_point_struct) + max(sizeof(foo_struct), sizeof(bar_struct)), and not the sum of all three.
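
A sketch of that size arithmetic, with made-up struct names and sizes:

```c
#include <stddef.h>

/* Illustrative per-function workspaces; names and sizes are invented. */
typedef struct { unsigned char a[512]; } foo_ws;
typedef struct { unsigned char b[768]; } bar_ws;
typedef struct { unsigned char c[256]; } entry_point_ws;

#define WS_MAX(x, y) ((x) > (y) ? (x) : (y))

/* foo() and bar() run one after the other, never nested, so their
 * workspaces can occupy the same region below entry_point()'s own. */
#define ENTRY_POINT_WS_BYTES \
  (sizeof(entry_point_ws) + WS_MAX(sizeof(foo_ws), sizeof(bar_ws)))

/* An integrator can then reserve a single static area of that size. */
static unsigned char entry_point_arena[ENTRY_POINT_WS_BYTES];
```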

Contributor Author

@hanno-becker hanno-becker Dec 31, 2025

Having a single top-level allocation would make it easy to define an upper bound for the required memory. But we can also simply measure the total memory usage in the bump allocation test (we already do that, actually), set the global constants accordingly, and then size the bump allocator against those constants, so we notice automatically when they get out of date. That may, for now, be a more pragmatic solution.
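
A sketch of that approach, with a hypothetical constant name and a placeholder value (the real value would come from the measured high-water mark):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical public constant, e.g. in a config/api header. */
#define MLKEM1024_KEYPAIR_WS_BYTES ((size_t)44000) /* placeholder value */

/* Provided by the bump allocator used in the test harness. */
extern size_t bump_high_water_mark(void);

static void check_keypair_memory(void)
{
  /* ... run key generation through the bump allocator ... */
  assert(bump_high_water_mark() <= MLKEM1024_KEYPAIR_WS_BYTES);
}
```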

Contributor Author

@gilles-peskine-arm Do you have concerns with a 'simple-minded' approach of simply measuring the bump allocator's memory usage and providing defined constants for that?

Seems fine to me.

@mkannwischer
Contributor

Thanks @gilles-peskine-arm for this extensive comment. Makes sense to me.
In that case we'll go ahead and merge the workspaces in mlkem-native now and also try to get that to work in mldsa-native.

Contributor

@mkannwischer mkannwischer left a comment

Thanks @hanno-becker. LGTM.

@hanno-becker hanno-becker force-pushed the alloc_free_config branch 2 times, most recently from b0b7a86 to c37d1ea Compare January 1, 2026 17:34
This commit introduces 'workspace structures', which are
function-local structures wrapping all objects and buffers
used by the function into a single struct.

The benefit of the workspace struct is simpler allocation
boilerplate: instead of allocating multiple structures and
handling their success/failure separately, we now only need
to allocate a single object per function.

Signed-off-by: Hanno Becker <beckphan@amazon.co.uk>
@hanno-becker hanno-becker force-pushed the alloc_free_config branch 3 times, most recently from 7840809 to 76a56c6 Compare January 1, 2026 20:18
@hanno-becker hanno-becker merged commit e379df7 into main Jan 2, 2026
779 checks passed
@hanno-becker hanno-becker deleted the alloc_free_config branch January 2, 2026 03:55
@davidchisnall

Heap: a monolithic allocation saves a bit of RAM overhead, but can be problematic if your heap is already fragmented. I think that if fragmenting the allocations is the only way your application can run, you're already too tight on memory and even a small change in the order of operations, or just letting your application run a little longer, would cause heap exhaustion even with smaller fragments.

This really isn't the case for embedded targets. Heap sizes of well under 1 MiB are almost universal, and under 128 KiB is common. There's a massive difference in the likelihood of being able to satisfy a single 60 KiB allocation versus two 30 KiB ones in a long-running application. A single 60 KiB allocation is likely to be close to half of the total heap size, which means that you can't allocate it even with a small amount of existing fragmentation.

@hanno-becker
Contributor Author

hanno-becker commented Jan 5, 2026

Thank you @davidchisnall for chiming back in (and happy new year)!

I am open to reconsidering this. On mldsa-native, the hope was that fewer allocations would reduce CBMC proof complexity, but so far this hope hasn't materialized (and we have not yet introduced workspaces there). It might be more effective to have a wrapper doing the allocation and passing a myriad of pointers to a function doing the actual work. That's harder to read, but it should keep the proof complexity in check (the visual clutter added to the CBMC specs might be contained with something like pq-code-package/mldsa-native#817).

Also, @gilles-peskine-arm's request for a public constant indicating the memory usage can be achieved without workspaces (cf #1379 (comment)).

@hanno-becker
Contributor Author

@mkannwischer @gilles-peskine-arm Please chime in.

@gilles-peskine-arm

This really isn't the case for embedded targets. Heap sizes of well under 1 MiB are almost universal and under 128 KiB is common. There's a massive difference in the likelihood of being able to satisfy a single 60 KiB allocation and two 30 KiB ones in a long-running application. A single 60 KiB allocation is likely to be close to half of the total heap size, which means that you can't allocate it even with a small amount of existing fragmentation.

Speaking as a former embedded OS architect: the systems I worked on reserved a static (or startup-time) amount of memory for critical systems such as cryptography, so that we knew in advance that crypto wouldn't fail midway due to lack of resources. This was distinct from the heap, which was used in less critical components where failing midway was acceptable.

Now, as a crypto library maintainer (TF-PSA-Crypto), I have known users who want to reserve a static area for ML-DSA verification, because the system's stack limit is too small, and the verifier has no heap. I also have known users who have a heap, including resource-constrained platforms where a few kilobytes can make a difference, but many of these users use the heap implementation that comes with our crypto library because they don't use a heap for anything else, and they would prefer if our crypto library didn't require a heap at all. (Not an option for our RSA code or most of our ECC code, but a long-term goal for us.) I also have known users who have a large heap and are concerned about how many thousands of concurrent connections they can establish, but this kind of high-end application is less of a priority for us.

So, in my experience, the case of a heap that's just big enough if you fragment the allocation is not very important; it's not something that TF-PSA-Crypto intends to cater for. However, it's a vast world, and if there are environments where this is desirable, I certainly won't argue that mldsa-native should not provide this capability. That's not my call to make.
