[SYCL][Graph] Add initial draft of malloc/free nodes ...#352
[SYCL][Graph] Add initial draft of malloc/free nodes ...#352
Conversation
EwanC
left a comment
There was a problem hiding this comment.
Tried not to nitpick this too much until we agree on the high level design, but think this looks good. API is simple, clear mapping from CUDA-Graph entry-points for porting code, and the dependency on sycl_ext_oneapi_virtual_mem hopefully means we can hand off having to go into any low level details.
| won't be valid until the allocation is made on graph finalization, as allocating | ||
| at finalization is the only way to benefit from the known graph scope for optimal | ||
| memory allocation, and even optimize to eliminate some allocations entirely. | ||
|
|
There was a problem hiding this comment.
I think people will have the question "if I have two executable graphs created from the same modifiable graph, will that node be different memory allocations?", so we should state something about that.
|
|
||
| | | ||
| Returns a pair of a pointer to memory and a node. The pointer is allocated on the `device` | ||
| that is associated with current graph by first execution of the `command_graph`. |
There was a problem hiding this comment.
first execution of the
command_graph
Question about whether we want to tighten this more to say something like "allocated during the finalization of command-graph to the returned executable graph".
The current wording leaves open the suggestion that if you never execute the graph nothing will be allocated, but we probably will do the allocation on finalization - which means less latency when submitting the graph.
There was a problem hiding this comment.
how about "may be allocated during the finalization ... "? That way we can keep our options, depending on the backend.
There was a problem hiding this comment.
shall we align with line 1723 here?
in line 1723, it says allocating at finalization is the only way
|
|
||
| Exceptions: | ||
|
|
||
| * Throws synchronously with error code `feature_not_supported` if any devices in `context` |
There was a problem hiding this comment.
we don't have a context parameter, "if device associated with the command graph does not not have ..."
There was a problem hiding this comment.
Thanks, I meant: "any device in the context associated with the graph ..." but let's be more accurate!
| can be passed here with a list of nodes to create dependency edges on. | ||
|
|
||
| |=== | ||
|
|
There was a problem hiding this comment.
Heads up that once intel#12366 merges i think we'll need to extend the node_type enum to include malloc_device and free
sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
Outdated
Show resolved
Hide resolved
|
An example with structured binding could look like: ...
sycl_ext::command_graph my_graph(my_queue);
float scalar = 2.0f;
size_t N = 1024;
float *ptrX = malloc_device<float>(N, my_queue); // allocate device memory
auto [ptrY, node_a] = my_graph.add_malloc_device<float>(N); // only reserve virtual memory
auto node_b = my_graph.add([=](handler& cgh){cgh.parallel_for(N, [=](id<1> it){ptrX[it] += scalar * ptrY[it];});}, {sycl_ext::property::node::depends_on(node_a)});
auto node_c = my_graph.add_free(ptrY);
auto my_exec = my_graph.finalize(); // may allocate physical memory and map
my_queue.ext_oneapi_graph(my_exec).wait(); |
| std::pair<void*,node> | ||
| add_malloc_device(size_t num_bytes, const property_list& propList = {}); |
There was a problem hiding this comment.
Could we have this templated so that users can call this similarly to sycl::malloc_device<T>? Just saves the users having to do a cast after the call.
There was a problem hiding this comment.
Yes, will add. My only reason to start with a single version of this API was not having to update too many functions, in case we change the basic design of this interface.
| USM allocations to be used in the graph submission. Before to coming to this | ||
| recommendation we considered the following explicit graph building interfaces | ||
| for adding a memory allocation owned by the graph: | ||
| The following interfaces enables users to define a memory allocation/free operation |
There was a problem hiding this comment.
is there another plan/draft to record the memory allocation/free with record&replay APIs?
The current functions such as sycl::malloc_device are not recorded into a graph with record&replay APIs, we may need new async functions such as sycl::malloc_device_async
... with backend support for deferred memory allocation.
As proposed in intel#8954