Replies: 15 comments 3 replies
-
Subsystems Required: Nice to have
-
Related to: …
-
Responsible developers: @marcpaterno and @sabasehrish.
-
Here are some types of chunking needed for DUNE and what implications chunking may have. It has a focus on FD charge data and Wire-Cell implementations so is definitely not comprehensive.

Terms

Wire-Cell charge waveform simulation

Basic transformation: The …

There are several types of chunking relevant to this simulation: …

Wire-Cell charge waveform signal processing

Basic transformation: Signal waveforms represent a reconstruction of the distribution of drifted ionization charge in (transverse) space vs time dimensions of each tomographic wire-plane view. The samples of a signal waveform are in units of number of (drifted) ionization electrons per tick per channel. The signal waveforms are highly sparse and can be represented in a space-efficient way either with sparse arrays or as compressed dense arrays (zero padding the sparse regions). There are two types of chunking that are relevant: …

Wire-Cell charge sim+sigproc

As a special case, when both simulation and signal processing are needed, it is desired (at least for large scale production) to NOT expose …

Wire-Cell 3D charge imaging

This process reconstructs, with coarse resolution, locations in space/time likely to contain ionization electron signal. It is a per-APA transformation and essentially a streaming algorithm. Thus, it is robust against space-chunking at the APA level and any reasonable time-chunk.

Wire-Cell charge cluster stitching

WC (and other) reconstruction chains form "clusters" of some type that represent a high resolution reconstruction of ionization locations. In WC, and for the case of compact (nominal, not extended) data, clusters are constructed first on a per-TPC basis. They are then "stitched" across the two TPCs of one APA and then across neighboring APAs. Each type of stitching requires assembly of any chunk-level clusters such that the boundaries are spanned. This can be pair-wise at the 2TPC->APA stitching, and then all APA-level clusters can be assembled for the cross-APA stitching. Finding clusters from extended data poses a problem in the face of chunking due to a given set of blobs that should become a single cluster landing on a chunk boundary. Some possible solutions: …

Wire-Cell Charge-Light matching

Charge clusters and "flashes" reconstructed from the optical detection system must be matched in space and time in order to absolutely locate the cluster. The DUNE FD design does not include optical boundaries at the TPC or APA level and so the matching is done with whole-detector charge and light information. Any prior chunking of these data must be such to allow the required assembly. As with clustering, chunking in time may be required for input clusters and/or flashes and similar solutions can be considered ("chunk and hope" vs "streaming alg").

Cross-chain merging

DUNE has multiple, independent reco chains. Eg Wire-Cell and Pandora both split off after signal processing in order to implement different strategies. It is necessary to allow data products from one chain to "cross over" to another. This is needed for performing comparisons and so that one chain can simply input results from the other to form a subsequent hybrid chain. Each consumer at the merge will impose some requirements related to the chunk boundaries of the data products from each stream. Even in the unlikely case that identical chunk boundaries existed on both streams, the node consuming the two streams may have special needs. Eg, it may require consuming a FIFO queue of some depth of data products from each stream.
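To make the chunk-boundary issue concrete, here is a minimal sketch of time-chunking with an overlap margin, the "chunk and hope" flavor of solution mentioned above. This is not WCT code; Sample, the chunk span, and the overlap are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// A hypothetical time-ordered sample; only the fields needed for this sketch.
struct Sample {
    double time;   // arbitrary time units
    double charge; // arbitrary charge units
};

// Split a time-ordered sequence into chunks covering [lo, lo + chunk_span),
// each extended by `overlap` on the trailing edge.  A blob or cluster that
// straddles a chunk boundary then appears (at least partially) in two chunks
// and can be stitched or de-duplicated downstream.
std::vector<std::vector<Sample>>
chunk_with_overlap(const std::vector<Sample>& samples,
                   double chunk_span, double overlap)
{
    std::vector<std::vector<Sample>> chunks;
    if (samples.empty()) return chunks;

    double lo = samples.front().time;
    std::size_t begin = 0;
    while (begin < samples.size()) {
        const double hi = lo + chunk_span + overlap;
        std::vector<Sample> chunk;
        for (std::size_t i = begin; i < samples.size() && samples[i].time < hi; ++i)
            chunk.push_back(samples[i]);
        if (!chunk.empty()) chunks.push_back(std::move(chunk));
        lo += chunk_span;
        // Advance past samples that fall entirely before the next window.
        while (begin < samples.size() && samples[begin].time < lo) ++begin;
    }
    return chunks;
}
```

The overlap is what lets a downstream stitcher span chunk boundaries; the alternative discussed above is a genuinely streaming algorithm that carries state across chunks instead.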
-
To provide some context, we discussed these slides at yesterday's meeting to start the discussion.
-
Hi @brettviren. Saba and I are just getting back to looking at the …
-
Hi @brettviren and @absolution1, we're trying to move forward on the Phlex design, and we'd like to hear from you whether @marcpaterno and @sabasehrish's updated wire-cell workflow document accurately captures one of the use cases—i.e. the use case that Brett included above with the title "Wire-Cell charge waveform simulation". Could you take a look please and give us your thoughts? Thanks very much.
-
Hi @knoepfel et al and sorry for the slow response.

The granularity of this example is smaller than the current simulation implemented with the Wire-Cell toolkit running inside art/larsoft. It would "bust apart" the WCT graph execution for no perceived benefit. As a thought exercise to draw out some patterns, the example is perhaps okay, but I'd not want to see this particular graph implemented in phlex. It is not useful to reinvent WCT. Rather, the focus should be on how phlex can run WCT and other "payloads".

Also, the note describes a "time-binned depos" being converted into a "drift-binned depos". That's not sufficient for how the physics works. There is no clean one-to-one mapping between any possible pre- and post-drift binnings because the drift distance/time swap causes the order to change.

The note assumes "all SNB depos in memory". I don't know if that is even feasible. It certainly is if one only considered SN interactions, but 100 s of radiologicals and cosmics will be a lot of data.

Instead I suggest we consider ways to operate in a "chunked streaming" mode throughout. This might start with a stream of G4 final state kinematics for individual interactions over an extended time. Each interaction can be fed through Geant4 one at a time to produce a set of depos. The sets of depos may overlap and so need some kind of windowed/streamed sort. The content of the window is bound and determined by the data (a similar windowed sort is done in the WCT drifter). The time-ordered depos can be streamed into WCT sim, and WCT sim would output a time-ordered stream of chunks of ADC waveforms. In this serial-streamed mode, the "leakage" can be handled inside of WCT and not exposed to phlex.

OTOH, this sequential stream will lead to rather long-running jobs. Simulating 100 s of 1 APA is order a CPU-week of compute, not counting G4 time. If monopolizing a single CPU for order a week is not feasible then we may consider a scatter-gather approach. In this case we may segment the ordered depo stream and scatter those segments to many WCT sim jobs. The output of these must not yet be "leakage free" ADC, but rather signal+noise waveform chunks at voltage level and with "leakage" tails. The tails must be snipped, transferred, and added to the subsequent chunk in time, and then the result (again, a voltage-level waveform) sent into the WCT digitizer to produce (leakage-free) ADC waveform chunks. A final gather of the resulting ADC waveform chunks would/could be done in order to assemble contiguous runs of chunks, eg 1 APA-second per file (approx 10 GB).

Though, actually, DUNE already generally does not want to stop sim at ADC but immediately follow on with signal processing to avoid pointlessly storing large ADC data. So the above scatter-gather gets a bit more complicated. In any case, these are more practical granularities and data flow patterns to consider.
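As a rough illustration of the scatter-gather tail handling described above (snip each chunk's leakage tail and add it onto the next chunk in time before digitization), here is a minimal sketch; VoltageChunk and the fixed-layout tail are assumptions for illustration, not WCT types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One voltage-level waveform chunk from a scatter job: `nominal` in-window
// samples followed by leakage samples that spill past the chunk's time window.
struct VoltageChunk {
    std::vector<double> samples; // size == nominal + tail length
    std::size_t nominal;         // number of in-window samples
};

// Snip each chunk's leakage tail and add it onto the head of the next chunk
// in time order, returning leakage-free, nominal-length chunks that can then
// be digitized independently.
std::vector<std::vector<double>>
absorb_leakage(std::vector<VoltageChunk> chunks)
{
    std::vector<std::vector<double>> out;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        auto& c = chunks[i];
        assert(c.samples.size() >= c.nominal);
        if (i + 1 < chunks.size()) {
            auto& next = chunks[i + 1].samples;
            const std::size_t tail = c.samples.size() - c.nominal;
            // Add this chunk's tail onto the start of the following chunk.
            for (std::size_t j = 0; j < tail && j < next.size(); ++j)
                next[j] += c.samples[c.nominal + j];
        }
        // Keep only the in-window part of this chunk.
        c.samples.resize(c.nominal);
        out.push_back(std::move(c.samples));
    }
    return out;
}
```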
-
Thanks for the response, @brettviren.
It is neither our desire nor intent to reinvent WCT. We are trying to understand the details of various WCT workflows that Phlex may need to support. It is certainly possible that a WCT workflow could be represented as a "black box" to the framework—this is (more-or-less) how things work with art/LArSoft jobs. You're well aware of the awkward back-and-forth between art and WCT through …

We're going to chat a bit on our end, and then I'd like to propose a joint Zoom meeting between the WCT folks and the Phlex developers. It's good to have a written record of our conversations (via these GitHub discussions), but we are likely getting to the point where a face-to-face meeting will be more efficient.

Thanks for your patience, and stay tuned.
-
Hi @knoepfel

A zoom chat sounds good.

Let me try to explain my thinking about "it's not useful to reinvent WCT" as that statement may have come off wrong. My concerns are all technical (and driven to minimize future effort).

I think you hit on a core issue, which is indeed the nature of the interface layer between the framework and any given "toolkit". Toolkit here means WCT but also Pandora or any other implementation that has its own data model and possibly also its own execution model (eg, the optional use of the WCT's graph execution engines).

The data model interface is more fundamental in my mind. Each side of that interface has different requirements on its own data model. The requirements for existing toolkits are already baked into their implementations and we must keep those as immutable. It then becomes a challenge for PHLEX to provide its own data model that first works with all the existing toolkits' models and second makes passing data through the interface between framework and toolkit model fast (for some definition of "fast").

I think the "works" part poses no serious challenge. We can always come up with something. I think "fast" must be measured by comparing the time for data to traverse framework->toolkit->framework and toolkit->framework->toolkit paths. That is, the time spent in the ->'s, in just "framework" and in just "toolkit" must all be compared.

What we have today in art/larsoft is a data interface that requires serializing data between the framework data model (meaning larsoft) and the toolkit data model. Eg, in the larwirecell WCT nodes we convert every bit of data between WCT IFrame and larsoft vector<raw::RawDigit> or vector<recob::Wire>. This satisfies the "works" but satisfying the "fast" depends on the nature of the job.

And here is where my concern about the granularity of WCT graphs in PHLEX comes into play. For many individual WCT graph nodes, even the expensive serialization interface poses insignificant overhead. But for some nodes, this overhead would dominate. And the number of these "too fast" nodes is increasing as we develop more small-scope GPU-dominated nodes. These fast nodes implicitly rely on fast data transfer to be relevant. With WCT's TbbFlow, message passing by TBB's flow_graph is measured in MHz, which makes data transfer times insignificant even for our fastest GPU nodes. If we inserted larwirecell-style serialization as a fine-grained transfer method, it would largely degrade the benefits of developing these fast GPU nodes. Indeed, we take pains to even avoid a tensor leaving the GPU as the WCT data object that "holds" the torch::Tensor is transferred between flow graph nodes.

So, with PHLEX, if we want a fine-grained toolkit, we should think of a different framework data model that does not require data serialization. The only model I can think of that has a chance to be "fast" is one based on data encapsulation.

Happily, a PHLEX encapsulation data model is a good fit to WCT's data model, which is based on an abstract base class hierarchy of interface classes rooted in a single IData. Furthermore, WCT data objects are passed between WCT nodes as shared_ptr and even type erased all the way down to boost::any inside the WCT graph execution engines.

This then enables a PHLEX data model that encapsulates WCT (and other toolkit?) data objects via lightweight pointer-like instances. We might think of a PhlexToolkitData<T> PHLEX data type (hopefully with some nicer name). The T might be as type-free as boost::any or expose some level of toolkit-specific typing, eg base shared_ptr<IData> for WCT on up to leaf data interface types like shared_ptr<IFrame> or shared_ptr<ITorchTensorSet> (the main data type used by the GPU-heavy nodes).

A really nice side effect of exposing the WCT leaf data interface types is that it enables people to develop new "PHLEX-native" nodes that can directly consume WCT data model types without writing full-blown-and-phlex-wrapped WCT flow graph node types. This would "steal" developer "mind share" from the WCT world, but whatever lowers the bar for people developing what they need is usually a good thing.

Another benefit of this "encapsulation" data model interface is that we can probably get away with developing a single, general-purpose, templated PHLEX equivalent to the many type-specific data converters in larwirecell that are required for that "serialization" data model interface.
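To sketch what such an encapsulating data type could look like: PhlexToolkitData is only a placeholder name from the comment above, the code uses std::any rather than boost::any, and nothing here reflects an actual Phlex or WCT interface.

```cpp
#include <any>
#include <stdexcept>
#include <utility>

// Sketch of a framework-side wrapper that encapsulates a toolkit data object
// without serializing it.  T might be a toolkit interface pointer such as
// std::shared_ptr<SomeToolkitInterface>, or something fully type erased.
template <typename T>
class PhlexToolkitData {
public:
    explicit PhlexToolkitData(T payload) : payload_(std::move(payload)) {}
    const T& get() const { return payload_; }

private:
    T payload_;  // held as a (shared) pointer or handle; never copied out wholesale
};

// Fully type-erased variant: hold anything, recover it by asking for its type.
class ErasedToolkitData {
public:
    template <typename T>
    explicit ErasedToolkitData(T payload) : payload_(std::move(payload)) {}

    template <typename T>
    const T& as() const {
        const T* p = std::any_cast<T>(&payload_);
        if (!p) throw std::runtime_error("ErasedToolkitData: wrong type requested");
        return *p;
    }

private:
    std::any payload_;
};
```

In this picture a payload held as shared_ptr<IFrame> would move through the framework as PhlexToolkitData<std::shared_ptr<IFrame>>, so only a pointer is copied between nodes.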
-
Hi all,
I’m just returning from holiday and catching up.
If there’s a zoom call please loop me in.
Cheers,
Dom
-
@knoepfel From listening to your nice presentation at the DUNE collab meeting, I think the "data encapsulation" approach I described above is well aligned with the PHLEX data model. It makes me think about the possibility of (re)interpreting existing WCT graphs defined in Jsonnet to be executed as a PHLEX (sub)graph. IOW, I can start to see the shape of a new, 3rd WCT graph execution engine which looks like the current Pgraph and TbbFlow to the WCT side but looks like a "subgraph definer" to PHLEX (which would provide the graph execution engine). This could allow for a more fine-grained, but still generic, interface layer compared to the pattern in …
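A bare-bones sketch of the kind of adapter such a third engine might need: something that looks like an ordinary node callable to the host framework but forwards a type-erased payload to a strongly typed toolkit callable. All names are invented; neither the WCT node interfaces nor a real PHLEX scheduling API is reproduced here.

```cpp
#include <any>
#include <functional>
#include <utility>

// A framework-facing node: consumes and produces type-erased payloads.
using ErasedNode = std::function<std::any(const std::any&)>;

// Adapt a strongly typed toolkit callable into an ErasedNode.  The host
// framework schedules ErasedNode instances; the toolkit's concrete types
// never leak into the framework's own data model.
template <typename In, typename Out, typename F>
ErasedNode adapt_toolkit_node(F toolkit_call)
{
    return [toolkit_call](const std::any& in) -> std::any {
        const In& typed_in = std::any_cast<const In&>(in);
        Out typed_out = toolkit_call(typed_in);
        return std::any{std::move(typed_out)};
    };
}

// Example usage (types are stand-ins):
//   ErasedNode n = adapt_toolkit_node<int, double>([](int x) { return x * 0.5; });
//   std::any out = n(std::any{4});   // holds double{2.0}
```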
-
We agree that the core issue in question is the nature of the interface between the framework and the algorithms.

Our understanding is that WCT commonly uses data that are essentially vectors of shared pointers to some interface type. The pointed-to objects, of course, must be of some concrete type. If this is correct, then the current early version of Phlex can likely already use the style of data products from WCT. We have a test in Phlex that demonstrates it will work correctly when passing std::vector<std::unique_ptr<Abstract>> objects between algorithms (where Abstract is an abstract class). More specifically, an algorithm that expects a const reference to a vector of shared pointers to Base can be declared to Phlex, and Phlex would handle propagating the data type correctly. An algorithm can also return such a type.

If a Phlex workflow is going to use algorithms that expect LArSoft-style types (e.g., vector<raw::RawDigit>) and other algorithms that expect WCT-style types (e.g., WireCell::ITrace::vector), then translation from one to the other is needed. We are imagining at least two ways of doing this, one of which might be automatically scheduled by Phlex if the LArSoft and WCT types model the same data-product concept (we can say more about that another time).

A Phlex workflow that closely matches those done by larwirecell could then consist of two or more components. The first is a translator that turns a LArSoft type into the input needed by the WCT-based algorithms to follow. Next would be one (or more) Phlex nodes that deal only with WCT-defined data types. The entire WCT workflow could be expressed as one Phlex node (following the LArSoft->WCT translation), or it may be desirable to factorize the WCT workflow further to allow for data-product provenance tracking, taking advantage of framework scheduling, etc. If translation back to LArSoft types is needed for follow-up algorithms, then last would come another translator from WCT types to LArSoft types.
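Here is a small sketch of the data shapes described above: an algorithm consuming a const reference to a vector of smart pointers to an abstract interface, and a translator between a LArSoft-style and a WCT-style container. Abstract, LArSoftStyleDigit, and WctStyleTrace are stand-ins, and the actual Phlex registration API is not shown.

```cpp
#include <memory>
#include <vector>

// Stand-in for a toolkit interface type (e.g. a WCT IData-like base).
struct Abstract {
    virtual ~Abstract() = default;
    virtual double value() const = 0;
};

// An algorithm as the framework might see it: a free function taking a const
// reference to a vector of shared pointers to the abstract interface.
double total(const std::vector<std::shared_ptr<Abstract>>& input)
{
    double sum = 0;
    for (const auto& p : input) sum += p->value();
    return sum;
}

// Stand-ins for the two data models on either side of a translator node.
struct LArSoftStyleDigit { std::vector<short> adcs; };
struct WctStyleTrace     { std::vector<float> charge; };

// A translator from one data model to the other; in a real workflow this is
// where larwirecell-style conversion code would live.
std::vector<WctStyleTrace>
to_wct(const std::vector<LArSoftStyleDigit>& digits)
{
    std::vector<WctStyleTrace> out;
    out.reserve(digits.size());
    for (const auto& d : digits) {
        WctStyleTrace t;
        t.charge.assign(d.adcs.begin(), d.adcs.end());
        out.push_back(std::move(t));
    }
    return out;
}
```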
-
Hi @marcpaterno. Yes, this is essentially correct. Though the unit of data passed between WCT data flow graph nodes is always a …

I think your view of sometimes applying WC/LS data translation and sometimes passing WC data as-is is good. And, indeed, it is basically the …

WC vs LS data models and their inter-conversion could be an entire subgroup topic. There is some intersection and quite a lot of difference between them, and writing converters is always non-trivial and non-fun (though doable).

One approach that might be fruitful going forward is to define a generic data model that is flexible enough to faithfully represent the info in the WC and LS (and other) data models. WCT actually does this with its own data model in the "WCT tensor data model" (link below). It consists of a generic "low level" spec which is HDF5-inspired and is comprised of dense multi-dimensional tensors and, for each tensor, a JSON-like metadata object. The "high level" TDM spec then maps each complex, structured WC data type to the low-level TDM. Something like this, perhaps better thought out by more people than just me, could become a sort of "schema bus" that allows many disparate toolkits to share data most easily.

https://github.com/WireCell/wire-cell-toolkit/blob/master/aux/docs/tensor-data-model.org
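The low-level half of such a "schema bus" could be as simple as the sketch below: a dense N-dimensional array paired with free-form metadata. This is only inspired by the WCT tensor data model linked above; it does not reproduce that spec, and the field and key names are invented.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Low-level unit: a dense N-dimensional array plus free-form metadata.  A real
// implementation would use typed storage and a proper JSON object; a
// string->string map and double storage keep the sketch self-contained.
struct Tensor {
    std::vector<std::size_t> shape;              // e.g. {nchannels, nticks}
    std::vector<double> data;                    // row-major dense storage
    std::map<std::string, std::string> metadata; // e.g. {"datatype", "frame"}
};

// High-level unit: a set of tensors that together represent one structured
// data object (a frame, a cluster set, ...), keyed by role.
using TensorSet = std::map<std::string, Tensor>;

// Example: pack a per-channel waveform block as one tensor of a set.
TensorSet pack_frame(std::size_t nchannels, std::size_t nticks,
                     const std::vector<double>& samples)
{
    Tensor t;
    t.shape = {nchannels, nticks};
    t.data = samples;                 // assumed already row-major
    t.metadata["datatype"] = "frame";
    TensorSet ts;
    ts["frame"] = std::move(t);
    return ts;
}
```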
-
We are closing this discussion as it was intended to cover data-chunking explorations for FY25. That effort is now complete, but we want to continue the WireCell discussion in the context of early adoption (see discussion #14). |
-
DUNE US S&C R&D item 101
Data chunking is intended to process a logical data product that is too large to fit in memory at once. This demonstrator requires several things:

- std::span<T> vs. std::vector<T> could imply that some data can be chunked for an algorithm and some cannot (see the sketch below).

To produce a demonstrator we are introducing a concept of a chunk-able data product (e.g. a sequence of waveforms); in general a chunk-able data product will be a sequence of something.
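A small sketch of the distinction the std::span<T> vs. std::vector<T> point is drawing (Waveform is a placeholder element type; std::span requires C++20): an algorithm written against a span can be handed one chunk of a chunk-able sequence at a time, while one that demands a whole vector forces the entire logical data product into memory.

```cpp
#include <span>
#include <vector>

struct Waveform { std::vector<float> samples; };  // placeholder element type

// Chunk-friendly: accepts any contiguous view, so the framework may hand it
// one chunk of a larger (possibly never-fully-in-memory) sequence at a time.
double summed_charge(std::span<const Waveform> chunk)
{
    double total = 0;
    for (const auto& w : chunk)
        for (float s : w.samples) total += s;
    return total;
}

// Chunk-hostile: requires the entire sequence to exist as one vector, so the
// whole logical data product must fit in memory before the call.
double summed_charge_whole(const std::vector<Waveform>& all)
{
    return summed_charge(std::span<const Waveform>{all});
}
```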