Data Product Translation Properties #535

greenc-FNAL · 2026-04-17T17:12:52Z

greenc-FNAL
Apr 17, 2026
Maintainer

Introduction

This discussion started on ZOOM with the following participants:

Brief

We will be continuing discussion and planning of the Data Product Translation subsystem. Specifically, Kyle has asked that we develop a list of properties, and provided some proposals as a seed:

Some proposed properties

Translators do not remove or replace any data-product provenance

Translators may add additional data-product provenance to record translation having taken place

Translators may need to propagate (a reference to) the original data product for object-ownership purposes

Translators should not be duplicated if the same translation is required by multiple computational nodes

Translators are part of the computational graph

As a way of firing up our thinking in this direction, please—if you have time:

Consider the above suggestions and make a note or two for or against.
Think about a couple of new properties to suggest and discuss.

Notes

2026-04-17

Translators do not remove or replace any data-product provenance
- Provenance -> includes labels, here
- WL: currently labels are modified in order to distinguish the translated product from the original, and to allow the node requiring the translated input to find it in preference to the original. We need an alternative to this if we are to avoid replacing labels in provenance.
Translators may add additional data-product provenance to record translation having taken place
- Perhaps by adding a tag to the original?
- @greenc-FNAL: User nodes will see something that they will treat as a concrete data product, regardless of whether it is a "primary" entity created by another user node or provider, or whether it is the result of a translation. 99% of user node metadata needs will be ignorant of translation. Rarely, translation-specific metadata (assuming it exists) may need to be accessed.
Translators may need to propagate (a reference to) the original data product for object-ownership purposes
- @marcpaterno: "May"
- @greenc-FNAL: Object management is required by the system, and must account for translations and their association with output data products. It is unclear to me whether and how translator nodes need to be aware of and interact with this functionality.
- @wlav: An algorithm that is preceded by a translator node for its input, can be expected to have a translator node to its output. The minimum lifetime of the input data products is clear (at least until the algorithm has completed), but there is no equivalent for the output. In particular, the algorithm only produces an intermediary, and it is the output translator node that produces the expected output data product. The intermediary has no further place in the graph (because no one can possibly have asked for it). However, in some cases it is beneficial to keep the intermediary alive to prevent copying. A mechanism should be in place to express that.
- [ZOOM discussion truncated due to lack of time]
Translators should not be duplicated if the same translation is required by multiple computational nodes
- [No ZOOM discussion due to lack of time]
Translators are part of the computational graph
- Yes, but: (@marcpaterno) They are part of the computational graph, but they are not computational nodes. A translator will never be recorded as the "creator" of a translated data product.
- @wlav: possible complication. Currently limited in Python to up to three data products explicitly. In order to support an arbitrary number, we need to collect all the data products into an "internal" collection. Wim has been treating these collector nodes as "translators." They may need to be classified as something else (e.g. aggregators), but integrating translation of individual data products into the process has implications for graph construction and node interaction.

Discussion continuation 2026-04-24

Marc presented some slides for discussion, laying out some of his thoughts on data product concepts, translators, and graph construction.

There was some disagreement on whether an explicit provider must specify which layer is the destination for its output product(s). @beojan believes it must; others (@marcpaterno, @greenc-FNAL, ?, ...) believe that the relevant layer can be deduced where it is unambiguous.

Discussion continuation 2026-05-01

Continuing with point 3 from above:

Translators may need to propagate (a reference to) the original data product for object-ownership purposes
- ... (see above discussion from 2026-04-17)
- @marcpaterno: Kyle pointed out that the internal representation is not restricted to the "bare" data product, and there are no major obstacles to attaching any extra information as/when it might become necessary.
- @sabasehrish: "If needed, it can be done."
- @marcpaterno: Access to this extra information from user algorithms would be "opt-in," and therefore it is permissible for this to be Phlex-dependent.
- @beojan: This might include "tracking" the different translations of a data product in order to avoid duplication of translator nodes in the graph.
- @marcpaterno / @greenc-FNAL: translators are "first-class" nodes from the point of view of flowgraph, but mostly irrelevant/invisible to the user, and probably optional in any visual representation of the graph.
Translators should not be duplicated if the same translation is required by multiple computational nodes
- Per discussion of (3) above, we believe we have agreement that, when incorporated into the graph as envisioned, translators for a given data representation conversion will be singular by construction.
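
One way to picture "singular by construction" is a graph builder that looks up translator nodes by the conversion they perform, so that a second request for the same conversion returns the node that already exists. The following is only a minimal sketch under that assumption; `TranslationKey`, `TranslatorRegistry`, and `node_for` are hypothetical names for illustration and are not part of Phlex.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

// Hypothetical key: one translator node per (product, source representation,
// target representation). None of these names come from Phlex.
using TranslationKey = std::tuple<std::string, std::string, std::string>;

class TranslatorRegistry {
public:
  // Returns the id of the translator node for this conversion. The node is
  // created on the first request only; later requests reuse the same id.
  std::size_t node_for(std::string const& product,
                       std::string const& from,
                       std::string const& to)
  {
    // try_emplace leaves an existing entry untouched, so the id assigned on
    // the first request is never overwritten.
    auto const it =
      nodes_.try_emplace(TranslationKey{product, from, to}, nodes_.size()).first;
    return it->second;
  }

  std::size_t size() const { return nodes_.size(); }

private:
  std::map<TranslationKey, std::size_t> nodes_;
};
```

With this shape, two computational nodes that both need the same product in the same target representation would be wired to one translator node, while a conversion in the opposite direction would get a node of its own.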

More translation subsystem properties:

Discovery and selection of translation plugins. Use case: a developer relying mostly on production-built translators, but possibly overriding/adding a small number of translators for their own needs.
1. Translators are not configured by the framework user: therefore they need to be discovered by the framework based on the needs defined by the configuration.
2. The framework should be able to discover translators and select which ones should be part of the graph without needing to load every discovered translator DLL.
3. The translator information necessary for the graph building process should be generated at translator plugin build time. Whether this is a text file, a database, one per file, one per package, or one per build, is an implementation detail not necessary to be solidified at this time.
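
To make points 2 and 3 concrete, the selection step can be imagined as a pure lookup over metadata records generated at plugin build time. This is a sketch only: the record layout (`TranslatorRecord`), the `Conversion` alias, and `libraries_to_load` are assumptions made for illustration, and the storage format of the metadata is deliberately left open, as noted above. Overriding a production translator with a user-supplied one is not handled here.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical metadata record written when a translator plugin is built.
struct TranslatorRecord {
  std::string from;     // source representation
  std::string to;       // target representation
  std::string library;  // shared library that implements the translator
};

// A required conversion: (source representation, target representation).
using Conversion = std::pair<std::string, std::string>;

// Returns the libraries that must be loaded to satisfy `needed`.
// Only the metadata is inspected; no library is opened here.
std::set<std::string> libraries_to_load(std::vector<TranslatorRecord> const& manifest,
                                        std::set<Conversion> const& needed)
{
  std::set<std::string> result;
  for (auto const& rec : manifest) {
    if (needed.count(Conversion{rec.from, rec.to}) != 0) {
      result.insert(rec.library);
    }
  }
  return result;
}
```

The framework would derive `needed` from the configured graph and then load only the libraries returned, leaving every other discovered translator untouched.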

Further properties to be suggested/discussed in the comments below by end of business Friday 2026-05-08, and collected/summarized by @greenc-FNAL for discussion early the following week.

marcpaterno · 2026-04-20T19:19:48Z

marcpaterno
Apr 20, 2026
Maintainer

@wlav: An algorithm that is preceded by a translator node for its input, can be expected to have a translator node to its output.

Why is this to be expected? This seems to be an important assumption for what you wrote following this, but I do not understand why this should be expected.

6 replies

marcpaterno Apr 20, 2026
Maintainer

I think, but am not sure, that by "data products have some type of 'default' representation" you mean "data-product concepts have some default concrete data-product type" (all names referring to Figure 3.2 and the related text in the design doc).

If this is what you mean, then I do not think this needs to be the case. But before we go into further discussion, I want to make sure I understood you correctly.

wlav Apr 20, 2026
Maintainer

Yes, that is currently the implementation. I'd love to move away from that, but if we do, then both scenarios mentioned above will have more complicated lifetime management, as now the first is no longer transactional but open-ended, just like the second.

My current thinking was along the lines of adding translator nodes first, then scrapping them in an optimization pass after. Such an optimization pass can also adjust lifetimes. The opposite, constructing a more abstract graph and then adding translation nodes where needed, would have my vote if technically feasible.

wlav Apr 20, 2026
Maintainer

Aside, for the Numba implementation, I use a union type together with FFI. As long as results are returned on the stack, the implementation doesn't care what the interpretation is of the returned bits, it all works fine. It's only the templates and the graph construction that do and will refuse to execute if C++ types don't match.

knoepfel Apr 20, 2026
Maintainer

1. A concrete data product created by a C++ algorithm will have a "C++" representation; and a concrete data product created by a Python algorithm will have a "Python" representation. Any downstream node that wants to consume a concrete data product may require a different representation (i.e., thus requiring a translator). I suppose we could think of the concrete data-product representation at creation time as the "default" representation. But I don't believe we need to.
2. The lifetimes of data products are currently managed in the implementation via std::shared_ptr. The framework never explicitly calls reset on these pointers; the C++ STL simply destroys the managed objects once the use count becomes zero. I think the same thing can happen for translated data products, with the extra complication that some objects may need to be bound in their lifetimes (e.g., an underlying std::vector object needs to exist as long as the numpy array, which serves as the translated data product). But I think that just means we may need to have more kinds of "messages" that we pass around the data-flow graph. That's doable. To the extent possible, we should let other systems do the necessary caching (C++ STL, oneTBB flow-graph join nodes, etc.)
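
The lifetime binding described above (a std::vector that must outlive the array viewing it) can be expressed with the aliasing constructor of std::shared_ptr, which lets a pointer to the element storage co-own the vector. The sketch below is illustrative only: `ArrayView` and `make_view` are hypothetical stand-ins for a translated product, and no numpy or Phlex API is involved.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a translated product that views, rather than
// copies, the storage of a C++ product (as a numpy array might).
struct ArrayView {
  std::shared_ptr<double const> data;  // shares ownership with the vector
  std::size_t size;
};

// The aliasing constructor makes the returned pointer co-own `original`
// while pointing at its element storage, so the vector cannot be destroyed
// before the last view of it is gone.
ArrayView make_view(std::shared_ptr<std::vector<double> const> const& original)
{
  return ArrayView{std::shared_ptr<double const>(original, original->data()),
                   original->size()};
}
```

Nothing here calls reset on behalf of the framework: when the original pointer goes away, the view keeps the use count above zero, and the vector is destroyed only when the view is released as well.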

wlav Apr 20, 2026
Maintainer

To 1.: yes, ideally, we wouldn't need to think of a "default" representation, but the current template-based graph construction enforces it. By implication, it currently simplifies one part of the lifetime management by making the translator nodes transactional, which is what I described above. (I make explicit use of this to handle the Python ref-counting.)

To 2.: same thing here: because C++ is the "default" in the current implementation, it forces the use of shared_ptr inside the "Python" (or any other) representation. Hence the "lifeline" objects as a way of doing this without affecting the actual (user-visible) Python representation.

Yes, whether a "message" or something else, it's all about the translator node communicating to the framework what its needs are and lifetime being adjusted accordingly, e.g. by caching a shared_ptr for the duration.

brettviren · 2026-05-07T14:18:03Z

brettviren
May 7, 2026

From the point of view of integrating WCT, I see two main issues touching on "translators".

1. Wanted: eager dropping by Phlex of (some) data products.
2. Unwanted: combinatoric scaling of required translator types.

First is a Phlex problem. Second is mostly a DUNE problem but if Phlex can provide things to help it, that's certainly welcome.

Here's what I mean:

An example for eager dropping is ADC waveforms, produced either from DAQ file input or from WCT sim. ADC waveforms are a particular source of memory pressure while also having very specific consumers, and so are particularly ripe for some "eager drop" feature. Once ADCs are consumed, having Phlex keep them alive is an unwanted memory burden. There are perhaps data model factoring issues here: ADC waveform data would have fewer consumers than the metadata that comes with the waveforms.
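
As a rough illustration of what "eager drop" could mean, assume the number of consumers of a product is known when the graph is built. A holder could then give up its own reference when it hands the product to the last consumer. `EagerSlot` and `take` are hypothetical names, this is not an existing Phlex feature, and the sketch is not thread-safe.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical slot that releases its reference to a product as soon as the
// last declared consumer has received it. Not thread-safe.
template <typename T>
class EagerSlot {
public:
  EagerSlot(std::shared_ptr<T const> product, std::size_t consumers)
    : product_{std::move(product)}, remaining_{consumers}
  {}

  // Hands the product to one consumer. The last declared consumer receives
  // the slot's own reference, so the product is destroyed as soon as the
  // consumers are done with it. Further calls return nullptr.
  std::shared_ptr<T const> take()
  {
    if (remaining_ == 0) { return nullptr; }
    if (--remaining_ == 0) { return std::move(product_); }
    return product_;
  }

  bool holds_product() const { return product_ != nullptr; }

private:
  std::shared_ptr<T const> product_;
  std::size_t remaining_;
};
```

For ADC waveforms with a single consumer such as WCT sigproc, a slot constructed with a consumer count of one would stop holding the waveforms the moment sigproc receives them.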

Combinatoric scaling will occur if we must pair every relevant type of producer with every relevant type of consumer for a given conceptual data type. In the ADC waveform example there are two producers: DAQ HDF5 files and WCT sim. Human nature being as it is, each will produce a unique data type (if we do not force otherwise). There is essentially just one consumer, which is WCT sigproc. However, special-purpose consumers will exist, e.g. trigger-studies algorithms.

Assuming DAQ HDF5 files produce a data product called RAW and WCT sim produces a Frame, here are some of the types of Phlex graphs I can imagine we must express:

  HDF file -> source -> RAW -> translator -> Frame -> WCT sigproc
                                  WCT sim -> Frame -> WCT sigproc
                                  WCT sim -> Frame -> trigger studies
                        RAW -> trigger studies
                        RAW -> translator -> Frame -> trigger studies
                        RAW -> translator -> TRIGINPUT -> trigger studies
                        RAW -> translator -> Frame -> translator -> TRIGINPUT -> trigger studies

The type of a translator is defined by the types of its input/output.

RAW -> translator -> Frame
RAW -> translator -> TRIGINPUT
Frame -> translator -> TRIGINPUT

The picture explodes as we add more types of consumers (or producers).

It is largely up to the mythical DUNE data model to simplify this complexity by defining a reduced set of "data products" in the way LArSoft did. At best, this is the netpbm/pandoc solution of decomposing M-to-N into M-to-1-to-N translators. Implementing these as well factored, loosely coupled packages will mean many packages and carefully drawn dependency lines.
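
The difference between the two schemes is easy to put in numbers. The sketch below assumes the worst case, in which every producer and every consumer insists on its own distinct type; the function names are made up for illustration.

```cpp
#include <cstddef>

// Direct scheme: every producer type is paired with every consumer type.
constexpr std::size_t direct_translators(std::size_t producers, std::size_t consumers)
{
  return producers * consumers;
}

// M-to-1-to-N scheme: every producer type translates to one common type,
// and that common type translates to every consumer type.
constexpr std::size_t hub_translators(std::size_t producers, std::size_t consumers)
{
  return producers + consumers;
}

// For two producers and two consumers the schemes cost the same.
static_assert(direct_translators(2, 2) == 4);
static_assert(hub_translators(2, 2) == 4);

// The gap opens quickly as types are added.
static_assert(direct_translators(5, 8) == 40);
static_assert(hub_translators(5, 8) == 13);
```

The common type pays for itself only once there are more than a couple of types on each side, which is consistent with the point that the benefit depends on the data model defining a reduced set of data products.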

0 replies
