Document standardized provenance recording #442

notactuallyfinn · 2025-10-17T08:14:00Z

This PR adds an ADR that documents the current state of the concept for recording provenance information.
It also contains a diagram that visualizes the graph of the information that will be recorded in the future, according to the current state of discussion.
All contents of this PR are up for discussion.
See also #363.

zyzzyxdonta

I had a first look at this and I'm quite surprised how detailed this is 😄👍🏻

Some thoughts:

I'm not quite on board with HermesCache being considered an agent. The definition of an agent is "something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity." Further, wasAssociatedWith is "an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity". An example from the provided diagram (harvest step) would be: "codemeta.json was generated by a 'write' process which is associated with HermesCache. HermesCache has a responsibility for the 'write' process taking place. codemeta.json was attributed to HermesCache, thus HermesCache is responsible for the existence of codemeta.json". And that sounds untrue to me. Sure, without HermesCache, codemeta.json wouldn't exist. But that's due to the fact that codemeta.json is stored inside the HermesCache. I simply wouldn't consider HermesCache an agent because, to me, it has no real role.
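For illustration, the three statements from the harvest-step example can be written down as plain subject/relation/object triples. This is only a sketch: the `Statement` type and the `responsible_agents` helper are hypothetical and not part of hermes; only the PROV relation names and the identifiers come from the diagram under discussion.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One provenance statement (hypothetical helper type)."""
    subject: str
    relation: str
    obj: str


# The three statements from the harvest-step example.
statements = [
    Statement("codemeta.json", "prov:wasGeneratedBy", "write"),
    Statement("write", "prov:wasAssociatedWith", "HermesCache"),
    Statement("codemeta.json", "prov:wasAttributedTo", "HermesCache"),
]


def responsible_agents(entity: str, stmts: list[Statement]) -> set[str]:
    """Collect agents tied to an entity, directly or via its generating activity."""
    agents = {s.obj for s in stmts
              if s.subject == entity and s.relation == "prov:wasAttributedTo"}
    activities = {s.obj for s in stmts
                  if s.subject == entity and s.relation == "prov:wasGeneratedBy"}
    agents |= {s.obj for s in stmts
               if s.subject in activities and s.relation == "prov:wasAssociatedWith"}
    return agents


print(responsible_agents("codemeta.json", statements))
```

Both paths (attribution and association via the `write` activity) lead to the same agent, which is exactly the responsibility claim being questioned here.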

Apart from their prov types, how will Activities/Agents/Entities be modelled? E.g. the "load" activity with func, args, kwargs. Which vocabularies/ontologies will be used to describe these properties?

The diagram needs a legend. I figured out that agents are trapezoids, actions are rectangles with sharp corners, and entities are rectangles with rounded corners. But I don't know what solid vs dotted lines are. I also don't understand why everything related to the CFF plugin is blue/green. To signify that it works the same way as the other plugin? 😅 Maybe the diagram could be split into parts. One diagram per step?

docs/adr/0014-standardized-provenance-recording.md

zyzzyxdonta · 2025-10-24T12:04:48Z

Another idea for visualization: Maybe you could have data entities and activities running "in parallel" rather than alternating in a single line. Example without arrows:

--------------
|            |
|CITATION.cff|
|            |
--------------
                        --------------
                        |            |
                        |    load    |
                        |            |
                        --------------
--------------
|            |
|CITATION Py |
|            |
--------------
                        --------------
                        |            |
                        |    map     |
                        |            |
                        --------------
--------------
|            |
| software   |
| metadata   |
--------------
                        --------------
                        |            |
                        |   write    |
                        |            |
                        --------------

notactuallyfinn · 2025-10-27T09:05:48Z

Responding to this comment:

We/I have considered HermesCache to be an Agent (more specifically a SoftwareAgent) because it bears the responsibility to store a SoftwareMetadata object and load the file again. I think that codemeta.json wasAttributedTo HermesCache because HermesCache stores the SoftwareMetadata object in codemeta.json and the only difference between those two Entities (codemeta.json and the SoftwareMetadata object) is that one is stored in a file and the other is not. And who else but the HermesCache could be associated with the storing/loading?

All the provenance data will probably be stored as JSON-LD, therefore I guess that all other properties could be modelled with the schema and codemeta vocab.

Good idea to move the legend into the diagram. The legend is currently only in the Markdown file and does not include Activity, Agent and Entity. The grayed-out part means possible but not necessary (like a second harvest plugin: you would only need one, but it shows how another could be added). It would be possible to split the diagram, but not nice I think, as every plugin uses stuff from the one before, and as it is, it gives us a better overview of the whole graph.


sdruskat · 2025-11-07T13:44:16Z

We/I have considered HermesCache to be an Agent (more specifically a SoftwareAgent) because it bears the responsibility to store a SoftwareMetadata object and load the file again.

I agree. HermesCache here is the concrete instance of the class that has the write method, not the abstract idea of "we have a cache where stuff goes." Hence, I think this is a misunderstanding, as the statement "the fact that codemeta.json is stored inside the HermesCache" is false (HermesCache is the object at runtime, not the drive directory).

I think that codemeta.json wasAttributedTo HermesCache because HermesCache stores the SoftwareMetadata object in codemeta.json and the only difference between those two Entities (codemeta.json and the SoftwareMetadata object) is that one is stored in a file and the other is not. And who else but the HermesCache could be associated with the storing/loading?

I also agree here.

All the provenance data will probably be stored as JSON-LD, therefore I guess that all other properties could be modelled with the schema and codemeta vocab.

Yes, it will be stored as JSON-LD, using the PROV-JSONLD Serialization (schema | context). This should probably be in the ADR.
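As a rough idea of what such a record might look like, here is a sketch of the harvest-step statements as a JSON-LD document built in Python. The context URL and the key names (`entity`, `activity`, `agent`) are assumptions about the PROV-JSONLD serialization and should be checked against its schema and context; the `ex` namespace and all identifiers are made up for the example.

```python
import json

# Assumptions: the context URL and the key names ("entity", "activity", "agent")
# follow the PROV-JSONLD serialization; verify them against its schema.
# The "ex" namespace and all identifiers are invented for this example.
document = {
    "@context": [
        {"ex": "https://example.org/hermes/"},
        "https://openprovenance.org/prov-jsonld/context.jsonld",
    ],
    "@graph": [
        {"@type": "prov:Entity", "@id": "ex:codemeta.json"},
        {"@type": "prov:Activity", "@id": "ex:write"},
        {"@type": "prov:Agent", "@id": "ex:HermesCache"},
        {"@type": "prov:WasGeneratedBy",
         "entity": "ex:codemeta.json", "activity": "ex:write"},
        {"@type": "prov:WasAssociatedWith",
         "activity": "ex:write", "agent": "ex:HermesCache"},
        {"@type": "prov:WasAttributedTo",
         "entity": "ex:codemeta.json", "agent": "ex:HermesCache"},
    ],
}

serialized = json.dumps(document, indent=2)
print(serialized)
```

How the SoftwareAgent subtype would be expressed in this serialization is left open here, so the agent is typed only as a generic `prov:Agent`.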

sdruskat · 2025-11-07T13:47:24Z

Another idea for visualization: Maybe you could have data entities and activities running "in parallel" rather than alternating in a single line.

This is an example of what I think the swimlanes could look like (horizontal is perhaps not ideal though).

It'll be a looooong diagram, but perhaps it's worth it? 🤷
For user docs, we should use only snippets of this I guess (around the size of the example).

Here's a drawio XML file.

zyzzyxdonta · 2025-11-07T13:56:32Z

HermesCache here is the concrete instance of the class that has the write method, not the abstract idea of "we have a cache where stuff goes." Hence, I think this is a misunderstanding, as the statement "the fact that codemeta.json is stored inside the HermesCache" is false (HermesCache is the object at runtime, not the drive directory).

Yes, this was not clear to me. There is currently no such thing as HermesCache so I didn't recognize it as a class/object name. (Though the PascalCase could have given it away.) I only considered hermes, the plugins, and people as the agents. Having ~~a class~~ an object as an agent is way more detailed than I would have done this 😄

zyzzyxdonta · 2025-11-07T14:04:33Z

All the provenance data will probably be stored as JSON-LD, therefore I guess that all other properties could be modelled with the schema and codemeta vocab.

Sure. But neither models the concept of functions and parameters. The closest thing would be Action. Are you planning to use object for the function arguments?

(To model functions, args, kwargs, I mean)
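To make the question concrete, this is one possible shape, not a decision: a call is described as a schema.org Action, and each argument becomes a PropertyValue listed under the Action's `object` property. The `describe_call` helper, the use of positional indices as names, and the stand-in `load` function are all hypothetical.

```python
from typing import Any, Callable


def describe_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> dict:
    """Describe a call as a schema:Action whose objects are PropertyValues."""
    # Positional arguments are named by their index (an assumption).
    arguments = [
        {"@type": "schema:PropertyValue", "schema:name": str(i), "schema:value": repr(a)}
        for i, a in enumerate(args)
    ]
    # Keyword arguments keep their keyword as the name.
    arguments += [
        {"@type": "schema:PropertyValue", "schema:name": k, "schema:value": repr(v)}
        for k, v in kwargs.items()
    ]
    return {
        "@type": ["prov:Activity", "schema:Action"],
        "schema:name": f"{func.__module__}.{func.__qualname__}",
        "schema:object": arguments,
    }


def load(path: str, encoding: str = "utf-8") -> str:
    """Stand-in for a 'load' activity; not a hermes function."""
    return f"{path} ({encoding})"


record = describe_call(load, "CITATION.cff", encoding="utf-8")
print(record["schema:name"], len(record["schema:object"]))
```

Whether `object` is the right property for arguments, and whether `repr` is an acceptable value encoding, are exactly the open points raised here.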

sdruskat · 2025-11-21T08:26:09Z

Sure. But neither models the concept of functions and parameters. The closest thing would be Action. Are you planning to use object for the function arguments?

(To model functions, args, kwargs, I mean)

That's a good point, but I don't think we're yet clear on the actual schema vocabularies to use. FWIW, we have the same issue elsewhere, so perhaps this needs to be solved as part of #393.

sdruskat

As an ADR draft I think this is good to be merged (iff the decision is removed for now).
We can postpone the actual decision to before we get to the implementation, which is when we'll have a complete implementation of the plugin steps and merge strategy steps and will be in a better position to find alternatives, complete the picture, prototype options and make a decision.

Would you agree, @zyzzyxdonta?

docs/adr/0014-standardized-provenance-recording.md

sdruskat · 2025-11-21T08:38:17Z

docs/adr/0014-standardized-provenance-recording.md

+
+## Decision Outcome
+
+Chosen option: "Provide HERMES API-methods that also document themselves", because comes out best.


Suggested change

Chosen option: "Provide HERMES API-methods that also document themselves", because comes out best.

As this is in the proposal stage, I think we need more feedback on the actual solution. What are alternatives for this solution? This ADR is already a very good track record of our thinking so far, but I think we need more buy-in before making the actual decision.

That said, this isn't a blocker for the PR, it just says: we need a decision about this further down the line.

sdruskat · 2025-11-21T08:41:42Z

docs/adr/0014-standardized-provenance-recording.md

+
+* Good, because allows for recording of provenance information of the plugins
+* Good, because it isn't making plugin development harder
+* Bad, because API methods may not cover all I/O functionality python provides


True, but we can make a best effort to cover as many as make sense.
When we provide the respective API (load_file, make_request, etc.)
and make it usable enough so that developers don't see themselves forced to come up with their own solutions,
I think we can get good coverage.

Probably mostly a question of documentation, plugin templates, etc.
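A minimal sketch of what "API methods that also document themselves" could mean in practice, assuming a decorator-based approach. Only the name `load_file` is taken from the comment above; the decorator, the in-memory log, and the record layout are hypothetical.

```python
import functools
import tempfile
from datetime import datetime, timezone
from pathlib import Path

PROVENANCE_LOG: list[dict] = []  # hypothetical in-memory store


def records_provenance(func):
    """Wrap an API method so that each call leaves a provenance record."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "args": [str(a) for a in args],
            "kwargs": {k: str(v) for k, v in kwargs.items()},
            "startedAtTime": started,
            "endedAtTime": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@records_provenance
def load_file(path, encoding="utf-8"):
    """Read a text file; the name mirrors the API method mentioned above."""
    return Path(path).read_text(encoding=encoding)


# Usage: the plugin calls the API as usual, and the record is a side effect.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "CITATION.cff"
    source.write_text("cff-version: 1.2.0\n", encoding="utf-8")
    content = load_file(source)

print(content.strip(), PROVENANCE_LOG[0]["activity"])
```

The plugin author only calls `load_file`; recording happens without extra effort on their side, which is the "isn't making plugin development harder" argument. I/O done outside such wrappers would go unrecorded, which is the coverage concern.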

sdruskat · 2025-11-21T08:45:17Z

docs/adr/0014-standardized-provenance-recording.md

+* Good, because allows for recording of provenance information of the plugins
+* Good, because it isn't making plugin development harder
+* Bad, because API methods may not cover all I/O functionality python provides
+* Bad, because it doesn't cover merging, mapping, etc.  


This specific solution doesn't, but can we build provenance into mapping via the model API (if so, track in an issue/sub-issue to the provenance issue)?
For merging, if this is about merging full models during the hermes process step, I think we can easily also build this into the treatment classes, right? (Again, should be tracked in a respective issue/the over-arching issue for how we plan to do process and postprocess.)
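As a rough sketch of how merge provenance might be captured, assuming a simple "later source wins" strategy: the merge remembers, per property, which source the surviving value came from. The function name, the source identifiers, and the per-property use of `prov:wasDerivedFrom` are assumptions for illustration; this is not how the treatment classes work today.

```python
def merge_with_provenance(sources: dict[str, dict]) -> tuple[dict, list[dict]]:
    """Merge metadata dicts in order; later sources win, each key remembers its origin."""
    merged: dict = {}
    origin: dict[str, str] = {}
    for source_id, metadata in sources.items():
        for key, value in metadata.items():
            merged[key] = value
            origin[key] = source_id  # overwritten when a later source wins
    derivations = [
        {"property": key, "prov:wasDerivedFrom": source_id}
        for key, source_id in origin.items()
    ]
    return merged, derivations


merged, derivations = merge_with_provenance({
    "cff": {"name": "hermes", "version": "0.1"},
    "git": {"version": "0.2"},
})
print(merged)
print(derivations)
```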

docs/adr/0014-standardized-provenance-recording.md

sdruskat · 2025-11-21T08:49:58Z

docs/adr/hermes-prov-diagram/hermes-prov.svg

As discussed, I think this is a good enough basis for discussion right now. When someone finds some time, they can feel free to create a swimlane version to improve legibility.

zyzzyxdonta · 2025-11-21T10:50:50Z

That's a good point, but I don't think we're yet clear on the actual schema vocabularies to use.

That's exactly why I said this is way more detailed than I would have done it 😄

zyzzyxdonta · 2025-11-21T10:51:27Z

As an ADR draft I think this is good to be merged (iff the decision is removed for now). We can postpone the actual decision to before we get to the implementation, which is when we'll have a complete implementation of the plugin steps and merge strategy steps and will be in a better position to find alternatives, complete the picture, prototype options and make a decision.

Would you agree, @zyzzyxdonta?

Sure 👍🏻


notactuallyfinn · 2025-11-21T12:59:46Z

Could you add the license information (@sdruskat or @zyzzyxdonta)? I don't know what licenses should be used for the two files.

added adr record and graphic

11af43e

notactuallyfinn added the labels documentation (Improvements or additions to documentation), architecture (Describes some architectural decisions that need to be made), and data model (Related to the hermes data model) Oct 17, 2025

notactuallyfinn added 2 commits October 17, 2025 10:15

inserted PR link

5ee10eb

changed name of header

0777ec6

notactuallyfinn mentioned this pull request Oct 17, 2025

Unite Harvest- and Post-Process-Plugins #443

Open

sdruskat self-assigned this Oct 17, 2025

sdruskat self-requested a review October 17, 2025 10:53

sdruskat removed their assignment Oct 17, 2025

zyzzyxdonta reviewed Oct 24, 2025


docs/adr/0014-standardized-provenance-recording.md (outdated, resolved)

notactuallyfinn added 2 commits October 31, 2025 12:08

updated the diagram

04a273e

removed explicit line breaks and added information on non-prov proper…

f63ac59

…ties

sdruskat self-assigned this Nov 7, 2025

sdruskat approved these changes Nov 21, 2025


notactuallyfinn and others added 2 commits November 21, 2025 13:48

Added sdruskats suggestions

0c3f85f

Co-authored-by: Stephan Druskat <sdruskat@users.noreply.github.com>

removed the chosen option part

c963add

Add licenses for graphics

0253a3d

SKernchen merged commit 3a92f42 into develop Dec 12, 2025
5 checks passed

SKernchen deleted the refactor/363-record-provenance-documentation branch December 12, 2025 09:23

