@notactuallyfinn
Collaborator

This PR adds an ADR that documents the current state of the concept for recording provenance information.
It also contains a diagram that visualizes the graph of the information that will be recorded in the future, according to the current state of discussion.
All contents of this PR are up for discussion.
See also #363.

@notactuallyfinn added the documentation, architecture, and data model labels Oct 17, 2025
@sdruskat self-assigned this Oct 17, 2025
@sdruskat self-requested a review October 17, 2025 10:53
@sdruskat removed their assignment Oct 17, 2025
Contributor

@zyzzyxdonta zyzzyxdonta left a comment

I had a first look at this and I'm quite surprised how detailed this is 😄👍🏻

Some thoughts:

I'm not quite on board with HermesCache being considered an agent. The definition of an agent is "something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity." Further, wasAssociatedWith is "an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity". An example from the provided diagram (harvest step) would be: "codemeta.json was generated by a 'write' process which is associated with HermesCache. HermesCache has a responsibility for the 'write' process taking place. codemeta.json was attributed to HermesCache, thus HermesCache is responsible for the existence of codemeta.json". And that sounds untrue to me. Sure, without HermesCache, codemeta.json wouldn't exist. But that's due to the fact that codemeta.json is stored inside the HermesCache. I simply wouldn't consider HermesCache an agent because, to me, it has no real role.

Apart from their prov types, how will Activities/Agents/Entities be modelled? E.g. the "load" activity with func, args, kwargs. Which vocabularies/ontologies will be used to describe these properties?

The diagram needs a legend. I figured out that agents are trapezoids, actions are rectangles with sharp corners, and entities are rectangles with rounded corners. But I don't know what solid vs dotted lines are. I also don't understand why everything related to the CFF plugin is blue/green. To signify that it works the same way as the other plugin? 😅 Maybe the diagram could be split into parts. One diagram per step?

@zyzzyxdonta
Contributor

zyzzyxdonta commented Oct 24, 2025

Another idea for visualization: Maybe you could have data entities and activities running "in parallel" rather than alternating in a single line. Example without arrows:

--------------
|            |
|CITATION.cff|
|            |
--------------
                        --------------
                        |            |
                        |    load    |
                        |            |
                        --------------
--------------
|            |
|CITATION Py |
|            |
--------------
                        --------------
                        |            |
                        |    map     |
                        |            |
                        --------------
--------------
|            |
| software   |
| metadata   |
--------------
                        --------------
                        |            |
                        |   write    |
                        |            |
                        --------------

@notactuallyfinn
Collaborator Author

Responding to this comment:

We/I have considered HermesCache to be an Agent (more specifically a SoftwareAgent) because it bears the responsibility to store a SoftwareMetadata object and load the file again. I think that codemeta.json wasAttributedTo HermesCache because HermesCache stores the SoftwareMetadata object in codemeta.json, and the only difference between those two Entities (codemeta.json and the SoftwareMetadata object) is that one is stored in a file and the other is not. And who else but the HermesCache could be associated with the storing/loading?

All the provenance data will probably be stored as JSON-LD; therefore, I guess that all other properties could be modelled with the schema and codemeta vocab.

Good idea to move the legend into the diagram. The legend is currently only in the Markdown file and does not include Activity, Agent and Entity. The grayed-out part means possible but not necessary (like a second harvest plugin: you would only need one, but it shows how another could be added). It would be possible to split the diagram, but I don't think it would be nice, as every plugin uses stuff from the one before, and as it is, it gives us a better overview of the whole graph.

@sdruskat self-assigned this Nov 7, 2025
@sdruskat
Contributor

sdruskat commented Nov 7, 2025

We/I have considered HermesCache to be an Agent (more specifically a SoftwareAgent) because it bears the responsibility to store a SoftwareMetadata object and load the file again.

I agree. HermesCache here is the concrete instance of the class that has the write method, not the abstract idea of "we have a cache where stuff goes." Hence, I think this is a misunderstanding, as the statement "the fact that codemeta.json is stored inside the HermesCache" is false (HermesCache is the object at runtime, not the drive directory).

I think that codemeta.json wasAttributedTo HermesCache because HermesCache stores the SoftwareMetadata object in codemeta.json, and the only difference between those two Entities (codemeta.json and the SoftwareMetadata object) is that one is stored in a file and the other is not. And who else but the HermesCache could be associated with the storing/loading?

I also agree here.

All the provenance data will probably be stored as JSON-LD; therefore, I guess that all other properties could be modelled with the schema and codemeta vocab.

Yes, it will be stored as JSON-LD, using the PROV-JSONLD Serialization (schema | context). This should probably be in the ADR.
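
To make the relations discussed in this thread concrete, here is a minimal sketch of the harvest-step example (the SoftwareMetadata object, the write activity, HermesCache, and codemeta.json) built as a JSON-LD document in Python. It is an illustration under assumptions, not the ADR's format: it uses plain PROV-O terms under a `prov:` prefix rather than the PROV-JSONLD serialization's own context, and the `ex:` namespace, the identifiers, and the function name `harvest_write_provenance` are placeholders.

```python
import json

# Assumption: plain PROV-O terms in JSON-LD. The PROV-JSONLD serialization
# mentioned above defines its own context and may shape records differently.
PROV_NS = "http://www.w3.org/ns/prov#"


def harvest_write_provenance() -> dict:
    """Describe 'HermesCache wrote codemeta.json' with PROV-O terms."""
    return {
        "@context": {"prov": PROV_NS, "ex": "https://example.org/hermes/"},
        "@graph": [
            # The runtime object, modelled as a software agent.
            {"@id": "ex:HermesCache", "@type": "prov:SoftwareAgent"},
            # The in-memory metadata object that gets written out.
            {"@id": "ex:SoftwareMetadata", "@type": "prov:Entity"},
            {
                "@id": "ex:write",
                "@type": "prov:Activity",
                "prov:used": {"@id": "ex:SoftwareMetadata"},
                "prov:wasAssociatedWith": {"@id": "ex:HermesCache"},
            },
            {
                "@id": "ex:codemeta.json",
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": "ex:write"},
                "prov:wasAttributedTo": {"@id": "ex:HermesCache"},
                "prov:wasDerivedFrom": {"@id": "ex:SoftwareMetadata"},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(harvest_write_provenance(), indent=2))
```

Reading the graph back gives exactly the statements debated above: codemeta.json wasGeneratedBy the write activity, which wasAssociatedWith HermesCache, and codemeta.json wasAttributedTo HermesCache.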

@sdruskat
Contributor

sdruskat commented Nov 7, 2025

Another idea for visualization: Maybe you could have data entities and activities running "in parallel" rather than alternating in a single line.

This is an example of what I think the swimlanes could look like (horizontal is perhaps not ideal though).

[Image: swimlanes]

It'll be a looooong diagram, but perhaps it's worth it? 🤷
For user docs, we should use only snippets of this I guess (around the size of the example).

Here's a drawio XML file.

@zyzzyxdonta
Contributor

zyzzyxdonta commented Nov 7, 2025

HermesCache here is the concrete instance of the class that has the write method, not the abstract idea of "we have a cache where stuff goes." Hence, I think this is a misunderstanding, as the statement "the fact that codemeta.json is stored inside the HermesCache" is false (HermesCache is the object at runtime, not the drive directory).

Yes, this was not clear to me. There is currently no such thing as HermesCache, so I didn't recognize it as a class/object name. (Though the PascalCase could have given it away.) I only considered hermes, the plugins, and people as the agents. Having a class or an object as an agent is way more detailed than I would have done this 😄

@zyzzyxdonta
Contributor

zyzzyxdonta commented Nov 7, 2025

All the provenance data will probably be stored as JSON-LD; therefore, I guess that all other properties could be modelled with the schema and codemeta vocab.

Sure. But neither models the concept of functions and parameters. The closest thing would be Action. Are you planning to use object for the function arguments?

(To model functions, args, kwargs, I mean)
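
To illustrate the question, here is a rough sketch of what using `object` for function arguments might look like. It maps a call onto a schema.org `Action` and wraps each argument in a `PropertyValue`. This is only one conceivable mapping, not something decided in this thread: the helper name `call_as_schema_action`, the `arg0`, `arg1`, ... naming for positional arguments, and the use of `repr()` for values are all assumptions.

```python
from typing import Any


def call_as_schema_action(
    func_name: str, args: tuple, kwargs: dict[str, Any]
) -> dict:
    """Hypothetical mapping of a function call onto a schema.org Action."""
    # Positional arguments have no names, so this sketch invents arg0, arg1, ...
    positional = [
        {"@type": "PropertyValue", "name": f"arg{i}", "value": repr(a)}
        for i, a in enumerate(args)
    ]
    # Keyword arguments keep their own names.
    keyword = [
        {"@type": "PropertyValue", "name": k, "value": repr(v)}
        for k, v in kwargs.items()
    ]
    return {
        "@context": "https://schema.org",
        "@type": "Action",
        "name": func_name,
        "object": positional + keyword,
    }
```

The sketch also shows the gap raised here: nothing in `Action` distinguishes positional from keyword arguments, so that distinction has to be encoded by convention.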

@sdruskat
Contributor

Sure. But neither models the concept of functions and parameters. The closest thing would be Action. Are you planning to use object for the function arguments?

(To model functions, args, kwargs, I mean)

That's a good point, but I don't think we're yet clear on the actual schema vocabularies to use. FWIW, we have the same issue elsewhere, so perhaps this needs to be solved as part of #393.

Contributor

@sdruskat sdruskat left a comment

As an ADR draft I think this is good to be merged (iff the decision is removed for now).
We can postpone the actual decision to before we get to the implementation, which is when we'll have a complete implementation of the plugin steps and merge strategy steps and will be in a better position to find alternatives, complete the picture, prototype options and make a decision.

Would you agree, @zyzzyxdonta?


## Decision Outcome

Chosen option: "Provide HERMES API-methods that also document themselves", because comes out best.
Contributor

Suggested change
Chosen option: "Provide HERMES API-methods that also document themselves", because comes out best.

As this is in the proposal stage, I think we need more feedback on the actual solution. What are alternatives for this solution? This ADR is already a very good track record of our thinking so far, but I think we need more buy-in before making the actual decision.

That said, this isn't a blocker for the PR, it just says: we need a decision about this further down the line.


* Good, because allows for recording of provenance information of the plugins
* Good, because it isn't making plugin development harder
* Bad, because API methods may not cover all I/O functionality python provides
Contributor

True, but we can make a best effort to cover as many as make sense.
When we provide the respective API (load_file, make_request, etc.)
and make it usable enough so that developers don't see themselves forced to come up with their own solutions,
I think we can get good coverage.

Probably mostly a question of documentation, plugin templates, etc.
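
As a rough sketch of what "API methods that also document themselves" could mean in practice, the decorator below wraps an I/O helper so that every successful call also appends a record of the function, args, and kwargs. Everything here is hypothetical: the decorator name `records_provenance`, the in-memory `PROVENANCE_LOG`, the record keys, and the signature of `load_file` are assumptions for illustration, not the HERMES API.

```python
import functools
from pathlib import Path
from typing import Any, Callable

# Hypothetical in-memory log; the ADR does not say where records are kept.
PROVENANCE_LOG: list[dict[str, Any]] = []


def records_provenance(func: Callable) -> Callable:
    """Wrap an API method so each call also logs what was called and how."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)
        # Only successful calls are logged in this sketch.
        PROVENANCE_LOG.append(
            {
                "activity": func.__name__,
                "func": f"{func.__module__}.{func.__qualname__}",
                "args": [repr(a) for a in args],
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
            }
        )
        return result

    return wrapper


@records_provenance
def load_file(path: str, encoding: str = "utf-8") -> str:
    """Stand-in for the load_file API method named above; signature assumed."""
    return Path(path).read_text(encoding=encoding)
```

A plugin author would call `load_file(...)` exactly as they would call an undecorated helper, which is the point of the "isn't making plugin development harder" argument. A plugin that opens files directly bypasses the log, which is the coverage concern raised above.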

* Good, because allows for recording of provenance information of the plugins
* Good, because it isn't making plugin development harder
* Bad, because API methods may not cover all I/O functionality python provides
* Bad, because it doesn't cover merging, mapping, etc.
Contributor

This specific solution doesn't, but can we build provenance into mapping via the model API (if so, track in an issue/sub-issue to the provenance issue)?
For merging, if this is about merging full models during the hermes process step, I think we can easily also build this into the treatment classes, right? (Again, should be tracked in a respective issue/the over-arching issue for how we plan to do process and postprocess.)

Contributor

As discussed, I think this is a good enough basis for discussion right now. When someone finds some time, they can feel free to create a swimlane version to improve legibility.

@zyzzyxdonta
Contributor

That's a good point, but I don't think we're yet clear on the actual schema vocabularies to use.

That's exactly why I said this is way more detailed than I would have done it 😄

@zyzzyxdonta
Contributor

As an ADR draft I think this is good to be merged (iff the decision is removed for now). We can postpone the actual decision to before we get to the implementation, which is when we'll have a complete implementation of the plugin steps and merge strategy steps and will be in a better position to find alternatives, complete the picture, prototype options and make a decision.

Would you agree, @zyzzyxdonta?

Sure 👍🏻

notactuallyfinn and others added 2 commits November 21, 2025 13:48
Co-authored-by: Stephan Druskat <sdruskat@users.noreply.github.com>
@notactuallyfinn
Collaborator Author

Could you add the license information (@sdruskat or @zyzzyxdonta)? I don't know which licenses should be used for the two files.

@SKernchen merged commit 3a92f42 into develop Dec 12, 2025
5 checks passed
@SKernchen deleted the refactor/363-record-provenance-documentation branch December 12, 2025 09:23
Labels

architecture: Describes some architectural decisions that need to be made
data model: Related to the hermes data model
documentation: Improvements or additions to documentation

5 participants