Document standardized provenance recording #442
zyzzyxdonta
left a comment
I had a first look at this and I'm quite surprised how detailed this is 😄👍🏻
Some thoughts:
I'm not quite on board with HermesCache being considered an agent. The definition of an agent is "something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity." Further, wasAssociatedWith is "an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity". An example from the provided diagram (harvest step) would be: "codemeta.json was generated by a 'write' process which is associated with HermesCache. HermesCache has a responsibility for the 'write' process taking place. codemeta.json was attributed to HermesCache, thus HermesCache is responsible for the existence of codemeta.json". And that sounds untrue to me. Sure, without HermesCache, codemeta.json wouldn't exist. But that's due to the fact that codemeta.json is stored inside the HermesCache. I simply wouldn't consider HermesCache an agent because, to me, it has no real role.
Apart from their prov types, how will Activities/Agents/Entities be modelled? E.g. the "load" activity with func, args, kwargs. Which vocabularies/ontologies will be used to describe these properties?
The diagram needs a legend. I figured out that agents are trapezoids, actions are rectangles with sharp corners, and entities are rectangles with rounded corners. But I don't know what solid vs dotted lines are. I also don't understand why everything related to the CFF plugin is blue/green. To signify that it works the same way as the other plugin? 😅 Maybe the diagram could be split into parts. One diagram per step?
Another idea for visualization: Maybe you could have data entities and activities running "in parallel" rather than alternating in a single line. Example without arrows:
Responding to this comment: We/I have considered All the provenance data will probably be stored as JSON-LD, therefore I guess that all other properties could be modelled with the schema and codemeta vocab. Good idea to move the legend into the diagram. The legend is currently only in the Markdown file and does not include Activity, Agent and Entity. The grayed-out part means possible but not necessary (like a second harvest plugin: you would only need one, but it shows how another could be added). It would be possible to split the diagram, but not nice, I think, as every plugin uses stuff from the one before, and as it is, it gives us a better overview of the whole graph.
I agree.
I also agree here.
Yes, it will be stored as JSON-LD, using the PROV-JSONLD Serialization (schema | context). This should probably be in the ADR.
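To make the JSON-LD discussion concrete, here is a minimal sketch of the harvest-step example from the review ("codemeta.json was generated by a 'write' process which is associated with HermesCache"). It is an illustration only: it uses plain PROV-O terms under the W3C `prov` namespace rather than the PROV-JSONLD context the ADR would reference, the `urn:example:` identifiers are hypothetical, and it encodes the reading with HermesCache as an agent, which is exactly the point questioned above.

```python
import json

# W3C PROV namespace; using it directly as a prefix is an assumption,
# the ADR would point at the PROV-JSONLD context instead.
PROV = "http://www.w3.org/ns/prov#"


def build_harvest_provenance() -> dict:
    """Return a small JSON-LD graph for the 'write' step of the harvest example."""
    return {
        "@context": {"prov": PROV},
        "@graph": [
            # Hypothetical identifier for the disputed agent.
            {"@id": "urn:example:HermesCache", "@type": "prov:Agent"},
            {
                "@id": "urn:example:write",
                "@type": "prov:Activity",
                "prov:wasAssociatedWith": {"@id": "urn:example:HermesCache"},
            },
            {
                "@id": "urn:example:codemeta.json",
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": "urn:example:write"},
                "prov:wasAttributedTo": {"@id": "urn:example:HermesCache"},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_harvest_provenance(), indent=2))
```

Dropping the agent node and the two relations that point at it would leave only entity and activity, which is roughly what the objection above amounts to.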
This is an example of what I think the swimlanes could look like (horizontal is perhaps not ideal though).
It'll be a looooong diagram, but perhaps it's worth it? 🤷 Here's a drawio XML file.
Yes, this was not clear to me. There is currently no such thing as |
Sure. But neither models the concept of functions and parameters. The closest thing would be (To model functions, args, kwargs, I mean) |
That's a good point, but I don't think we're yet clear on the actual schema vocabularies to use. FWIW, we have the same issue elsewhere, so perhaps this needs to be solved as part of #393.
sdruskat
left a comment
As an ADR draft I think this is good to be merged (iff the decision is removed for now).
We can postpone the actual decision to before we get to the implementation, which is when we'll have a complete implementation of the plugin steps and merge strategy steps and will be in a better position to find alternatives, complete the picture, prototype options and make a decision.
Would you agree, @zyzzyxdonta?
> ## Decision Outcome
>
> Chosen option: "Provide HERMES API-methods that also document themselves", because comes out best.
As this is in the proposal stage, I think we need more feedback on the actual solution. What are alternatives for this solution? This ADR is already a very good track record of our thinking so far, but I think we need more buy-in before making the actual decision.
That said, this isn't a blocker for the PR, it just says: we need a decision about this further down the line.
> * Good, because allows for recording of provenance information of the plugins
> * Good, because it isn't making plugin development harder
> * Bad, because API methods may not cover all I/O functionality python provides
True, but we can make a best effort to cover as many as make sense.
When we provide the respective API (load_file, make_request, etc.)
and make it usable enough so that developers don't see themselves forced to come up with their own solutions,
I think we can get good coverage.
Probably mostly a question of documentation, plugin templates, etc.
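As a rough sketch of what "API methods that also document themselves" could mean in practice, the following wraps an I/O helper in a decorator that records the function name, args, and kwargs of each call. Everything here is hypothetical: `records_provenance`, `PROVENANCE_LOG`, and the signature of `load_file` are invented for illustration and are not the HERMES API, and where the records would actually be stored is undecided.

```python
import functools
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable

# Hypothetical in-memory store; the real storage location is undecided.
PROVENANCE_LOG: list[dict[str, Any]] = []


def records_provenance(activity: str) -> Callable:
    """Decorator factory: log one provenance record per call of the wrapped function."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            # Only reached if the call succeeded; failures are not recorded here.
            PROVENANCE_LOG.append(
                {
                    "activity": activity,
                    "func": func.__qualname__,
                    "args": [repr(a) for a in args],
                    "kwargs": {k: repr(v) for k, v in kwargs.items()},
                    "started_at": started,
                    "ended_at": datetime.now(timezone.utc).isoformat(),
                }
            )
            return result

        return wrapper

    return decorator


@records_provenance("load")
def load_file(path: str, encoding: str = "utf-8") -> str:
    """Hypothetical API method: read a text file and record that it was read."""
    return Path(path).read_text(encoding=encoding)
```

A plugin author would just call `load_file(...)` instead of `open(...)`, and the record appears as a side effect, which is why the coverage concern comes down to how many such helpers exist and how convenient they are.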
> * Bad, because it doesn't cover merging, mapping, etc.
This specific solution doesn't, but can we build provenance into mapping via the model API (if so, track in an issue/sub-issue to the provenance issue)?
For merging, if this is about merging full models during the hermes process step, I think we can easily also build this into the treatment classes, right? (Again, should be tracked in a respective issue/the over-arching issue for how we plan to do process and postprocess.)
As discussed, I think this is a good enough basis for discussion right now. When someone finds some time, they can feel free to create a swimlane version to improve legibility.
That's exactly why I said this is way more detailed than I would have done it 😄
Sure 👍🏻
Co-authored-by: Stephan Druskat <sdruskat@users.noreply.github.com>
Could you add the license information (@sdruskat or @zyzzyxdonta)? I don't know which licenses should be used for the two files.

This PR adds an ADR that documents the current state of the concept for recording provenance information.
It also contains a diagram that visualizes the graph of the information that will be recorded in the future, according to the current state of discussion.
All contents of this PR are up for discussion.
See also #363.