Don't fail markAsComplete if there is already a failure relationship #282
What does this change?
Giant gets into an infinite loop when it tries to markAsComplete a blob in this situation. We have hundreds of errors like this in the logs:
Failed to mark 'URI' processed by 'OcrMyPdfImageExtractor' as complete: Unexpected number of creates/deletes in markAsComplete. Created: 0. Deleted: 1
In this scenario, when markAsComplete returns a Left, the whole transaction gets rolled back, the TODO remains, and giant tries to run the extractor again. Sad times. This PR gets giant to just WARN instead of rolling back the whole transaction.
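As a sketch of the behaviour change (names and types here are assumed for illustration, not Giant's actual signatures), the counter check stops failing the transaction and instead logs a warning:

```scala
// Hypothetical sketch only: MarkResult and this markAsComplete shape are
// assumptions, not the real Giant code.
sealed trait MarkResult
case object Completed extends MarkResult
final case class Failure(msg: String) extends MarkResult

def markAsComplete(created: Int, deleted: Int): MarkResult =
  if (created != 1 || deleted != 1) {
    // Before this PR: return a failure (a Left in the real code), which
    // rolls back the transaction and leaves the TODO dangling.
    // After: warn and treat the completion as successful, so the duplicate
    // TODO still gets deleted and the retry loop is broken.
    println(s"WARN: unexpected creates/deletes in markAsComplete. Created: $created. Deleted: $deleted")
    Completed
  } else Completed
```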
Leaving this in draft, as I think some team discussion and further digging are required to understand how we got into this state.
How I think this happened
When we ingest a file, the InsertBlob operation in the neo4j manifest gets called. This will create or update the blob for the file. If the blob already exists, it checks for existing PROCESSED relations: if there is a PROCESSED relation with the same workspace properties and ingest, then no new TODO is created; otherwise it creates a new TODO.
giant/backend/app/services/manifest/Neo4jManifest.scala, lines 353 to 370 in be75f73
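A toy model of that dedup check (the shape and field names are assumed; the real logic is a Cypher query in Neo4jManifest.scala): a TODO is only created when no existing PROCESSED relation matches both the ingest and the workspace properties.

```scala
// Toy model of the InsertBlob dedup check; names are assumptions.
final case class Processed(ingestUri: String, workspace: Option[String])

// Create a new TODO only if no PROCESSED relation matches this ingest
// and workspace combination.
def shouldCreateTodo(existing: Seq[Processed], ingestUri: String, workspace: Option[String]): Boolean =
  !existing.exists(p => p.ingestUri == ingestUri && p.workspace == workspace)
```

Under this model, re-ingesting the same blob with a different ingest URI always passes the check, so a second TODO is created.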
So, if you upload a hundred files via the workspace UI, you'll end up with a hundred new blobs with TODO relations to the relevant extractors. If you then kick off a CLI ingest (which will invariably have a different ingestion uri, and obviously no workspace properties), those blobs will all get a second TODO relation to the relevant extractors.
One of the TODOs will get picked up first - let's call it todo1. When it completes, this function gets called. As I understand it, it will delete todo1, and add a PROCESSED relation tagged with various things.
Then the next, todo2, will get picked up and completed. When it hits the above function, it will find todo2 and delete it, but rather than creating a PROCESSED relation it will just change the properties of the existing PROCESSED relation. Then the check

if(counters.relationshipsCreated() != 1 || counters.relationshipsDeleted() != 1)

gets run, fails, and the whole transaction gets rolled back. This leaves a hanging TODO, which will be picked up again in future by fetchWork, so long as the attempts property of the relation is not above max attempts. fetchWork is responsible for updating attempts, but in the example I looked at, attempts was still at 0 on the TODO. TODO: more research to understand this.
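The counter mismatch can be illustrated with a toy simulation (assumed semantics: completing a TODO deletes it and MERGEs the PROCESSED relation, so a second completion only updates the existing relation rather than creating one):

```scala
import scala.collection.mutable

// Toy in-memory model, not Giant's real storage.
val todos = mutable.Set("todo1", "todo2") // two TODOs for the same blob/extractor
var processedExists = false               // the shared PROCESSED relation

// Returns (relationshipsCreated, relationshipsDeleted), mimicking the
// neo4j counters the real check inspects.
def complete(todo: String): (Int, Int) = {
  val deleted = if (todos.remove(todo)) 1 else 0
  val created =
    if (!processedExists) { processedExists = true; 1 }
    else 0 // MERGE finds the existing PROCESSED and only updates its properties
  (created, deleted)
}

val first = complete("todo1")  // (1, 1): check passes
val second = complete("todo2") // (0, 1): created != 1, so the real code
                               // rolled the whole transaction back
```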
How to test