routing/localchans: fix nested db tx#5643
Conversation
routing/localchans/manager.go
Outdated
There was a problem hiding this comment.
I think edges should be initialized on first access. We can't be sure if ForAllOutgoingChannels is ever retried. Even though all of this is going to be ram cached soon.
There was a problem hiding this comment.
Hmm yes interesting. But it can't be fixed here I think? It seems that the deeper graph method needs a reset function passed in as well. In fact every graph method that accepts a callback needs a reset function?
There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.
I do want to re-emphasize my earlier comments btcsuite/btcwallet#757 (comment) and #5484 (comment). I still think that switching to the serializable locking model is a risk that doesn't need to be taken at this time.
There was a problem hiding this comment.
I'd just add a flag and check it? If false => create the array, set flag true.
Edit: yeah, that won't help, need a closure, it's pretty simple to add.
There was a problem hiding this comment.
There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.
Also re: accidentally stumbling: I think during the fully remote PR we discussed that we need to make sure all DB transactions are in their closed forms and retries are handled. It may have been forgotten somehow as there was too many other threads of dicussions. Maybe worth merging this PR with #5642 and doing these fixes there.
There was a problem hiding this comment.
@guggero what do you think? I am not sure if it is a good idea to leak details of the database locking model into the graph interface.
Is there an issue in the repo to check whether all transactions are in their closed forms? I had an idea that could help with that: modify the code to artificially always run the code inside each transaction twice. It is then as if everything needs a retry. Itests, bench, etc, should still work.
There was a problem hiding this comment.
Add a tx kvdb.RTx parameter to the FetchChannel method of the channel DB.
That is a great suggestion. The whole thing will be retried then with a fresh slice.
What I don't understand is: Why did this never lead to a deadlock with bbolt?
Good question. I will look into it.
There was a problem hiding this comment.
That the deadlock doesn't occur with bbolt almost makes me think that somehow two different locks are being used. As if there are two instances of something.
This doesn't seem to be the case.
There was a problem hiding this comment.
I am now looking at indeed the type of lock that postgres uses. Maybe you are right @guggero that it isn't exactly the same.
There was a problem hiding this comment.
Explanation why getting an RLock twice can lead to a deadlock: https://github.com/sasha-s/go-deadlock
There was a problem hiding this comment.
The same problem exists for bbolt. It just doesn't show up in the itest because timing is different. For a recursive read lock to go wrong, there needs to be a write that comes in between. By adding a sleep(5) inside the ForAllOutgoingChannels callback, the issue reproduces with bbolt on master.
|
Nice fix, I remember seeing this bug once before during the initial |
358befd to
241c32f
Compare
46de99b to
ffa7b04
Compare
bb4008c to
a5446f0
Compare
bhandras
left a comment
There was a problem hiding this comment.
LGTM, nice to have this fixed! 👍
One minor thing I'd like to suggest is to squash the two commits and separate the release note update as it's a bit easier to follow the changes that way.
I just separated them this morning :) I thought it would be easier for you to read if the non-functional refactor is on its own. |
|
Needs rebase. |
Adds an optional tx parameter to ForAllOutgoingChannels and FetchChannel so that data can be queried within the context of an existing database transaction.
This commit fixes a bug where a db tx is opened within another db tx.
a5446f0 to
bd07b5f
Compare
|
Rebased |
|
Going to merge this as both itest failures are known flakes. |
This PR fixes a bug where a db tx is opened within another db tx.