Skip to content

routing/localchans: fix nested db tx#5643

Merged
guggero merged 2 commits intolightningnetwork:masterfrom
bottlepay:updatepolicy-nested-tx
Aug 24, 2021
Merged

routing/localchans: fix nested db tx#5643
guggero merged 2 commits intolightningnetwork:masterfrom
bottlepay:updatepolicy-nested-tx

Conversation

@joostjager
Copy link
Contributor

This PR fixes a bug where a db tx is opened within another db tx.

@joostjager joostjager mentioned this pull request Aug 19, 2021
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think edges should be initialized on first access. We can't be sure if ForAllOutgoingChannels is ever retried. Even though all of this is going to be ram cached soon.

Copy link
Contributor Author

@joostjager joostjager Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yes interesting. But it can't be fixed here I think? It seems that the deeper graph method needs a reset function passed in as well. In fact every graph method that accepts a callback needs a reset function?

There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.

I do want to re-emphasize my earlier comments btcsuite/btcwallet#757 (comment) and #5484 (comment). I still think that switching to the serializable locking model is a risk that doesn't need to be taken at this time.

Copy link
Collaborator

@bhandras bhandras Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just add a flag and check it? If false => create the array, set flag true.

Edit: yeah, that won't help, need a closure, it's pretty simple to add.

Copy link
Collaborator

@bhandras bhandras Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be non-graph db methods too with a callback. We are accidentally stumbling into this, but perhaps there is more unsafe usage of external variables. The pattern to store items in an external list is quite common.

Also re: accidentally stumbling: I think during the fully remote PR we discussed that we need to make sure all DB transactions are in their closed forms and retries are handled. It may have been forgotten somehow as there was too many other threads of dicussions. Maybe worth merging this PR with #5642 and doing these fixes there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guggero what do you think? I am not sure if it is a good idea to leak details of the database locking model into the graph interface.

Is there an issue in the repo to check whether all transactions are in their closed forms? I had an idea that could help with that: modify the code to artificially always run the code inside each transaction twice. It is then as if everything needs a retry. Itests, bench, etc, should still work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a tx kvdb.RTx parameter to the FetchChannel method of the channel DB.

That is a great suggestion. The whole thing will be retried then with a fresh slice.

What I don't understand is: Why did this never lead to a deadlock with bbolt?

Good question. I will look into it.

Copy link
Contributor Author

@joostjager joostjager Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the deadlock doesn't occur with bbolt almost makes me think that somehow two different locks are being used. As if there are two instances of something.

This doesn't seem to be the case.

Copy link
Contributor Author

@joostjager joostjager Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now looking at indeed the type of lock that postgres uses. Maybe you are right @guggero that it isn't exactly the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation why getting an RLock twice can lead to a deadlock: https://github.com/sasha-s/go-deadlock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same problem exists for bbolt. It just doesn't show up in the itest because timing is different. For a recursive read lock to go wrong, there needs to be a write that comes in between. By adding a sleep(5) inside the ForAllOutgoingChannels callback, the issue reproduces with bbolt on master.

@bhandras
Copy link
Collaborator

bhandras commented Aug 19, 2021

Nice fix, I remember seeing this bug once before during the initial etcd port but totally forgot about it since we were using mixed local/remote mode.

@joostjager joostjager force-pushed the updatepolicy-nested-tx branch from 358befd to 241c32f Compare August 19, 2021 16:26
@joostjager joostjager mentioned this pull request Aug 20, 2021
3 tasks
@joostjager joostjager force-pushed the updatepolicy-nested-tx branch 3 times, most recently from 46de99b to ffa7b04 Compare August 23, 2021 10:19
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice investigation and fix! We'll need to think of a way to detect and fix other such cases, going to track this in #5557.

This particular case looks good to me 💯

@joostjager joostjager force-pushed the updatepolicy-nested-tx branch 3 times, most recently from bb4008c to a5446f0 Compare August 23, 2021 11:43
Copy link
Collaborator

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice to have this fixed! 👍

One minor thing I'd like to suggest is to squash the two commits and separate the release note update as it's a bit easier to follow the changes that way.

@joostjager
Copy link
Contributor Author

joostjager commented Aug 23, 2021

squash the two commits

I just separated them this morning :)

I thought it would be easier for you to read if the non-functional refactor is on its own.

@bhandras
Copy link
Collaborator

bhandras commented Aug 24, 2021

Needs rebase.

Adds an optional tx parameter to ForAllOutgoingChannels and FetchChannel
so that data can be queried within the context of an existing database
transaction.
This commit fixes a bug where a db tx is opened within another db tx.
@joostjager joostjager force-pushed the updatepolicy-nested-tx branch from a5446f0 to bd07b5f Compare August 24, 2021 11:43
@joostjager
Copy link
Contributor Author

Rebased

@guggero
Copy link
Collaborator

guggero commented Aug 24, 2021

Going to merge this as both itest failures are known flakes.

@guggero guggero merged commit 93d12cd into lightningnetwork:master Aug 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants