Skip to content

Running getinfo with postgresql db is very slow #6702

@sandipndev

Description

@sandipndev

Background

We have set up LND with a Postgres database, and it has a dedicated CPU with about 4GB of RAM requested and available for itself. The database is populated and it's not like it's not responding but the responses are extremely slow. In our cluster, we also have lndmon running for the metrics exporting but clearly, it has way too many "restarts" on kubernetes because it is failing and timing out:

NAME                           READY   STATUS    RESTARTS   AGE
bitcoind-0                     2/2     Running   0          2d21h
lnd1-0                         3/3     Running   7          2d22h
lnd1-lndmon-78997ccdf4-7plz9   1/1     Running   746        2d21h

We had opened a channel to this LND instance and because of this erratic behaviour of not being able to respond properly all the time, the channel had gotten force-closed as well.

Logs from LND:

$ kubectl logs lnd1-0 -c lnd | grep ERR | tail
2022-07-04 06:03:34.203 [ERR] DISC: unable to validate channel update short_chan_id=759840600204771329: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 06:12:06.942 [ERR] DISC: unable to validate channel update short_chan_id=797497774017609729: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 06:12:40.412 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
2022-07-04 06:20:44.384 [ERR] DISC: unable to validate channel update short_chan_id=779182109372448769: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 06:22:35.041 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
2022-07-04 06:27:20.468 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
2022-07-04 06:47:02.180 [ERR] DISC: unable to validate channel update short_chan_id=763321653983117313: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 07:01:22.169 [ERR] DISC: unable to validate channel update short_chan_id=799154738011242497: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 07:07:03.159 [ERR] DISC: unable to validate channel update short_chan_id=814529209106300929: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 07:09:51.814 [ERR] DISC: unable to validate channel update short_chan_id=758370553249398784: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))

Logs from lndmon:

$ k logs lnd1-lndmon-78997ccdf4-7plz9
2022-07-04 06:28:32.725 [INF] LNDMON: Starting Prometheus exporter...
2022-07-04 06:28:32.726 [INF] HTLC: Starting Htlc Monitor
2022-07-04 06:28:32.727 [INF] LNDMON: Prometheus active!
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:01:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:01:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:09:48 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:09:48 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:25:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:25:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)

Environment

  • Version: 0.15.0-beta commit=v0.15.0-beta
  • Using Docker Image: lightninglabs/lnd
  • Postgres running on GCP CloudSQL, which has maximum 25 concurrent connections, so in lnd config, it's setup like:
db.backend=postgres
db.postgres.dsn=<sensitive>
db.postgres.timeout=0
db.postgres.maxconnections=21

Expected behaviour

LND should at least be responding all the time.

Actual behaviour

There is some caching involved in the RPC I believe which is why after one getinfo call, there is about 5m time when requests don't time out and after that, the cycle repeats.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2should be fixed if one has timedatabaseRelated to the database/storage of LNDperformancepostgres

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions