Skip to content

Query hangs when using async_exec #583

@jordan-thoms

Description

@jordan-thoms

I've been debugging an issue where we have intermittent hangs of our ruby processes when running queries on CockroachDB using the ruby-pg driver. This is with with ruby-3.3.4, libpq-dev 15.8-0+deb12u1, ruby-pg 1.5.7, and Linux 6.8.0-1012-gcp however I have been able to reproduce the issue on other versions also.

We are able to reproduce the issue ~80% of the time with a simple query like this:

conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)

This will hang forever, tying up a ruby thread in our sidekiq workers until the worker is shutdown, even days later.

If I disable async_exec, the same query always works, i.e. this works 100% of the time:

PG::Connection.async_api = false
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)

The triggering of this is related to the content returned from the query - if I leave out certain columns, or remove the ORDER BY, then the issue doesn't show up.

Here's the gdb trace when the query is hung:

(gdb) where
#0  0x000070e98a319256 in __ppoll (fds=fds@entry=0x7fff46646eb8, nfds=nfds@entry=1, timeout=<optimized out>,
    sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1  0x000070e98a878390 in rb_thread_wait_for_single_fd (fd=6, events=1, timeout=timeout@entry=0x0)
    at /usr/src/ruby/hrtime.h:158
#2  0x000070e98a71dcd9 in rb_io_wait (timeout=4, events=3, io=<optimized out>)
    at ./include/ruby/internal/arithmetic/int.h:172
#3  rb_io_wait (io=<optimized out>, events=events@entry=3, timeout=timeout@entry=4) at io.c:1422
#4  0x000070e96ebab84a in wait_socket_readable (ptimeout=0x0, is_readable=<optimized out>, self=124147919560760)
    at pg_connection.c:2441
#5  pgconn_async_get_last_result (self=self@entry=124147919560760) at pg_connection.c:3139
#6  0x000070e96ebac939 in pgconn_async_exec (argc=1, argv=0x70e98a09c368, self=124147919560760) at pg_connection.c:3288
#7  0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=1,
    calling=<optimized out>, reg_cfp=0x70e98a19bbb0, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#8  vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bbb0, calling=<optimized out>)
    at /usr/src/ruby/vm_insnhelper.c:3530
#9  0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
    cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#10 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#11 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#12 0x000070e98a8c6e94 in eval_string_with_scope (line=9, file=<optimized out>, src=124147872936320,
    scope=124147909484040) at /usr/src/ruby/vm_eval.c:1761
#13 rb_f_eval (argc=<optimized out>, argv=<optimized out>, self=<optimized out>) at /usr/src/ruby/vm_eval.c:1804
#14 0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=4,
    calling=<optimized out>, reg_cfp=0x70e98a19bc20, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#15 vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bc20, calling=<optimized out>)
    at /usr/src/ruby/vm_insnhelper.c:3530
#16 0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
    cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#17 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#18 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#19 0x000070e98a8d7320 in invoke_block (captured=0x70e98a19bed8, opt_pc=<optimized out>, type=<optimized out>, cref=0x0

It seems the issue is that wait_socket_readable is getting called when there is nothing more to read from the socket, and as pgconn_async_get_last_result is passing NULL as the timeout value ( https://github.com/ged/ruby-pg/blob/master/ext/pg_connection.c#L3139 ), the process just waits forever. I compiled ruby-pg with this change to hardcode a wait_timeout, and afterwards I would see a hang of a few seconds when the issue was triggered, rather than an infinite one:

diff --git a/ext/pg_connection.c b/ext/pg_connection.c
index e4194a3..1d5e276 100644
--- a/ext/pg_connection.c
+++ b/ext/pg_connection.c
@@ -2438,7 +2438,7 @@ wait_socket_readable( VALUE self, struct timeval *ptimeout, void *(*is_readable)

                        socket_io = pgconn_socket_io(self);
                        /* Wait for the socket to become readable before checking again */
-                       ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), wait_timeout);
+                       ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), 5);
                } else {
                        ret = Qfalse;
                }

This issue seems related to #325 (comment) , and some investigation there suggested it was a result of the frame sizes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions