-
Notifications
You must be signed in to change notification settings - Fork 190
Description
I've been debugging an issue where we have intermittent hangs of our ruby processes when running queries on CockroachDB using the ruby-pg driver. This is with with ruby-3.3.4, libpq-dev 15.8-0+deb12u1, ruby-pg 1.5.7, and Linux 6.8.0-1012-gcp however I have been able to reproduce the issue on other versions also.
We are able to reproduce the issue ~80% of the time with a simple query like this:
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)
This will hang forever, tying up a ruby thread in our sidekiq workers until the worker is shutdown, even days later.
If I disable async_exec, the same query always works, i.e. this works 100% of the time:
PG::Connection.async_api = false
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)
The triggering of this is related to the content returned from the query - if I leave out certain columns, or remove the ORDER BY, then the issue doesn't show up.
Here's the gdb trace when the query is hung:
(gdb) where
#0 0x000070e98a319256 in __ppoll (fds=fds@entry=0x7fff46646eb8, nfds=nfds@entry=1, timeout=<optimized out>,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1 0x000070e98a878390 in rb_thread_wait_for_single_fd (fd=6, events=1, timeout=timeout@entry=0x0)
at /usr/src/ruby/hrtime.h:158
#2 0x000070e98a71dcd9 in rb_io_wait (timeout=4, events=3, io=<optimized out>)
at ./include/ruby/internal/arithmetic/int.h:172
#3 rb_io_wait (io=<optimized out>, events=events@entry=3, timeout=timeout@entry=4) at io.c:1422
#4 0x000070e96ebab84a in wait_socket_readable (ptimeout=0x0, is_readable=<optimized out>, self=124147919560760)
at pg_connection.c:2441
#5 pgconn_async_get_last_result (self=self@entry=124147919560760) at pg_connection.c:3139
#6 0x000070e96ebac939 in pgconn_async_exec (argc=1, argv=0x70e98a09c368, self=124147919560760) at pg_connection.c:3288
#7 0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=1,
calling=<optimized out>, reg_cfp=0x70e98a19bbb0, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#8 vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bbb0, calling=<optimized out>)
at /usr/src/ruby/vm_insnhelper.c:3530
#9 0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#10 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#11 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#12 0x000070e98a8c6e94 in eval_string_with_scope (line=9, file=<optimized out>, src=124147872936320,
scope=124147909484040) at /usr/src/ruby/vm_eval.c:1761
#13 rb_f_eval (argc=<optimized out>, argv=<optimized out>, self=<optimized out>) at /usr/src/ruby/vm_eval.c:1804
#14 0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=4,
calling=<optimized out>, reg_cfp=0x70e98a19bc20, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#15 vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bc20, calling=<optimized out>)
at /usr/src/ruby/vm_insnhelper.c:3530
#16 0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#17 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#18 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#19 0x000070e98a8d7320 in invoke_block (captured=0x70e98a19bed8, opt_pc=<optimized out>, type=<optimized out>, cref=0x0
It seems the issue is that wait_socket_readable is getting called when there is nothing more to read from the socket, and as pgconn_async_get_last_result is passing NULL as the timeout value ( https://github.com/ged/ruby-pg/blob/master/ext/pg_connection.c#L3139 ), the process just waits forever. I compiled ruby-pg with this change to hardcode a wait_timeout, and afterwards I would see a hang of a few seconds when the issue was triggered, rather than an infinite one:
diff --git a/ext/pg_connection.c b/ext/pg_connection.c
index e4194a3..1d5e276 100644
--- a/ext/pg_connection.c
+++ b/ext/pg_connection.c
@@ -2438,7 +2438,7 @@ wait_socket_readable( VALUE self, struct timeval *ptimeout, void *(*is_readable)
socket_io = pgconn_socket_io(self);
/* Wait for the socket to become readable before checking again */
- ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), wait_timeout);
+ ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), 5);
} else {
ret = Qfalse;
}
This issue seems related to #325 (comment) , and some investigation there suggested it was a result of the frame sizes.