Query hangs when using async_exec

I've been debugging an issue where we have intermittent hangs of our ruby processes when running queries on CockroachDB using the ruby-pg driver. This is with with ruby-3.3.4, libpq-dev 15.8-0+deb12u1,  ruby-pg 1.5.7, and Linux 6.8.0-1012-gcp however I have been able to reproduce the issue on other versions also. 

We are able to reproduce the issue ~80% of the time with a simple query like this:
```
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)
```
This will hang forever, tying up a ruby thread in our sidekiq workers until the worker is shutdown, even days later.

If I disable async_exec, the same query always works, i.e. this works 100% of the time:
```
PG::Connection.async_api = false
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)
```

The triggering of this is related to the content returned from the query - if I leave out certain columns, or remove the ORDER BY, then the issue doesn't show up.

Here's the gdb trace when the query is hung:
```
(gdb) where
#0  0x000070e98a319256 in __ppoll (fds=fds@entry=0x7fff46646eb8, nfds=nfds@entry=1, timeout=<optimized out>,
    sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1  0x000070e98a878390 in rb_thread_wait_for_single_fd (fd=6, events=1, timeout=timeout@entry=0x0)
    at /usr/src/ruby/hrtime.h:158
#2  0x000070e98a71dcd9 in rb_io_wait (timeout=4, events=3, io=<optimized out>)
    at ./include/ruby/internal/arithmetic/int.h:172
#3  rb_io_wait (io=<optimized out>, events=events@entry=3, timeout=timeout@entry=4) at io.c:1422
#4  0x000070e96ebab84a in wait_socket_readable (ptimeout=0x0, is_readable=<optimized out>, self=124147919560760)
    at pg_connection.c:2441
#5  pgconn_async_get_last_result (self=self@entry=124147919560760) at pg_connection.c:3139
#6  0x000070e96ebac939 in pgconn_async_exec (argc=1, argv=0x70e98a09c368, self=124147919560760) at pg_connection.c:3288
#7  0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=1,
    calling=<optimized out>, reg_cfp=0x70e98a19bbb0, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#8  vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bbb0, calling=<optimized out>)
    at /usr/src/ruby/vm_insnhelper.c:3530
#9  0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
    cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#10 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#11 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#12 0x000070e98a8c6e94 in eval_string_with_scope (line=9, file=<optimized out>, src=124147872936320,
    scope=124147909484040) at /usr/src/ruby/vm_eval.c:1761
#13 rb_f_eval (argc=<optimized out>, argv=<optimized out>, self=<optimized out>) at /usr/src/ruby/vm_eval.c:1804
#14 0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=4,
    calling=<optimized out>, reg_cfp=0x70e98a19bc20, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#15 vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bc20, calling=<optimized out>)
    at /usr/src/ruby/vm_insnhelper.c:3530
#16 0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
    cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#17 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#18 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#19 0x000070e98a8d7320 in invoke_block (captured=0x70e98a19bed8, opt_pc=<optimized out>, type=<optimized out>, cref=0x0
```
It seems the issue is that wait_socket_readable is getting called when there is nothing more to read from the socket, and as pgconn_async_get_last_result is passing NULL as the timeout value ( https://github.com/ged/ruby-pg/blob/master/ext/pg_connection.c#L3139 ), the process just waits forever. I compiled ruby-pg with this change to hardcode a wait_timeout, and afterwards I would see a hang of a few seconds when the issue was triggered, rather than an infinite one: 

```
diff --git a/ext/pg_connection.c b/ext/pg_connection.c
index e4194a3..1d5e276 100644
--- a/ext/pg_connection.c
+++ b/ext/pg_connection.c
@@ -2438,7 +2438,7 @@ wait_socket_readable( VALUE self, struct timeval *ptimeout, void *(*is_readable)

                        socket_io = pgconn_socket_io(self);
                        /* Wait for the socket to become readable before checking again */
-                       ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), wait_timeout);
+                       ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), 5);
                } else {
                        ret = Qfalse;
                }
```


This issue seems related to https://github.com/ged/ruby-pg/issues/325#issuecomment-737561270 , and some investigation there suggested it was a result of the frame sizes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Query hangs when using async_exec #583

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Query hangs when using async_exec #583

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions