Description
We've run into some odd issues where a socket will hang and never time out, close, or attempt to reconnect. Part of the issue in our particular case appears to be related to intermittent network instability. What we believe is happening goes something like this:
- Network connection is lost briefly, causing a disconnect
- Slipstream/Mint attempts to reconnect
- Slipstream opens a connection to the remote host and gets an ok back from Mint
- Some time between now and after the following step, the network instability causes Slipstream to lose its connection to the remote host. ICMP packet loss may accompany this, which would keep the container from learning that the socket should be closed.
- At this point, Slipstream still thinks that it has a connection, in part, because it's not checking for the true connected state on the socket. Instead, the GenServer is listening for subsequent messages.
- Then:
  - `Slipstream.Connection.Pipeline.handle_message(%{message: :connect, ...})`
  - `Slipstream.Connection.Impl.websocket_upgrade/2`
  - `Mint.WebSocket.upgrade/5`, which finally returns an ok because there was (seemingly) no error sending the request and it is an async pattern.
- This is where our fate is sealed.
  - `Mint.WebSocket` expects you to read the stream for any messages to see whether the websocket upgrade handshake has succeeded.
  - `Slipstream.Connection.Pipeline` is now waiting for a message containing the server's response to the upgrade request before it calls `Mint.WebSocket.new/4` to create the new websocket object.
  - The upgrade response never comes, and `Slipstream.Connection.Pipeline` remains stagnant, never timing out or attempting to reconnect.
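
The sequence above can be sketched with plain `:gen_tcp`, without Slipstream or Mint involved. This is only an illustration of the failure mode, not how either library is implemented: the `SilentPeer` module and its function names are hypothetical. The peer accepts the TCP connection and then stays silent, so sending the upgrade request succeeds, and only a read with a deadline reveals that no handshake response is coming.

```elixir
defmodule SilentPeer do
  # Hypothetical helper: listens on an OS-assigned port, accepts one
  # connection, and never writes a byte back to it.
  def start do
    {:ok, listener} = :gen_tcp.listen(0, [:binary, active: false, reuseaddr: true])
    {:ok, port} = :inet.port(listener)

    spawn(fn ->
      {:ok, _client} = :gen_tcp.accept(listener)
      # Keep the accepted socket open but silent.
      Process.sleep(:infinity)
    end)

    port
  end

  # Sends a WebSocket upgrade request, then waits up to timeout_ms for any reply.
  def attempt_upgrade(port, timeout_ms) do
    {:ok, socket} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false])

    request =
      "GET /socket/websocket HTTP/1.1\r\n" <>
        "Host: localhost\r\n" <>
        "Connection: Upgrade\r\n" <>
        "Upgrade: websocket\r\n" <>
        "Sec-WebSocket-Version: 13\r\n" <>
        "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n\r\n"

    # The send succeeds even though nobody will ever answer.
    send_result = :gen_tcp.send(socket, request)
    # Without the timeout argument this call would block indefinitely.
    recv_result = :gen_tcp.recv(socket, 0, timeout_ms)
    :gen_tcp.close(socket)

    {send_result, recv_result}
  end
end

port = SilentPeer.start()
IO.inspect(SilentPeer.attempt_upgrade(port, 500))
# prints {:ok, {:error, :timeout}}
```

The `:ok` from the send corresponds to the ok returned after the upgrade request is written, and the `{:error, :timeout}` is the signal that appears to be missing while the pipeline waits for the upgrade response.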
I may have a few details muddled but this is a sketch of how things appear to be going.
This can be replicated by simply spinning up a dumb listening socket and attempting to create a new connection to it. With no WebSocket upgrade response provided, Slipstream will simply hang "forever".
For the purposes of testing, I simply use netcat for the listener like so: `nc -lk 4000`
`test_mode` must also be turned off to accurately test this behaviour.
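
A minimal client for trying this against the netcat listener might look like the script below. It is a sketch under assumptions, not a confirmed reproduction: the `HangRepro.Client` module name, the `"~> 1.0"` version requirement, and the `/socket/websocket` path are all illustrative, and test mode is simply left unconfigured.

```elixir
# Assumption: the version requirement is illustrative, pin whatever you run in production.
Mix.install([{:slipstream, "~> 1.0"}])

defmodule HangRepro.Client do
  # Hypothetical module name: a bare-bones Slipstream client that joins no channels.
  use Slipstream

  def start_link(config), do: Slipstream.start_link(__MODULE__, config)

  @impl Slipstream
  def init(config) do
    IO.puts("connecting to #{config[:uri]}")
    {:ok, connect!(config)}
  end

  @impl Slipstream
  def handle_connect(socket) do
    # Should never print when the listener stays silent.
    IO.puts("connected")
    {:ok, socket}
  end

  @impl Slipstream
  def handle_disconnect(reason, socket) do
    # A timeout or failed handshake would be expected to surface here.
    IO.puts("disconnected: #{inspect(reason)}")
    {:ok, socket}
  end
end

# Start `nc -lk 4000` in another terminal before running this script.
{:ok, _pid} = HangRepro.Client.start_link(uri: "ws://localhost:4000/socket/websocket")
Process.sleep(30_000)
IO.puts("30 seconds elapsed")
```

If the behaviour described here holds, the upgrade request should show up in the netcat terminal while neither callback prints before the final line.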
I'm happy to write up a thin repo that reproduces this issue as soon as I have the time in the next couple of weeks, but I figured I'd raise the issue while it's still fresh. That said, the problem itself can be reproduced with the basic connection functionality, without any additional configuration.