Connection hangs, does not reconnect with unsuccessful Websocket handshake #75

@fendent

We've run into some odd issues where a socket will hang and never time out, close, or attempt to reconnect. In our particular case, part of the issue appears to be related to intermittent network instability. What we believe is happening goes something like this:

  • Network connection is lost briefly, causing a disconnect
  • Slipstream/Mint attempts to reconnect
  • Slipstream opens a connection to the remote host and gets an ok back from Mint
  • At some point between this step and the end of the next one, the network instability causes Slipstream to lose its connection to the remote host. ICMP packet loss may be mixed in with this as well, keeping the container from learning that the socket should be closed.
  • At this point, Slipstream still thinks that it has a connection, in part because it is not checking the true connected state of the socket. Instead, the GenServer is listening for subsequent messages.
  • Then:
    • Slipstream.Connection.Pipeline.handle_message(%{message: :connect, ...})
    • Slipstream.Connection.Impl.websocket_upgrade/2
    • Mint.WebSocket.upgrade/5
    • which finally returns an ok because there was (seemingly) no error sending the request and the call is asynchronous.
  • This is where our fate is sealed.
  • Mint.WebSocket expects you to read the stream for any messages to see if the successful websocket upgrade handshake has taken place
  • Slipstream.Connection.Pipeline is now waiting for a message containing the server's response to the upgrade request before it calls Mint.WebSocket.new/4 to create the new websocket object.
  • The upgrade response never comes and the Slipstream.Connection.Pipeline remains stagnant, never timing out or attempting to reconnect.
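
To make the missing piece concrete, here is a minimal, library-neutral sketch in Python of a deadline on the upgrade response. It is not Slipstream or Mint code. The class name UpgradeDeadline and its methods are hypothetical, and the idea of bounding the wait with a timeout is my assumption about what a fix could look like, not something the library does today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpgradeDeadline:
    """Tracks how long we have been waiting for the upgrade response.

    Hypothetical helper for illustration only; not part of Slipstream or Mint.
    """

    timeout_s: float
    sent_at: Optional[float] = None  # None means no upgrade is pending

    def upgrade_sent(self, now: float) -> None:
        # Called right after the upgrade request was written successfully.
        self.sent_at = now

    def response_received(self) -> None:
        # Called when the server's handshake response arrives.
        self.sent_at = None

    def expired(self, now: float) -> bool:
        # True when an upgrade is pending and has waited at least timeout_s.
        return self.sent_at is not None and now - self.sent_at >= self.timeout_s
```

A connection process would call upgrade_sent with a monotonic clock reading after the request goes out, then check expired on a timer tick and close and reconnect when it returns true. Without some check of this kind, a pending upgrade that never gets a response has nothing to move it out of the waiting state.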

I may have a few details muddled but this is a sketch of how things appear to be going.

This can be replicated by simply spinning up a dumb listening socket and attempting to create a new connection to it. With no WebSocket upgrade response provided, Slipstream will simply hang "forever".

For the purposes of testing, I simply use netcat for the listener like so: nc -lk 4000

test_mode must also be turned off to accurately test this behaviour.
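
For anyone who wants to see the underlying socket behaviour without Slipstream in the picture, the following is a rough Python stand-in for the netcat setup. It is only a sketch: the function names are made up for illustration, the request path /socket/websocket and the sample Sec-WebSocket-Key are arbitrary choices, and the listener binds to an OS-assigned port rather than 4000. It shows that writing the upgrade request succeeds even though nothing will ever answer, so only a read timeout reveals the problem.

```python
import socket
import threading


def start_silent_listener():
    """Accept TCP connections but never reply, similar to `nc -lk`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()
    accepted = []  # hold references so accepted sockets stay open

    def accept_loop():
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # listening socket was closed
            accepted.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def attempt_upgrade(port, timeout):
    """Send a WebSocket upgrade request and wait up to `timeout` seconds."""
    request = (
        "GET /socket/websocket HTTP/1.1\r\n"
        f"Host: 127.0.0.1:{port}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        "\r\n"
    ).encode()
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        # This succeeds: the bytes are only handed to the local TCP stack.
        client.sendall(request)
        try:
            data = client.recv(4096)
        except socket.timeout:
            return "timeout"  # no handshake response within the deadline
        return "closed" if data == b"" else "response"
```

Calling start_silent_listener and then attempt_upgrade with the returned port and a short timeout should come back with "timeout". Drop the timeout and the read blocks indefinitely, which matches the hang described above.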

I'm happy to write up a thin repo that reproduces this issue as soon as I have the time in the next couple of weeks, but I figured I'd raise the issue while it's still fresh. That said, the problem itself can be reproduced with the basic connection functionality, without any additional configuration.