
websocketd crashing with too many open files #51

Open · ssfrr opened this issue Feb 16, 2015 · 16 comments

ssfrr (Member) commented Feb 16, 2015

Error log (trimmed) is at https://gist.github.com/ssfrr/a7188f0d9a95b18a4e83.

Seems like this client is reconnecting every minute and (as far as the server knows) not closing the connections. Interestingly, all the connections seem to close at the same time (maybe when the client script is shut down?). Even if that's the case, we should handle this more gracefully on the server side.

One issue is that our application is happily sending data down to these connections without any exceptions being thrown, so I'm not sure how we're supposed to know when it's happening.

I'm not sure what the best solution would be though. A few options:

  1. Limit the number of open connections from a single IP address
  2. Require clients to periodically send some keepalive data and kill any connections that don't
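Option 2 could look something like the following sketch: a registry of last-activity timestamps plus a periodic reaper. Everything here (the class name, the timeout, the API) is hypothetical illustration, not existing chain_websocketd code:

```python
import time

class ConnectionReaper:
    """Track last-activity time per connection and reap stale ones.

    Hypothetical sketch of the keepalive option above; the timeout
    value and registry API are assumptions, not part of the daemon.
    """

    def __init__(self, timeout=120.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # connection id -> last keepalive timestamp

    def register(self, conn_id):
        # Called when a websocket connection is accepted.
        self.last_seen[conn_id] = self.clock()

    def keepalive(self, conn_id):
        # Called whenever the client sends any data (e.g. a ping message).
        self.last_seen[conn_id] = self.clock()

    def reap(self):
        """Return ids idle longer than `timeout` and drop them from the registry."""
        now = self.clock()
        stale = [c for c, t in self.last_seen.items() if now - t > self.timeout]
        for c in stale:
            del self.last_seen[c]
        return stale
```

A background greenlet (or timer) would call `reap()` periodically and close both the websocket and its associated ZMQ socket for each returned id.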
ssfrr added the bug label Feb 16, 2015
ssfrr (Member, Author) commented Feb 16, 2015

Some other considerations:

Currently the ulimit on open files for www-data is 1024, which limits the number of concurrent connections. We should bump that up, but not before we nail down this bug. See here for more info.
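For reference, the current limit can be checked, and a persistent raise for www-data sketched, roughly like this (the values below are placeholders, not tested recommendations):

```shell
# Show the per-process open-file limit for the current user (the 1024 above).
ulimit -n

# A persistent raise is usually configured in /etc/security/limits.conf,
# e.g. (placeholder values, not a recommendation):
#   www-data  soft  nofile  4096
#   www-data  hard  nofile  8192
```

The process has to be restarted (and, for limits.conf, started from a fresh login session) before a new limit takes effect.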

Limiting per IP address seems problematic in case there are a bunch of clients behind a NAT. Maybe if we set it to something large but still well below the ulimit maximum for open files.
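If we do go the per-IP route, the bookkeeping is simple enough. A minimal sketch (the class name and the limit of 64 are placeholders chosen to sit well below the 1024 ulimit, not part of chain_websocketd):

```python
from collections import Counter

class PerIPLimiter:
    """Reject new connections once a single IP exceeds `max_per_ip`.

    Hypothetical sketch; 64 is an arbitrary placeholder value.
    """

    def __init__(self, max_per_ip=64):
        self.max_per_ip = max_per_ip
        self.counts = Counter()  # ip -> number of open connections

    def try_acquire(self, ip):
        # Call on connect; refuse the connection if it returns False.
        if self.counts[ip] >= self.max_per_ip:
            return False
        self.counts[ip] += 1
        return True

    def release(self, ip):
        # Call on disconnect.
        if self.counts[ip] > 0:
            self.counts[ip] -= 1
```

The NAT concern still applies: many legitimate clients behind one address all count against the same bucket, which is why the cap would need to be generous.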

Edit: It's also possible that there's some sort of nginx thing going on where it's keeping the connections open.

ssfrr (Member, Author) commented Feb 16, 2015

More relevant info: http://security.stackexchange.com/questions/48378/anti-dos-websockets-best-practices

They don't give a clear recommendation or best practice, but there are a bunch of links I didn't dig into.

bmayton (Member) commented Mar 7, 2015

I just restarted websocketd again. Before doing so, I tried closing out the clients that were running, which appeared to have no effect.

I am beginning to suspect that the problem involves the ZMQ sockets, one of which is opened for each websocket connection. These are only ever closed if the associated websocket throws an exception when trying to send data. I'm not super familiar with flask/gevent's websocket implementations, but the logs do seem to be missing a lot of disconnected sockets.

One possible scenario is a site that is producing no data and a client that keeps reconnecting. The client connects, waits a while, then drops the connection because there's no activity, and then subsequently reconnects. chain_websocketd won't recognize that the client has disconnected (and thus won't disconnect the associated ZMQ socket) until it tries to send some data to the websocket, which isn't happening because the site isn't producing data. Meanwhile, hours go by and the client has reconnected enough times to create enough ZMQ sockets to eat up all of the allowed open files.

It might be good, instead of blocking on ZMQ receive, to do a select on both ZMQ receive and websocket receive. That way you should get notified as soon as the websocket can no longer receive, and clean up ZMQ. (Also, it's generally a good idea to read from the websocket at some point even if nothing is done with the data, so the client can't just fill up the buffers by sending it a bunch of data).
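The pattern described above (wait on both the upstream feed and the client socket, so a hangup is noticed even while the upstream is idle) can be sketched with plain sockets and the stdlib `selectors` module. This is illustrative only: flask-sockets does not expose a file descriptor to select on, as noted later in the thread, so the function below uses two ordinary connected sockets standing in for the ZMQ SUB socket and the websocket:

```python
import selectors
import socket

def forward_until_hangup(upstream, client):
    """Forward bytes from `upstream` to `client`, while also watching
    `client` for readability so a disconnect (EOF) is noticed even when
    the upstream produces no data. Both arguments are connected sockets.
    """
    sel = selectors.DefaultSelector()
    sel.register(upstream, selectors.EVENT_READ, "upstream")
    sel.register(client, selectors.EVENT_READ, "client")
    try:
        while True:
            for key, _ in sel.select():
                data = key.fileobj.recv(4096)
                if key.data == "client":
                    if not data:
                        # Client hung up: clean up immediately instead of
                        # waiting for the next send to fail.
                        return "client-hangup"
                    # Non-empty reads are simply drained, so the client
                    # can't fill our buffers by sending us data.
                else:
                    if not data:
                        return "upstream-closed"
                    client.sendall(data)
    finally:
        sel.close()
```

In the real daemon the "client hangup" branch is where the associated ZMQ socket would be closed, which is exactly the cleanup that currently only happens when a send raises.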

kkleidal (Contributor) commented Apr 1, 2015

We can't use select.select because it doesn't play well with gevent, and we can't use gevent.select.select because there are no file descriptors for Flask sockets. I've started writing my own implementation of a select loop over a single websocket socket and a single Flask socket, but flask-sockets (what we're using for websockets) is still very raw and doesn't support a non-blocking receive. We might need to look into alternative libraries or Socket.IO.

kkleidal (Contributor) commented Apr 1, 2015

I looked into the possibility of forking https://bitbucket.org/Jeffrey/gevent-websocket/ and changing it to be non-blocking, but it is a lot of low-level code. I think another option to explore is replacing the flask app with Node.js. Node is practically designed for asynchronous streaming. I'll explore that next time.

ssfrr (Member, Author) commented Apr 1, 2015

Yeah, the nice thing about the websocket daemon is that it's pretty small and simple, so replacing it with a Node-based solution should be pretty easy; it looks like there's a ZMQ library for Node.

My main concern is that it complicates deployment somewhat, because now we have to install Node/npm and the library, and there could be version incompatibilities since both Django and the Node server would be using the system ZMQ libs through separate wrappers.

Have you checked out Flask-SocketIO? It relies on the gevent-socketio library, so it's probably already all set for green thread blocking. I'm not sure if the socketio stuff introduces another layer of expected behavior on top of raw websockets, but it seems worth checking out.

kkleidal (Contributor) commented Apr 8, 2015

Not the same error, but a similar one (in the original code, not the PR), triggered by opening and closing 1200 ws connections, 200 at a time.

2015-04-08 15:23:48 9ac5085913f6 chain.websocketd[8513] INFO ws client connected for tag "site-1"
Exception AttributeError: "'NoneType' object has no attribute 'set'" in <bound method _Socket.__del__ of <zmq.green.core._Socket object at 0x7fbd226dfe88>> ignored
2015-04-08 15:23:48 9ac5085913f6 chain.websocketd[8513] ERROR Exception on /site-1 [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/chain-api-dev/flask_sockets.py", line 59, in inner
    return f(request.environ['wsgi.websocket'], *args, **kwargs)
  File "/opt/chain-api-dev/chain/websocketd.py", line 30, in site_socket
    zmq_sock = zmq_ctx.socket(zmq.SUB)
  File "/usr/local/lib/python2.7/dist-packages/pyzmq-14.1.1-py2.7-linux-x86_64.egg/zmq/sugar/context.py", line 114, in socket
    s = self._socket_class(self, socket_type)
  File "socket.pyx", line 227, in zmq.backend.cython.socket.Socket.__cinit__ (zmq/backend/cython/socket.c:2478)
ZMQError: Too many open files

kkleidal (Contributor) commented Apr 8, 2015

#57 crashes if the sockets are not closed properly on the client side. This is a different error from the one in the original code; #57 does NOT have the same problem I just posted for the original code.

kkleidal (Contributor) commented

I can't reproduce the crash I mentioned above. The client sometimes crashes when doing this, but I think that's due to a concurrency or garbage-collection issue that comes up when opening 200 connections concurrently and not closing them properly. I think #57 should be good to go.

ssfrr (Member, Author) commented Apr 20, 2015

Cool. Is the issue mentioned above ("#57 crashes if the sockets are not closed properly on the client side") still a problem or is that fixed?

Given that it's tough to reproduce, we can try pushing it (once we resolve the question above) and see if @bmayton's websocket clients still have issues.

bmayton (Member) commented Apr 28, 2015

I'm not sure what's actually been pushed to production, but I just restarted chain_websocketd because it was returning 504s to my graphite clients.

I will try to put together a test client that will reproduce the problem quickly.

ssfrr (Member, Author) commented Apr 28, 2015

#57 hasn't yet been merged or pushed to production. @kkleidal - is there still an issue to be resolved with the crash when the client doesn't close the websocket connection properly?

kkleidal (Contributor) commented

I can't reproduce that crash, @ssfrr. I think it's okay to take the risk and push it to test the waters.

ssfrr (Member, Author) commented Apr 28, 2015

Just merged #57. @bmayton let us know if your stuff is still pulling successfully (and hopefully not 504'ing).

ssfrr (Member, Author) commented Apr 28, 2015

Also, it occurs to me that this will make it tricky for @bmayton to put together a test failure script. :bowtie:

So if it's easier to just wait it out as long as it would normally take to crap out, that's OK by me.

kkleidal (Contributor) commented

Yeah, I tried to make a test failure script in #57 (chain/test_websocketd_volume.py), but it's a difficult problem to reproduce.
