
websocketd crashing with too many open files #51

Open · ssfrr opened this issue Feb 16, 2015 · 16 comments

ssfrr (Member) commented Feb 16, 2015

Error log (trimmed) is at https://gist.github.com/ssfrr/a7188f0d9a95b18a4e83.

Seems like this client is reconnecting every minute and (as far as the server knows) not closing the connections. Interestingly, all the connections seem to close at the same time (maybe when the client script is shut down?). Even if that's the case, we should handle this more gracefully on the server side.

One issue is that our application is happily sending data down to these connections without any exceptions being thrown, so I'm not sure how we're supposed to know when it's happening.

I'm not sure what the best solution would be though. A few options:

  1. Limit the number of open connections from a single IP address
  2. Require clients to periodically send some keepalive data and kill any connections that don't
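Option 2 could look something like the following sketch: a registry of last-activity timestamps plus a periodic reaper. Everything here (the class name, the timeout, the API) is hypothetical illustration, not existing chain_websocketd code:

```python
import time

class ConnectionReaper:
    """Track last-activity time per connection and reap stale ones.

    Hypothetical sketch of the keepalive option above; the timeout
    value and registry API are assumptions, not part of the daemon.
    """

    def __init__(self, timeout=120.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # connection id -> last keepalive timestamp

    def register(self, conn_id):
        # Called when a websocket connection is accepted.
        self.last_seen[conn_id] = self.clock()

    def keepalive(self, conn_id):
        # Called whenever the client sends any data (e.g. a ping message).
        self.last_seen[conn_id] = self.clock()

    def reap(self):
        """Return ids idle longer than `timeout` and drop them from the registry."""
        now = self.clock()
        stale = [c for c, t in self.last_seen.items() if now - t > self.timeout]
        for c in stale:
            del self.last_seen[c]
        return stale
```

A background greenlet (or timer) would call `reap()` periodically and close both the websocket and its associated ZMQ socket for each returned id.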
ssfrr added the bug label Feb 16, 2015
ssfrr (Member, Author) commented Feb 16, 2015

Some other considerations:

Currently the ulimit on open files for www-data is 1024, which limits the number of concurrent connections. We should bump that up, but not before we nail down this bug. See here for more info.
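For reference, the current limit can be checked, and a persistent raise for www-data sketched, roughly like this (the values below are placeholders, not tested recommendations):

```shell
# Show the per-process open-file limit for the current user (the 1024 above).
ulimit -n

# A persistent raise is usually configured in /etc/security/limits.conf,
# e.g. (placeholder values, not a recommendation):
#   www-data  soft  nofile  4096
#   www-data  hard  nofile  8192
```

The process has to be restarted (and, for limits.conf, started from a fresh login session) before a new limit takes effect.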

Limiting per IP address seems problematic in case there are a bunch of clients behind a NAT. Maybe if we set it to something large but still well below the ulimit maximum for open files.
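If we do go the per-IP route, the bookkeeping is simple enough. A minimal sketch (the class name and the limit of 64 are placeholders chosen to sit well below the 1024 ulimit, not part of chain_websocketd):

```python
from collections import Counter

class PerIPLimiter:
    """Reject new connections once a single IP exceeds `max_per_ip`.

    Hypothetical sketch; 64 is an arbitrary placeholder value.
    """

    def __init__(self, max_per_ip=64):
        self.max_per_ip = max_per_ip
        self.counts = Counter()  # ip -> number of open connections

    def try_acquire(self, ip):
        # Call on connect; refuse the connection if it returns False.
        if self.counts[ip] >= self.max_per_ip:
            return False
        self.counts[ip] += 1
        return True

    def release(self, ip):
        # Call on disconnect.
        if self.counts[ip] > 0:
            self.counts[ip] -= 1
```

The NAT concern still applies: many legitimate clients behind one address all count against the same bucket, which is why the cap would need to be generous.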

Edit: It's also possible that there's some sort of nginx thing going on where it's keeping the connections open.

ssfrr (Member, Author) commented Feb 16, 2015

More relevant info: http://security.stackexchange.com/questions/48378/anti-dos-websockets-best-practices

They don't give a clear recommendation or best practice, but there are a bunch of links I didn't dig into.

bmayton (Member) commented Mar 7, 2015

I just restarted websocketd again. Before doing so, I tried closing out the clients that were running, which appeared to have no effect.

I am beginning to suspect that the problem involves the ZMQ sockets, one of which is opened for each websocket connection. These are only ever closed if the associated websocket throws an exception when trying to send data. I'm not super familiar with flask/gevent's websocket implementations, but the logs do seem to be missing a lot of disconnected sockets.

One possible scenario is a site that is producing no data and a client that keeps reconnecting. The client connects, waits a while, then drops the connection because there's no activity, and then subsequently reconnects. chain_websocketd won't recognize that the client has disconnected (and thus won't disconnect the associated ZMQ socket) until it tries to send some data to the websocket, which isn't happening because the site isn't producing data. Meanwhile, hours go by and the client has reconnected enough times to create enough ZMQ sockets to eat up all of the allowed open files.

It might be good, instead of blocking on ZMQ receive, to do a select on both ZMQ receive and websocket receive. That way you should get notified as soon as the websocket can no longer receive, and clean up ZMQ. (Also, it's generally a good idea to read from the websocket at some point even if nothing is done with the data, so the client can't just fill up the buffers by sending it a bunch of data).
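The pattern described above (wait on both the upstream feed and the client socket, so a hangup is noticed even while the upstream is idle) can be sketched with plain sockets and the stdlib `selectors` module. This is illustrative only: flask-sockets does not expose a file descriptor to select on, as noted later in the thread, so the function below uses two ordinary connected sockets standing in for the ZMQ SUB socket and the websocket:

```python
import selectors
import socket

def forward_until_hangup(upstream, client):
    """Forward bytes from `upstream` to `client`, while also watching
    `client` for readability so a disconnect (EOF) is noticed even when
    the upstream produces no data. Both arguments are connected sockets.
    """
    sel = selectors.DefaultSelector()
    sel.register(upstream, selectors.EVENT_READ, "upstream")
    sel.register(client, selectors.EVENT_READ, "client")
    try:
        while True:
            for key, _ in sel.select():
                data = key.fileobj.recv(4096)
                if key.data == "client":
                    if not data:
                        # Client hung up: clean up immediately instead of
                        # waiting for the next send to fail.
                        return "client-hangup"
                    # Non-empty reads are simply drained, so the client
                    # can't fill our buffers by sending us data.
                else:
                    if not data:
                        return "upstream-closed"
                    client.sendall(data)
    finally:
        sel.close()
```

In the real daemon the "client hangup" branch is where the associated ZMQ socket would be closed, which is exactly the cleanup that currently only happens when a send raises.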

kkleidal (Contributor) commented Apr 1, 2015

We can't use select.select because it doesn't play well with gevent, and we can't use gevent.select.select because there are no file descriptors for Flask sockets. I've started writing my own implementation of a select loop over a single websocket socket and a single Flask socket, but flask-sockets (what we're using for websockets) is still very raw and doesn't support a non-blocking receive. We might need to look into alternative libraries or Socket.IO.

kkleidal (Contributor) commented Apr 1, 2015

I looked into the possibility of forking https://bitbucket.org/Jeffrey/gevent-websocket/ and changing it to be non-blocking, but it is a lot of low-level code. I think another option to explore is replacing the flask app with Node.js. Node is practically designed for asynchronous streaming. I'll explore that next time.

ssfrr (Member, Author) commented Apr 1, 2015

Yeah, the nice thing about the websocket daemon is that it's pretty small and simple, so replacing it with a Node-based solution should be pretty easy; it looks like there's a ZMQ library for Node.

My main concern is that it complicates deployment somewhat, because now we have to install Node/npm and the library, and there could be version incompatibilities since both Django and the Node server would be using the system ZMQ libs through separate wrappers.

Have you checked out Flask-SocketIO? It relies on the gevent-socketio library, so it's probably already all set for green thread blocking. I'm not sure if the socketio stuff introduces another layer of expected behavior on top of raw websockets, but it seems worth checking out.

kkleidal (Contributor) commented Apr 8, 2015

Not the same error, but a similar one (in the original code, not the PR), triggered by opening and closing 1200 ws connections, 200 at a time.

2015-04-08 15:23:48 9ac5085913f6 chain.websocketd[8513] INFO ws client connected for tag "site-1"
Exception AttributeError: "'NoneType' object has no attribute 'set'" in <bound method _Socket.__del__ of <zmq.green.core._Socket object at 0x7fbd226dfe88>> ignored
2015-04-08 15:23:48 9ac5085913f6 chain.websocketd[8513] ERROR Exception on /site-1 [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/chain-api-dev/flask_sockets.py", line 59, in inner
    return f(request.environ['wsgi.websocket'], *args, **kwargs)
  File "/opt/chain-api-dev/chain/websocketd.py", line 30, in site_socket
    zmq_sock = zmq_ctx.socket(zmq.SUB)
  File "/usr/local/lib/python2.7/dist-packages/pyzmq-14.1.1-py2.7-linux-x86_64.egg/zmq/sugar/context.py", line 114, in socket
    s = self._socket_class(self, socket_type)
  File "socket.pyx", line 227, in zmq.backend.cython.socket.Socket.__cinit__ (zmq/backend/cython/socket.c:2478)
ZMQError: Too many open files

kkleidal (Contributor) commented Apr 8, 2015

#57 crashes if the sockets are not closed properly on the client side. This is a different error from the one in the original code; #57 does NOT have the same problem I just posted for the original code.

kkleidal (Contributor) commented

I can't reproduce the crash I mentioned above. The client sometimes crashes when doing this, but I think that's due to a concurrency or garbage-collection issue that comes up when opening 200 connections concurrently and not closing them properly. I think #57 should be good to go.

ssfrr (Member, Author) commented Apr 20, 2015

Cool. Is the issue mentioned above ("#57 crashes if the sockets are not closed properly on the client side") still a problem or is that fixed?

Given that it's tough to reproduce, we can try pushing it (once we resolve the question above) and see if @bmayton's websocket clients still have issues.

bmayton (Member) commented Apr 28, 2015

I'm not sure what's actually been pushed to production, but I just restarted chain_websocketd because it was returning 504s to my graphite clients.

I will try to put together a test client that will reproduce the problem quickly.

ssfrr (Member, Author) commented Apr 28, 2015

#57 hasn't yet been merged or pushed to production. @kkleidal - is there still an issue to be resolved with the crash when the client doesn't close the websocket connection properly?

kkleidal (Contributor) commented

I can't reproduce that crash, @ssfrr. I think it's okay to take the risk and push it to test the waters.

ssfrr (Member, Author) commented Apr 28, 2015

Just merged #57. @bmayton let us know if your stuff is still pulling successfully (and hopefully not 504'ing).

ssfrr (Member, Author) commented Apr 28, 2015

Also, it occurs to me that this will make it tricky for @bmayton to put together a test failure script. :bowtie:

So if it's easier to just wait it out as long as it would normally take to crap out, that's OK by me.

kkleidal (Contributor) commented

Yeah, I tried to make a test failure script in #57 (chain/test_websocketd_volume.py), but it's a difficult problem to reproduce.
