websocketd crashing with too many open files #51
Comments
Some other considerations: currently the ulimit on open files for www-data is 1024, which caps the number of concurrent connections. We should bump that up, but not before we nail down this bug. See here for more info. Limiting connections per IP address seems problematic in case there are a bunch of clients behind a NAT; maybe we could set a per-IP limit to something large but still well below the ulimit maximum for open files.

Edit: It's also possible that there's some sort of nginx thing going on where it's keeping the connections open.
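For reference, a quick way to see how close the daemon is to that limit (a small diagnostic sketch, not something from this thread; it assumes a Linux host where /proc/self/fd is available):

```python
# Diagnostic sketch: report this process's open-file limit and how many
# descriptors it currently holds. Assumes Linux (/proc/self/fd).
import os
import resource

def open_file_stats():
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir('/proc/self/fd'))
    return soft, hard, open_fds

if __name__ == '__main__':
    soft, hard, open_fds = open_file_stats()
    print('open fds: %d (soft limit %d, hard limit %d)' % (open_fds, soft, hard))
```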
More relevant info: http://security.stackexchange.com/questions/48378/anti-dos-websockets-best-practices. They don't give a clear recommendation or best practice, but there are a bunch of links I didn't dig into.
I just restarted websocketd again. Before doing so, I tried closing out the clients that were running, which appeared to have no effect.

I am beginning to suspect that the problem involves the ZMQ sockets, one of which is opened for each websocket connection. These are only ever closed if the associated websocket throws an exception when trying to send data. I'm not super familiar with Flask/gevent's websocket implementations, but the logs do seem to be missing a lot of disconnected sockets.

One possible scenario is a site that is producing no data and a client that keeps reconnecting. The client connects, waits a while, drops the connection because there's no activity, and then reconnects. chain_websocketd won't recognize that the client has disconnected (and thus won't close the associated ZMQ socket) until it tries to send some data to the websocket, which never happens because the site isn't producing data. Meanwhile, hours go by and the client has reconnected enough times to create enough ZMQ sockets to eat up all of the allowed open files.

Instead of blocking on the ZMQ receive, it might be better to do a select on both the ZMQ receive and the websocket receive. That way we'd be notified as soon as the websocket can no longer receive, and could clean up the ZMQ socket. (Also, it's generally a good idea to read from the websocket at some point even if nothing is done with the data, so a client can't fill up the buffers by sending a bunch of data.)
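A rough sketch of that idea under gevent (not the actual chain_websocketd code; `handle_connection` and `endpoint` are made-up names, and it assumes flask-sockets/gevent-websocket semantics where `ws.receive()` returns None once the client disconnects): block on the websocket side so a disconnect is noticed even when the site is silent, and forward ZMQ data from a separate greenlet.

```python
# Illustrative sketch only, not the actual chain_websocketd code.
import gevent
import zmq.green as zmq  # gevent-friendly pyzmq, so recv() yields to other greenlets

def handle_connection(ws, zmq_context, endpoint):
    zsock = zmq_context.socket(zmq.SUB)
    zsock.setsockopt(zmq.SUBSCRIBE, b'')
    zsock.connect(endpoint)

    def forward():
        # Push site data from ZMQ out to the websocket as it arrives.
        # (ws.send may raise if the socket dies mid-send, which just ends
        # this greenlet.)
        while True:
            ws.send(zsock.recv())

    forwarder = gevent.spawn(forward)
    try:
        # Block on the websocket instead of on ZMQ: receive() returns None
        # as soon as the client goes away, even if the site never produces
        # data, so we can clean up promptly. Reading also drains anything
        # the client sends us.
        while ws.receive() is not None:
            pass
    finally:
        forwarder.kill()
        zsock.close()
```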
We can't use select.select because it doesn't play well with gevent, and we can't use gevent.select.select either; in both cases the problem is that Flask sockets don't expose file descriptors. I've started writing my own implementation of the select loop over a single ZMQ socket and a single Flask socket, but flask-sockets (what we're using for websockets) is still very raw and doesn't support a non-blocking call to receive... We might need to look into alternative libraries or Socket.IO.
I looked into the possibility of forking https://bitbucket.org/Jeffrey/gevent-websocket/ and changing it to be non-blocking, but it is a lot of low-level code. I think another option to explore is replacing the Flask app with Node.js. Node is practically designed for asynchronous streaming. I'll explore that next time.
Yeah, the nice thing about the websocket daemon is that it's pretty small and simple, so replacing it with a Node-based solution should be pretty easy; it looks like there's a ZMQ library for Node. My main concern is that it complicates deployment somewhat, because now we have to install Node/npm and the library, and there could be version incompatibilities, since both Django and the Node server would be using the system ZMQ libs through separate wrappers.

Have you checked out Flask-SocketIO? It relies on the gevent-socketio library, so it's probably already set up for green-thread blocking. I'm not sure whether the Socket.IO stuff introduces another layer of expected behavior on top of raw websockets, but it seems worth checking out.
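For what it's worth, a minimal Flask-SocketIO sketch (not something tried in this thread; the '/stream' namespace and handler bodies are made up for illustration). The relevant feature here is that it delivers explicit connect/disconnect events, so per-connection ZMQ sockets could be opened and closed in one place instead of waiting for a failed send.

```python
# Minimal Flask-SocketIO sketch; namespace and handlers are hypothetical.
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)

@socketio.on('connect', namespace='/stream')
def on_connect():
    # open the per-client ZMQ socket here
    pass

@socketio.on('disconnect', namespace='/stream')
def on_disconnect():
    # close the per-client ZMQ socket here, even if we never sent any data
    pass

if __name__ == '__main__':
    socketio.run(app)
```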
Not the same error but a similar error (in the original code, not the PR). Created by opening and closing 1200 ws connections, 200 at a time.
I can't reproduce the crash I mentioned above. The client itself sometimes crashes when doing this, but I think that has to do with some concurrency or garbage-collection issue that comes up when opening 200 connections concurrently and not closing them properly. I think #57 should be good to go.
I'm not sure what's actually been pushed to production, but I just restarted chain_websocketd because it was returning 504s to my graphite clients. I will try to put together a test client that will reproduce the problem quickly.
I can't reproduce that crash, @ssfrr. I think it's okay to take the risk and push it to test the waters.
Also, it occurs to me that this will make it tricky for @bmayton to put together a test failure script. So if it's easier to just wait it out for as long as it would normally take to crap out, that's OK by me.
Yeah, I tried to make a test failure script in #57 (chain/test_websocketd_volume.py), but it's a difficult problem to reproduce.
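The general shape of such a client might look like the sketch below (not the actual chain/test_websocketd_volume.py from #57; the URL is a placeholder). It mimics the reproduction described earlier: connect and disconnect in batches so any leaked per-connection resources accumulate on the server.

```python
# Sketch of a volume-test client using the websocket-client package.
# The endpoint URL and parameters are placeholders, not the project's real ones.
import time
from websocket import create_connection

URL = 'ws://localhost:8000/ws/site-8'   # placeholder endpoint

def churn(rounds=1200, batch=200, hold_seconds=1.0):
    for start in range(0, rounds, batch):
        conns = [create_connection(URL) for _ in range(batch)]
        time.sleep(hold_seconds)       # hold the connections open briefly
        for ws in conns:
            ws.close()                 # then close them all at once
        print('closed connections %d-%d' % (start + 1, start + batch))

if __name__ == '__main__':
    churn()
```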
Error log (trimmed) is at https://gist.github.com/ssfrr/a7188f0d9a95b18a4e83.
Seems like this client is reconnecting every minute and (as far as the server knows) not closing the connections. Interestingly, all the connections seem to close at the same time (maybe when the client script is shut down?). Even if that's the case, we should handle this more gracefully on the server side.
One issue is that our application is happily sending data down to these connections without any exceptions being thrown, so I'm not sure how we're supposed to know when it's happening.
I'm not sure what the best solution would be, though. A few options: