Hasura using more CPU over time #2565

Closed
JordanCheney opened this issue Jul 18, 2019 · 12 comments · Fixed by #2581

JordanCheney commented Jul 18, 2019

Hello,

I recently upgraded to Hasura 1.0.0-beta.3 and I've noticed that CPU usage in the Docker container generally increases over time, even when no requests are being made to the Hasura server.

Here are the logs produced by Hasura (mostly showing nothing is happening):

{"timestamp":"2019-07-17T21:51:27.827+0000","level":"info","type":"startup","detail":{"kind":"server_configuration","info":{"live_query_options":{"fallback_options":{"refetch_delay":1000},"multiplexed_options":{"batch_size":100,"refetch_delay":1000}},"transaction_isolation":"ISOLATION LEVEL READ COMMITTED","enabled_log_types":["http-log","websocket-log","startup","webhook-log","query-log"],"server_host":"HostAny","enable_allowlist":false,"log_level":"debug","auth_hook_mode":null,"use_prepared_statements":true,"unauth_role":null,"stringify_numeric_types":false,"enabled_apis":["metadata","graphql"],"enable_telemetry":false,"enable_console":true,"auth_hook":null,"jwt_secret":null,"cors_config":{"allowed_origins":"*","disabled":false,"ws_read_cookie":null},"console_assets_dir":null,"admin_secret_set":true,"port":8080}}}
{"timestamp":"2019-07-17T21:51:27.827+0000","level":"info","type":"startup","detail":{"kind":"postgres_connection","info":{"database":"postgres","retries":1,"user":"postgres","host":"postgres","port":5432}}}
{"internal":"could not connect to server: Connection refused\n\tIs the server running on host \"postgres\" (172.129.0.3) and accepting\n\tTCP/IP connections on port 5432?\n","path":"$","error":"connection error","code":"postgres-error"}
{"timestamp":"2019-07-17T21:51:29.951+0000","level":"info","type":"startup","detail":{"kind":"server_configuration","info":{"live_query_options":{"fallback_options":{"refetch_delay":1000},"multiplexed_options":{"batch_size":100,"refetch_delay":1000}},"transaction_isolation":"ISOLATION LEVEL READ COMMITTED","enabled_log_types":["http-log","websocket-log","startup","webhook-log","query-log"],"server_host":"HostAny","enable_allowlist":false,"log_level":"debug","auth_hook_mode":null,"use_prepared_statements":true,"unauth_role":null,"stringify_numeric_types":false,"enabled_apis":["metadata","graphql"],"enable_telemetry":false,"enable_console":true,"auth_hook":null,"jwt_secret":null,"cors_config":{"allowed_origins":"*","disabled":false,"ws_read_cookie":null},"console_assets_dir":null,"admin_secret_set":true,"port":8080}}}
{"timestamp":"2019-07-17T21:51:29.951+0000","level":"info","type":"startup","detail":{"kind":"postgres_connection","info":{"database":"postgres","retries":1,"user":"postgres","host":"postgres","port":5432}}}
{"internal":"could not connect to server: Connection refused\n\tIs the server running on host \"postgres\" (172.129.0.3) and accepting\n\tTCP/IP connections on port 5432?\n","path":"$","error":"connection error","code":"postgres-error"}
{"timestamp":"2019-07-17T21:51:31.891+0000","level":"info","type":"startup","detail":{"kind":"server_configuration","info":{"live_query_options":{"fallback_options":{"refetch_delay":1000},"multiplexed_options":{"batch_size":100,"refetch_delay":1000}},"transaction_isolation":"ISOLATION LEVEL READ COMMITTED","enabled_log_types":["http-log","websocket-log","startup","webhook-log","query-log"],"server_host":"HostAny","enable_allowlist":false,"log_level":"debug","auth_hook_mode":null,"use_prepared_statements":true,"unauth_role":null,"stringify_numeric_types":false,"enabled_apis":["metadata","graphql"],"enable_telemetry":false,"enable_console":true,"auth_hook":null,"jwt_secret":null,"cors_config":{"allowed_origins":"*","disabled":false,"ws_read_cookie":null},"console_assets_dir":null,"admin_secret_set":true,"port":8080}}}
{"timestamp":"2019-07-17T21:51:31.891+0000","level":"info","type":"startup","detail":{"kind":"postgres_connection","info":{"database":"postgres","retries":1,"user":"postgres","host":"postgres","port":5432}}}
{"timestamp":"2019-07-17T21:51:33.992+0000","level":"info","type":"startup","detail":{"kind":"db_init","info":"successfully initialised"}}
{"timestamp":"2019-07-17T21:51:33.992+0000","level":"info","type":"startup","detail":{"kind":"db_migrate","info":"already at the latest version. current version: \"17\""}}
{"timestamp":"2019-07-17T21:51:33.992+0000","level":"info","type":"startup","detail":{"kind":"schema-sync","info":{"thread_id":"ThreadId 86","instance_id":"5e7dd705-9331-474a-bda4-986d78801ffa","message":"listener thread started"}}}
{"timestamp":"2019-07-17T21:51:33.992+0000","level":"info","type":"startup","detail":{"kind":"schema-sync","info":{"thread_id":"ThreadId 87","instance_id":"5e7dd705-9331-474a-bda4-986d78801ffa","message":"processor thread started"}}}
{"timestamp":"2019-07-17T21:51:33.992+0000","level":"info","type":"startup","detail":{"kind":"event_triggers","info":"preparing data"}}
{"timestamp":"2019-07-17T21:51:33.992+0000","level":"info","type":"startup","detail":{"kind":"event_triggers","info":"starting workers"}}
{"timestamp":"2019-07-17T21:51:33.992+0000","level":"info","type":"startup","detail":{"kind":"server","info":{"time_taken":2.216395209,"message":"starting API server"}}}

Here is a screenshot of a Grafana dashboard showing the CPU use over 6 hours:

[screenshot: hasura_cpu_usage]

Here is a minimal docker-compose script that can recreate the test environment and the monitoring tools.

hasura_test.zip
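
For orientation, here is a minimal sketch of what such a compose file might look like; the image tags, credentials, and omitted monitoring services are assumptions, and the attached archive is the authoritative setup:

version: "3.6"
services:
  postgres:
    image: postgres:11                         # assumed tag
    environment:
      - POSTGRES_PASSWORD=postgrespassword     # hypothetical credential
  graphql-engine:
    image: hasura/graphql-engine:v1.0.0-beta.3
    ports:
      - "8080:8080"
    depends_on:
      - postgres
    environment:
      - HASURA_GRAPHQL_DATABASE_URL=postgres://postgres:postgrespassword@postgres:5432/postgres
      - HASURA_GRAPHQL_ENABLE_CONSOLE=true
      - HASURA_GRAPHQL_ADMIN_SECRET=myadminsecret   # hypothetical; the startup log above shows an admin secret is set
  # The monitoring stack (cAdvisor/Prometheus/Grafana or similar) behind the
  # Grafana graph above is omitted here for brevity.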


lorenzo commented Jul 19, 2019

Out of curiosity, is memory increasing as well?

@JordanCheney (Author)

I had to recreate the experiment, so the times are different, but on run 2 it appears memory is also increasing.

[screenshot: hasura_memory_use]


lorenzo commented Jul 20, 2019

I think the CPU usage increase is due to the garbage collector doing more work as time passes. This may be hinting at a possible memory leak or just at the need to move some live data out of the garbage collector.

What's the experiment about? Does this happen in other "normal" circumstances?


lorenzo commented Jul 20, 2019

@JordanCheney also, out of curiosity, would you mind adding this env var and seeing if it makes any difference?

GHCRTS=-I0

This flag disables idle garbage collection, which may reduce the CPU load when the server is not doing other work.

@JordanCheney (Author)

@lorenzo I added the new env var; I'll report back with results in a few hours when the run is finished. The experiment intentionally does nothing besides running the containers (no requests are made to Hasura). I don't have any normal runs of beta.3 to compare against, since this behavior was flagged before the new version made it to real testing.

@JordanCheney (Author)

@lorenzo It looks like disabling the idle garbage collector fixed the behavior:
[screenshot: hasura_ghcrts_io]

What impact will this change have on system performance?


jberryman commented Jul 22, 2019

@JordanCheney thanks for reporting. I've seen this crop up before, see e.g. PostgREST/postgrest#565

It should be perfectly safe to disable idle GC as @lorenzo helpfully suggested (thanks!). The idea behind idle GC is that if the program is doing nothing, we might as well perform a GC. With idle GC disabled, you might find that e.g. the next request to the server after an idle period has higher average latency (because a GC might happen to be triggered then). It really shouldn't matter much for a production service. I've also found, in another service I worked on, that -I2 made a similar memory increase go away.

We're going to look a bit closer at this and will update the ticket when the fix is merged.
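
For context, GHC's RTS option -I<seconds> sets the idle-GC interval (the default is 0.3 seconds), and -I0 disables idle GC entirely. Below is a sketch of how the two settings mentioned in this thread could be passed through the GHCRTS environment variable in a compose file; the service name is illustrative:

services:
  graphql-engine:
    environment:
      # Disable idle GC entirely (the workaround suggested above):
      - GHCRTS=-I0
      # Alternative: keep idle GC but only run it after 2 seconds of
      # idle time, as tried on another service:
      # - GHCRTS=-I2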

@bartjuh4

We are experiencing the same performance issue. Is there a workaround at the moment to disable the idle GC?
[screenshot: 2019-07-22 08:39:59]

jberryman added a commit to jberryman/graphql-engine that referenced this issue Jul 22, 2019
This seems to resolve the issue locally (and has worked in the past),
but it's not clear what exactly is going on here (in particular, why
this should resolve what looks like a memory leak). It certainly seems
like a GHC issue of some sort.

Closes hasura#2565
@JordanCheney (Author)

@bartjuh4 @lorenzo suggested adding GHCRTS=-I0 as an environment variable, which solved the problem for me.

@bartjuh4

> @bartjuh4 @lorenzo suggested adding GHCRTS=-I0 as an environment variable, which solved the problem for me.

Thanks. One (silly) question: how did you add that flag to a Docker instance?

@lexi-lambda (Contributor)

@bartjuh4 If you’re running the container from the command line via docker run, you can add the -e GHCRTS=-I0 flag. If you’re using docker-compose, you can add

environment:
  - GHCRTS=-I0

to your container configuration. Other mechanisms of running containers have other ways to control the environment.
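
Combined with the compose sketch earlier in the thread, the relevant service entry would look roughly like this (the service name and database URL are illustrative):

services:
  graphql-engine:
    image: hasura/graphql-engine:v1.0.0-beta.3
    environment:
      - HASURA_GRAPHQL_DATABASE_URL=postgres://postgres:postgrespassword@postgres:5432/postgres
      - GHCRTS=-I0   # disable GHC's idle garbage collection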

lexi-lambda pushed a commit to jberryman/graphql-engine that referenced this issue Jul 24, 2019
lexi-lambda pushed a commit that referenced this issue Jul 24, 2019
@jberryman (Collaborator)

Survey of possibly relevant GHC issues: #2581 (comment)

polRk pushed a commit to polRk/graphql-engine that referenced this issue Feb 12, 2020