Potential RAM Leak #1140
Comments
Tagging previous work on this: #1037 |
My first intuition would be to look at the state snapshot intervals and pruning intervals |
Another dashboard on the same node (summary screenshot): |
@czarcas7ic mentioned that this node is on default pruning. Based on these intervals, snapshots look like they might be the cause. @czarcas7ic is going to set up a node with the same specs but with snapshots disabled for comparison |
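(For reference, a minimal sketch of checking the settings under discussion on each node. The key names follow the standard cosmos-sdk app.toml layout; the config path and the use of viper directly are assumptions for illustration, not how the test nodes were actually inspected.)

```go
package main

import (
	"fmt"

	"github.com/spf13/viper"
)

func main() {
	v := viper.New()
	v.SetConfigFile("/root/.osmosisd/config/app.toml") // assumed node home directory
	if err := v.ReadInConfig(); err != nil {
		panic(err)
	}

	// Pruning settings (top-level keys in app.toml).
	fmt.Println("pruning:", v.GetString("pruning"))
	fmt.Println("pruning-keep-recent:", v.GetString("pruning-keep-recent"))
	fmt.Println("pruning-interval:", v.GetString("pruning-interval"))

	// State-sync snapshot creation; snapshot-interval = 0 disables snapshots,
	// which is the variable being toggled between the two comparison nodes.
	fmt.Println("snapshot-interval:", v.GetInt("state-sync.snapshot-interval"))
	fmt.Println("snapshot-keep-recent:", v.GetInt("state-sync.snapshot-keep-recent"))
}
```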
Just set them both up and @p0mvn and I will be monitoring them |
Wouldn't a mem pprof help diagnose what's using so much memory? |
@alexanderbez I can take one on all three VMs after they run for 24 hours and see if there is any difference between the various configurations (tx indexing and snapshot settings). How long do you think the pprof should be taken for? |
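(For context on the duration question: a heap profile is a point-in-time snapshot rather than something captured over a window, so the run length mostly matters for collecting several profiles over time and diffing them. Below is a minimal sketch of the standard Go pprof endpoint the node exposes when profiling is enabled; the listen address here is an assumption, and on a real node it is controlled by the pprof_laddr setting in config.toml rather than code like this.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Mirrors what the node does when a pprof listen address is configured:
	// serve the standard Go profiling endpoints over HTTP.
	//
	// Heap snapshot (instantaneous), e.g. after the node has run for a while:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// 30-second CPU profile, useful for correlating CPU spikes with RAM spikes:
	//   go tool pprof -seconds 30 http://localhost:6060/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```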
The three test nodes:
- Tx indexer on and snapshots on
- Tx indexer off and snapshots on
- Tx indexer on and snapshots off
All three showed a continuous increase in RAM. Restarting these nodes brought RAM back to the beginning threshold, after which it proceeded to slowly increase once again. cc @moniya12 |
Update: @czarcas7ic and I saw in pprof that the fast cache accounts for a large share of the objects and heap. Since there are several stores, we may have up to 780K fast nodes in cache for each:
We made a guess that this is simply too much across that many stores. As a result, the cache keeps growing incrementally as we keep using the nodes, until the process OOMs. We decided to hardcode a small value of 10K to see what happens. That seems to have helped: the memory has been stable for a day. However, setting the fast cache too low might impact performance negatively. Next steps:
|
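(A minimal sketch of the idea being tested here, not the actual osmosis-labs/iavl code: bound each store's fast-node cache by entry count, e.g. the 10K cap from the experiment, so it stops growing with usage. All names below are hypothetical.)

```go
package main

import "container/list"

// fastNodeCache is an entry-count-bounded LRU keyed by node key.
type fastNodeCache struct {
	maxEntries int                      // e.g. 10_000, the value hardcoded in the experiment
	order      *list.List               // most recently used at the front
	entries    map[string]*list.Element // key -> element whose Value is *cacheItem
}

type cacheItem struct {
	key   string
	value []byte // serialized fast node
}

func newFastNodeCache(maxEntries int) *fastNodeCache {
	return &fastNodeCache{
		maxEntries: maxEntries,
		order:      list.New(),
		entries:    make(map[string]*list.Element),
	}
}

func (c *fastNodeCache) Get(key string) ([]byte, bool) {
	if el, ok := c.entries[key]; ok {
		c.order.MoveToFront(el) // mark as recently used
		return el.Value.(*cacheItem).value, true
	}
	return nil, false
}

func (c *fastNodeCache) Set(key string, value []byte) {
	if el, ok := c.entries[key]; ok {
		c.order.MoveToFront(el)
		el.Value.(*cacheItem).value = value
		return
	}
	c.entries[key] = c.order.PushFront(&cacheItem{key: key, value: value})

	// Evict the least recently used entry once the cap is reached; without
	// this bound the cache grows with usage until the process OOMs.
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		delete(c.entries, oldest.Value.(*cacheItem).key)
		c.order.Remove(oldest)
	}
}

func main() {
	c := newFastNodeCache(10_000)
	c.Set("s/k:bank/fabc", []byte("fast node bytes"))
	_, _ = c.Get("s/k:bank/fabc")
}
```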
Nice work @p0mvn! |
This seems especially severe in v6, but I don't know why. I am syncing a node with rocksdb like:
Notably these use rocksdb, and are part of an effort to make a fast full (non-statesync) scripted approach to making an archive node. |
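(The actual sync command isn't shown above. As a general sketch only: opening an application database on the rocksdb backend via tm-db, assuming a v0.6-style tm-db constructor and a binary built with rocksdb support, e.g. the rocksdb build tag; the data directory is illustrative.)

```go
package main

import (
	"fmt"

	dbm "github.com/tendermint/tm-db"
)

func main() {
	// Opens (or creates) <dir>/application.db with the rocksdb backend.
	// Requires the binary to be compiled with rocksdb support; otherwise
	// tm-db fails to construct the backend.
	db, err := dbm.NewDB("application", dbm.RocksDBBackend, "/root/.osmosisd/data")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	fmt.Println("opened rocksdb-backed application db")
}
```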
Does v7 have various IAVL and DB improvements? Why test on v6? |
I believe @faddat is doing a full sync to get rocksdb caught up to the chain. As far as I know, he is one of the first to try that, so there are no publicly available snapshots on rocksdb yet. The memory problem on v6 is probably due to the fast IAVL changes and their cache, which has been causing issues. We'll need to backport the configurable bytes cache to v6 to mitigate that: osmosis-labs/iavl#40 |
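(The backported fix is described as a configurable bytes cache, i.e. bounded by total size rather than entry count. A rough sketch of that accounting follows; the names and the FIFO eviction are assumptions for illustration, not the actual osmosis-labs/iavl#40 code, which a real implementation would pair with LRU ordering.)

```go
package main

import "fmt"

type entry struct {
	key   string
	value []byte
}

// bytesBoundedCache tracks the approximate byte size of what it holds and
// evicts until it is back under a configurable budget, so memory use is
// bounded in bytes rather than in node count.
type bytesBoundedCache struct {
	maxBytes int     // configurable budget, e.g. set from node config
	curBytes int     // running total of cached key+value sizes
	queue    []entry // oldest first; a real cache would use an LRU list
}

func (c *bytesBoundedCache) add(key string, value []byte) {
	c.queue = append(c.queue, entry{key: key, value: value})
	c.curBytes += len(key) + len(value)

	// Evict oldest entries until the total size fits the budget.
	for c.curBytes > c.maxBytes && len(c.queue) > 0 {
		old := c.queue[0]
		c.queue = c.queue[1:]
		c.curBytes -= len(old.key) + len(old.value)
	}
}

func main() {
	c := &bytesBoundedCache{maxBytes: 64 * 1024 * 1024}
	c.add("s/k:bank/fabc", make([]byte, 1024))
	fmt.Println("cached bytes:", c.curBytes)
}
```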
@p0mvn -- so, in the past I was able to do rocks via statesync. Sadly, I never saved one of those; I used state sync to make them quickly, so there really wasn't a need to. Concerning the memory leak: it is real and seems to grow as chain state grows, but the patch you are writing does seem to handle it. Osmosis still uses a ton of RAM (50+ GB), but it stabilizes there. Also, I'm using a 150 MB cache, so maybe it would be less if I set that value in app.toml lower. added QmfEuEx4r6sTZmp1odCMrytV4TWvyrn4ZxYHHVSgt1P38M osmo-v6-rocks.squashfs -- that's the IPFS CID (1 TB) of osmosis v6, midway through. One thing I am doing differently this time is making checkpoints of my work so that if I hit apphash doom, I can go back to them. |
I think this can be closed as there is no action needed. We can always find this in an archive if any information is needed |
Currently running a mainnet node and visualizing data with Prometheus due to reports of OOM on machines with more than the recommended amount of RAM.
After ~20 hrs of running, RAM usage is in fact spiking in cycles and slowly increasing.
On Discord, Geo reports OOMing with 64 GB of RAM on an archive node.
Something else I have noticed is that restarting the node brings RAM back to a normal level; however, it continues to increase as time goes on.
Also, the CPU spikes correlate with the RAM spikes.
CC: @p0mvn