
Potential RAM Leak #1140

Closed · Tracked by #1172
czarcas7ic opened this issue Mar 24, 2022 · 15 comments
Labels: T:task ⚙️ A task belongs to a story

@czarcas7ic
Member

czarcas7ic commented Mar 24, 2022

Currently running a mainnet node and visualizing data with Prometheus, due to reports of OOM on machines with more than the recommended amount of RAM.

After ~20 hours of running, RAM usage is in fact spiking in cycles and slowly increasing:
(screenshot: RAM usage graph, taken Mar 24, 2022 at 12:17 PM)

On Discord, Geo reports OOMing with 64GB of RAM on an archive node.

Something else I have noticed: restarting the node brings RAM back to a normal level, but it then continues to increase as time goes on.

Also the CPU spikes correlate with the RAM spikes.

CC: @p0mvn

@p0mvn
Member

p0mvn commented Mar 24, 2022

Tagging previous work on this: #1037

@ValarDragon
Member

My first intuition would be to look at state snapshot intervals and pruning intervals.

@p0mvn
Member

p0mvn commented Mar 24, 2022

Another dashboard on the same node:
(screenshot: node metrics dashboard)

Summary:

  • The number of goroutines is stable over the long term

  • The number of open file descriptors is slowly growing; this might just be leveldb compaction

  • Go runtime memory spikes at frequent intervals and only partially goes back to normal

    • This is different from what was observed in Memory leak #1037, where it would spike on epoch and fully go back to normal

@p0mvn
Member

p0mvn commented Mar 24, 2022

@czarcas7ic mentioned that this node is on default pruning. These intervals do suggest that snapshots might be the cause.

@czarcas7ic is going to set up a node with the same specs but no snapshots enabled for comparison

@czarcas7ic
Member Author

Just set them both up; @p0mvn and I will be monitoring them.

@alexanderbez
Contributor

Wouldn't a mem pprof help diagnose what's using so much memory?

@czarcas7ic
Member Author

czarcas7ic commented Mar 25, 2022

@alexanderbez I can take one on all three VMs after they run for 24 hours and see if there is any difference between the various tx-index and snapshot settings. How long do you think the pprof should be taken for?
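For reference, here is a minimal sketch of how those heap snapshots could be collected on a schedule, assuming the node exposes the standard Go pprof HTTP endpoint (e.g. Tendermint's pprof_laddr set to localhost:6060 in config.toml); the address, filenames, and 30-minute interval are assumptions, not project conventions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed pprof address; enable it via pprof_laddr in config.toml.
	const url = "http://localhost:6060/debug/pprof/heap"

	for i := 0; ; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetching heap profile:", err)
			time.Sleep(time.Minute)
			continue
		}

		name := fmt.Sprintf("heap-%03d.pb.gz", i)
		out, err := os.Create(name)
		if err == nil {
			io.Copy(out, resp.Body) // save the profile for `go tool pprof`
			out.Close()
		}
		resp.Body.Close()

		// One snapshot every 30 minutes over the 24-hour run.
		time.Sleep(30 * time.Minute)
	}
}
```

Successive dumps can then be compared with `go tool pprof -base old.pb.gz new.pb.gz` to see which allocation sites grow over the 24-hour window.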

@czarcas7ic
Member Author

czarcas7ic commented Mar 25, 2022

Tx indexer on and snapshots on
(screenshot: RAM usage graph)

Tx indexer off and snapshots on
(screenshot: RAM usage graph)

Tx indexer on and snapshots off
(screenshot: RAM usage graph)

All three showed a continuous increase in RAM. Restarting these nodes brought RAM back to its starting level, and it then proceeded to slowly increase once again.

cc @moniya12

@p0mvn
Member

p0mvn commented Mar 26, 2022

Update:

@czarcas7ic and I saw in pprof that the fast cache accounts for a large share of objects and heap in_use. We discovered that the same cache size of approximately 780K is used for caching both regular nodes and fast nodes.

Since there are several stores, we may have up to 780K fast nodes in cache for each:

"acc","authz","bank","bech32ibc","capability","claim","distribution","epochs","evidence","gamm","gov","ibc","incentives","lockup","mem_capability","mint","params","poolincentives","slashing","staking","superfluid","transfer","txfees","upgrade","wasm"

We made a guess that this might simply be too much across that many stores. As a result, the caches keep growing incrementally as we keep using the nodes, until the process OOMs.
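As a rough back-of-envelope (the per-node byte figure below is an assumption for illustration, not a measured value), the configured limits alone allow a very large cache footprint:

```go
package main

import "fmt"

func main() {
	const (
		stores        = 25      // module stores listed above
		nodesPerStore = 780_000 // cache limit applied to both regular and fast nodes
		bytesPerNode  = 200     // assumed average size of a cached node (key, value, bookkeeping)
	)

	entries := stores * nodesPerStore
	fmt.Printf("up to %d cached nodes ≈ %.1f GiB per cache layer\n",
		entries, float64(entries)*bytesPerNode/float64(1<<30))
}
```

Even at a couple hundred bytes per cached node, 25 stores × 780K entries comes out to several GiB for this cache layer alone, before the rest of the node's working set.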

We decided to hardcode a small value of 10K to see what happens:

(screenshot: memory usage stable after the change)

That seems to have helped. The memory has been stable for a day.

However, setting the fast node cache too low might negatively impact performance.

Next steps:

  • Create a general config option for the fast node cache size (the idea is sketched below)
    • Default it to 100K
    • Find the best value by trial and error
    • Communicate to validators that it can be tuned to their needs:
      • Nodes that serve a lot of queries can set it larger
      • Others can keep the default
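Not the actual iavl implementation, but a minimal sketch of the idea behind a bounded, configurable per-store node cache: once the configured maximum is reached, adding a new entry evicts the least recently used one, so memory is capped by the setting rather than by the working set. The Node type and names here are hypothetical:

```go
package cache

import "container/list"

// Node stands in for a cached IAVL fast node (hypothetical type; the real one
// lives in the iavl package).
type Node struct {
	Key   []byte
	Value []byte
}

// Bounded is a minimal LRU cache: at most maxEntries nodes are kept, so each
// store's cache memory is capped by configuration instead of growing without
// bound.
type Bounded struct {
	maxEntries int
	order      *list.List               // front = most recently used
	items      map[string]*list.Element // key -> element holding a *Node
}

func NewBounded(maxEntries int) *Bounded {
	return &Bounded{
		maxEntries: maxEntries,
		order:      list.New(),
		items:      make(map[string]*list.Element, maxEntries),
	}
}

// Get returns the cached node for key, marking it as recently used.
func (c *Bounded) Get(key []byte) (*Node, bool) {
	elem, ok := c.items[string(key)]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(elem)
	return elem.Value.(*Node), true
}

// Add inserts or refreshes a node, evicting the least recently used entry if
// the configured limit is exceeded.
func (c *Bounded) Add(n *Node) {
	k := string(n.Key)
	if elem, ok := c.items[k]; ok {
		elem.Value = n
		c.order.MoveToFront(elem)
		return
	}
	c.items[k] = c.order.PushFront(n)
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, string(oldest.Value.(*Node).Key))
	}
}
```

With a bound like this, the proposed 100K default caps each store's cache regardless of query volume, and validators who serve many queries can trade memory for hit rate by raising it.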

@alexanderbez
Contributor

Nice work @p0mvn!

@faddat
Member

faddat commented Apr 12, 2022

This seems especially severe in v6, but I don't know why.

I am syncing a node with rocksdb like:

osmosisd start --pruning nothing <- this will be killed by oom reaper
osmosisd start <- this will be killed by oom reaper
osmosisd start --pruning custom --pruning-interval=100 --pruning-keep-every=0 --pruning-keep-recent=362880 --state-sync.snapshot-keep-recent=0 <- testing now

Notably, these use rocksdb and are part of an effort to create a fast, scripted, full-sync (non-statesync) approach to building an archive node.

@alexanderbez
Contributor

Does v7 have various IAVL and DB improvements? Why test on v6?

@p0mvn
Member

p0mvn commented Apr 12, 2022

I believe @faddat is doing a full sync to get rocksdb caught up to the chain. As far as I know, he is one of the first to try that, so there are no publicly available snapshots on rocks yet.

The memory problem on v6 is probably due to the fast IAVL changes and the cache that has been causing issues. We'll need to backport the configurable bytes cache to v6 to mitigate that: osmosis-labs/iavl#40

@faddat
Member

faddat commented Apr 18, 2022

@p0mvn -- so, in the past I was able to do rocks via statesync. Sadly, I never saved one of those; I used state sync to make them quickly, so there really wasn't a need to.

Concerning the memory leak: it is real and seems to grow as chain state grows, but the patch you are writing does seem to handle it. Osmosis still uses a ton of RAM (50+ GB), but it stabilizes there. Also, I'm using a 150MB cache, so it might be lower if I set that value in app.toml lower.

added QmfEuEx4r6sTZmp1odCMrytV4TWvyrn4ZxYHHVSgt1P38M osmo-v6-rocks.squashfs

That's the IPFS CID (1TB) of Osmosis v6, midway through the sync. One thing that I am doing differently this time is making checkpoints of my work, so that if I hit apphash doom, I can go back to them.

@p0mvn p0mvn moved this from In Progress🏃 to Todo 🕒 in Osmosis Chain Development May 9, 2022
@p0mvn p0mvn moved this from Todo 🕒 to Blocked ❌ in Osmosis Chain Development May 9, 2022
@p0mvn
Member

p0mvn commented Jul 15, 2022

I think this can be closed, as there is no further action needed. We can always find this in the archive if any information is needed.

@p0mvn p0mvn closed this as completed Jul 15, 2022
Repository owner moved this from Blocked ❌ to Done ✅ in Osmosis Chain Development Jul 15, 2022