
Potential RAM Leak #1140

Closed · Tracked by #1172
czarcas7ic opened this issue Mar 24, 2022 · 15 comments
Labels: T:task ⚙️ A task belongs to a story

@czarcas7ic
Member

czarcas7ic commented Mar 24, 2022

Currently running a mainnet node and visualizing data with Prometheus, due to reports of OOM on machines with more than the recommended amount of RAM.

After ~20 hours of running, RAM usage is in fact spiking in cycles and slowly increasing:
(screenshot: RAM usage graph, taken Mar 24, 2022 at 12:17 PM)

On Discord, Geo reports OOMing with 64GB of RAM on an archive node.

Something else I have noticed: restarting the node brings RAM back to a normal level, but it then continues to increase as time goes on.

Also the CPU spikes correlate with the RAM spikes.

CC: @p0mvn

@p0mvn
Member

p0mvn commented Mar 24, 2022

Tagging previous work on this: #1037

@ValarDragon
Member

My first intuition would be to look at state snapshot intervals and pruning intervals.

@p0mvn
Member

p0mvn commented Mar 24, 2022

Another dashboard on the same node:
(screenshot: node metrics dashboard)

Summary:

  • The number of goroutines is stable over the long term

  • The number of open file descriptors is slowly growing; this might just be leveldb compaction

  • Go runtime memory spikes at frequent intervals and only partially goes back to normal

    • This is different from what was observed in Memory leak #1037, where it would spike on epoch and fully go back to normal

@p0mvn
Member

p0mvn commented Mar 24, 2022

@czarcas7ic mentioned that this node is on default pruning. These intervals do suggest that snapshots might be the cause.

@czarcas7ic is going to set up a node with the same specs but no snapshots enabled for comparison

@czarcas7ic
Member Author

Just set them both up; @p0mvn and I will be monitoring them.

@alexanderbez
Contributor

Wouldn't a mem pprof help diagnose what's using so much memory?

@czarcas7ic
Member Author

czarcas7ic commented Mar 25, 2022

@alexanderbez I can take one on all three VMs after they run for 24 hours and see if there is any difference between the various tx-index and snapshot settings. How long do you think the pprof should be taken for?
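For reference, here is a minimal sketch of how those heap snapshots could be collected on a schedule, assuming the node exposes the standard Go pprof HTTP endpoint (e.g. Tendermint's pprof_laddr set to localhost:6060 in config.toml); the address, filenames, and 30-minute interval are assumptions, not project conventions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed pprof address; enable it via pprof_laddr in config.toml.
	const url = "http://localhost:6060/debug/pprof/heap"

	for i := 0; ; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetching heap profile:", err)
			time.Sleep(time.Minute)
			continue
		}

		name := fmt.Sprintf("heap-%03d.pb.gz", i)
		out, err := os.Create(name)
		if err == nil {
			io.Copy(out, resp.Body) // save the profile for `go tool pprof`
			out.Close()
		}
		resp.Body.Close()

		// One snapshot every 30 minutes over the 24-hour run.
		time.Sleep(30 * time.Minute)
	}
}
```

Successive dumps can then be compared with `go tool pprof -base old.pb.gz new.pb.gz` to see which allocation sites grow over the 24-hour window.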

@czarcas7ic
Member Author

czarcas7ic commented Mar 25, 2022

Tx indexer on and snapshots on
(screenshot: RAM usage graph)

Tx indexer off and snapshots on
(screenshot: RAM usage graph)

Tx indexer on and snapshots off
(screenshot: RAM usage graph)

All three showed a continuous increase in RAM. Restarting these nodes brought RAM back to its starting level, and it then proceeded to slowly increase once again.

cc @moniya12

@p0mvn
Member

p0mvn commented Mar 26, 2022

Update:

@czarcas7ic and I saw in pprof that the fast cache accounts for a large share of objects and heap in_use. We discovered that the same cache size of approximately 780K is used for caching both regular nodes and fast nodes.

Since there are several stores, we may have up to 780K fast nodes in cache for each:

"acc","authz","bank","bech32ibc","capability","claim","distribution","epochs","evidence","gamm","gov","ibc","incentives","lockup","mem_capability","mint","params","poolincentives","slashing","staking","superfluid","transfer","txfees","upgrade","wasm"

We made a guess that this might simply be too much across that many stores. As a result, the caches keep growing incrementally as we keep using the nodes, until the process OOMs.
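As a rough back-of-envelope (the per-node byte figure below is an assumption for illustration, not a measured value), the configured limits alone allow a very large cache footprint:

```go
package main

import "fmt"

func main() {
	const (
		stores        = 25      // module stores listed above
		nodesPerStore = 780_000 // cache limit applied to both regular and fast nodes
		bytesPerNode  = 200     // assumed average size of a cached node (key, value, bookkeeping)
	)

	entries := stores * nodesPerStore
	fmt.Printf("up to %d cached nodes ≈ %.1f GiB per cache layer\n",
		entries, float64(entries)*bytesPerNode/float64(1<<30))
}
```

Even at a couple hundred bytes per cached node, 25 stores × 780K entries comes out to several GiB for this cache layer alone, before the rest of the node's working set.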

We decided to hardcode a small value of 10K to see what happens:

(screenshot: memory usage stable after the change)

That seems to have helped. The memory has been stable for a day.

However, setting the fast node cache too low might negatively impact performance.

Next steps:

  • Create a general config option for the fast node cache size (the idea is sketched below)
    • Default it to 100K
    • Find the best value by trial and error
    • Communicate to validators that it can be tuned to their needs:
      • Nodes that serve a lot of queries can set it larger
      • Others can keep the default
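Not the actual iavl implementation, but a minimal sketch of the idea behind a bounded, configurable per-store node cache: once the configured maximum is reached, adding a new entry evicts the least recently used one, so memory is capped by the setting rather than by the working set. The Node type and names here are hypothetical:

```go
package cache

import "container/list"

// Node stands in for a cached IAVL fast node (hypothetical type; the real one
// lives in the iavl package).
type Node struct {
	Key   []byte
	Value []byte
}

// Bounded is a minimal LRU cache: at most maxEntries nodes are kept, so each
// store's cache memory is capped by configuration instead of growing without
// bound.
type Bounded struct {
	maxEntries int
	order      *list.List               // front = most recently used
	items      map[string]*list.Element // key -> element holding a *Node
}

func NewBounded(maxEntries int) *Bounded {
	return &Bounded{
		maxEntries: maxEntries,
		order:      list.New(),
		items:      make(map[string]*list.Element, maxEntries),
	}
}

// Get returns the cached node for key, marking it as recently used.
func (c *Bounded) Get(key []byte) (*Node, bool) {
	elem, ok := c.items[string(key)]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(elem)
	return elem.Value.(*Node), true
}

// Add inserts or refreshes a node, evicting the least recently used entry if
// the configured limit is exceeded.
func (c *Bounded) Add(n *Node) {
	k := string(n.Key)
	if elem, ok := c.items[k]; ok {
		elem.Value = n
		c.order.MoveToFront(elem)
		return
	}
	c.items[k] = c.order.PushFront(n)
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, string(oldest.Value.(*Node).Key))
	}
}
```

With a bound like this, the proposed 100K default caps each store's cache regardless of query volume, and validators who serve many queries can trade memory for hit rate by raising it.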

@alexanderbez
Contributor

Nice work @p0mvn!

@faddat
Member

faddat commented Apr 12, 2022

This seems especially severe in v6, but I don't know why.

I am syncing a node with rocksdb like:

osmosisd start --pruning nothing <- this will be killed by oom reaper
osmosisd start <- this will be killed by oom reaper
osmosisd start --pruning custom --pruning-interval=100 --pruning-keep-every=0 --pruning-keep-recent=362880 --state-sync.snapshot-keep-recent=0 <- testing now

Notably, these use rocksdb and are part of an effort to create a fast, scripted, full-sync (non-statesync) approach to building an archive node.

@alexanderbez
Contributor

Does v7 have various IAVL and DB improvements? Why test on v6?

@p0mvn
Member

p0mvn commented Apr 12, 2022

I believe @faddat is doing a full sync to get rocksdb caught up to the chain. As far as I know, he is one of the first to try that, so there are no publicly available snapshots on rocks yet.

The memory problem on v6 is probably due to the fast IAVL changes and the cache that has been causing issues. We'll need to backport the configurable bytes cache to v6 to mitigate that: osmosis-labs/iavl#40

@faddat
Member

faddat commented Apr 18, 2022

@p0mvn -- so, in the past I was able to do rocks via statesync. Sadly, I never saved one of those; I used state sync to make them quickly, so there really wasn't a need to.

Concerning the memory leak: it is real and seems to grow as chain state grows, but the patch you are writing does seem to handle it. Osmosis still uses a ton of RAM (50+ GB), but it stabilizes there. Also, I'm using a 150MB cache, so it might be lower if I set that value in app.toml lower.

added QmfEuEx4r6sTZmp1odCMrytV4TWvyrn4ZxYHHVSgt1P38M osmo-v6-rocks.squashfs

That's the IPFS CID (1TB) of Osmosis v6, midway through the sync. One thing that I am doing differently this time is making checkpoints of my work, so that if I hit apphash doom, I can go back to them.

@p0mvn p0mvn moved this from In Progress🏃 to Todo 🕒 in Osmosis Chain Development May 9, 2022
@p0mvn p0mvn moved this from Todo 🕒 to Blocked ❌ in Osmosis Chain Development May 9, 2022
@p0mvn
Member

p0mvn commented Jul 15, 2022

I think this can be closed, as there is no further action needed. We can always find this in the archive if any information is needed.

@p0mvn p0mvn closed this as completed Jul 15, 2022
Repository owner moved this from Blocked ❌ to Done ✅ in Osmosis Chain Development Jul 15, 2022