feat(pruner/light): implement light pruning #3388
Conversation
Accidentally included unrelated changes.

Force-pushed from e071029 to abc78f1.
super clear and simple
Need to fix a few things and then it should be g2g.

An extra thing to consider: it looks like the pruner service was enabled for Light nodes in previous releases. It is a no-op operation, but it iterates over headers, so a node can start with a pruner checkpoint whose height != 1. If that's the case, the pruner might skip some old samples, so the pruner checkpoint needs to be reset for LN.
Actually, the Pruner has never been released, so the above is not a problem.
Looks good besides the fx provide.

It would also be good to have some profiles in the PR description of what an LN looks like during a pruning job. The reason I ask for pruning-job profiles is that I'm wondering whether we should adjust the config values for how often the pruner runs for LNs.
Force-pushed from 1c5532d to dd5e191.
After looking at my running node again, I found it in a restart-panic loop. It reached the file descriptor limit and couldn't start. I don't know why it crashed in the first place, but it was likely the same limit issue. Increasing the limit resolved the issue but introduced another one: we are now panicking on the DeleteNode code path, particularly here. Apparently, the nmtNode read from disk does not match any of the constant sizes we define, which suggests some form of DB corruption.
I couldn't recover logs on why the node crashed initially, so I must rerun the test and resync the node. I will give it more file descriptors this time and see how it behaves. I suspect that the number of files grew due to Badger's nature: every deletion in an LSM tree is itself a write until compaction happens, which cleans up both the original write and the deletion. The question is why that compaction didn't clean things up in time.
LN sampling the whole chain on mainnet. Next, I will measure the pruned one.
So far, it looks good, but the resource usage is concerning. My 4 cores are constantly utilized at around 60-80%, which is more or less fine. The other thing is that RAM reached almost 4GB, and I don't know what to do about that yet.
OK, the new pruning round finished in 88.5 minutes, but for some reason the datastore size is almost the same as before pruning, just a tiny bit less. During the pruning process, it went down to 34GB once and back up to 35 again. I am confused; why is there no reduction?
I resynced and repruned, and the disk size does not go down even though the deletion code path is clearly executed (verified through profiling and seeing Badger's delete op taking RAM). The only good news is that the high RAM usage issue I mentioned before is not actually an issue. It's the same confusion we had with Astria, when the process grabbed too much memory that was nonetheless available for reclaim by the OS. At this point, I don't know why the data is not cleaned up or what to do next. Maybe it's time to look into Badger's flesh again...
Force-pushed from d795921 to 3c32f68.
Okay, now I know why: I never actually synced the chain. Changing the availability to an archival one isn't enough (cc @renaynay), or there is some bug that doesn't propagate the archival window properly. EDIT: OK, I found out that I had only been changing this line, but not this. It's so confusing to have a million ways to construct the modules.
The sampling is still in progress after 24 hours, with ~920000 heights sampled, and the store size is ~47GB. I want it to finish before starting pruning.
Sampling is done. |
Since the last comment, the pruning has been running, and it still is. It's extremely slow, and I wonder if we should parallelize it. It would definitely help, because pruning is not bottlenecked by any resource.
The pruning has finished! I can't find the exact time the whole process took, as I had to restart the node several times, and it does not log the elapsed time on node stop. I don't want to rerun it just to learn the exact time, but what's clear is that it took more than 24 hours. The little (every 5 minutes) pruning rounds take ~0.5s.

Thinking more about parallelization to speed up historical pruning: I don't think it's worth it atm, even though I have a little urge to implement it.
A pruning time of ~0.5 seconds is totally fine and sufficient. The pruner will free up space much faster than new samples are created, which is the most important thing. Pruning months of data is a once-in-a-lifetime thing, and it is fine if it takes even days to prune everything for a long-running Light node.
Force-pushed from 3cc4b46 to e6fdd7f.
Implements light pruning for the pre-Shwap era. It recursively traverses the tree and deletes the NMT nodes as it goes.

- Pruning the whole history takes more than 24 hours.
- Pruning recent heights (every 5 mins) takes `~0.5s`.
- Historical pruning reduced disk usage from ~62GB to ~38GB.
- RAM usage during active pruning is stable at `~70MB`.