Replies: 3 comments 9 replies
Can you share the dial part of the code? Maybe I can help.
How did you get so many nodes? Is it just a matter of having the budget? I am working on a project that might scale at some point, so I would love to hear your insight.
How did you get so many peers? Are you just spinning up containers?
I am trying to scale my libp2p node to see how large a network I can support. I am using libp2p for a simple dkg protocol. In normal tests with a smaller network size (<=100) it works just fine, but as soon as I scale to 200 or more peers, the nodes start throwing errors when attempting to connect to peers. The errors are mostly i/o timeouts and failed dials to peers; sometimes I also get security-protocol-specific errors. Here are some example errors I have seen on a node trying to establish a connection to another node:
The 2nd error also showed up when I was using TLS instead.
The above errors relate to this line of the code :
And in some cases, when the connection does succeed, I get errors opening a stream: NewStream(): timed out: context deadline exceeded
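For dial and stream timeouts like these, one common mitigation is to cap how many dials run at the same time and give every attempt its own deadline, rather than opening all connections at once. The following is only a minimal sketch of that idea using the standard library. The `dialFunc` callback, `dialAll`, `dialWithRetry`, and `demo` are hypothetical names for illustration; `dialFunc` stands in for whatever connect or stream-open call the node really makes and is not the libp2p API. The limits used (16 in flight, 3 attempts) are arbitrary assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// dialFunc is a hypothetical stand-in for the real connect or stream-open
// call; it is not part of the libp2p API.
type dialFunc func(ctx context.Context, peer string) error

// dialWithRetry tries one peer up to `attempts` times. Every attempt gets its
// own timeout, and the wait between attempts doubles each time.
func dialWithRetry(ctx context.Context, peer string, dial dialFunc,
	attempts int, perAttempt, backoff time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err = dial(attemptCtx, peer)
		cancel()
		if err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("dial %s: %w", peer, err)
}

// dialAll dials every peer with at most maxInFlight peers being worked on at
// once. It returns the last error for each peer that never connected.
func dialAll(ctx context.Context, peers []string, dial dialFunc,
	maxInFlight, attempts int, perAttempt, backoff time.Duration) map[string]error {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	failed := make(map[string]error)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, p := range peers {
		wg.Add(1)
		go func(peer string) {
			defer wg.Done()
			sem <- struct{}{}        // take a slot (held across retries)
			defer func() { <-sem }() // give it back
			if err := dialWithRetry(ctx, peer, dial, attempts, perAttempt, backoff); err != nil {
				mu.Lock()
				failed[peer] = err
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return failed
}

// demo dials 300 fake peers through a simulated dialer that fails each peer's
// first attempt. It reports how many peers failed and the peak number of
// dials that were running at the same time.
func demo() (failedPeers int, peak int) {
	var mu sync.Mutex
	inFlight := 0
	seen := map[string]bool{}
	flaky := func(ctx context.Context, peer string) error {
		mu.Lock()
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		first := !seen[peer]
		seen[peer] = true
		mu.Unlock()
		time.Sleep(time.Millisecond) // pretend to do network work
		mu.Lock()
		inFlight--
		mu.Unlock()
		if first {
			return errors.New("simulated i/o timeout")
		}
		return nil
	}
	peers := make([]string, 300)
	for i := range peers {
		peers[i] = fmt.Sprintf("peer-%d", i)
	}
	failed := dialAll(context.Background(), peers, flaky, 16, 3, time.Second, time.Millisecond)
	return len(failed), peak
}

func main() {
	failedPeers, peak := demo()
	fmt.Printf("failed peers: %d, peak concurrent dials: %d\n", failedPeers, peak)
}
```

In this sketch a peer keeps its slot while it waits between retries, which keeps the code short; releasing the slot during the backoff would let other peers proceed sooner at the cost of a little more bookkeeping.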
I have increased all possible limits as far as I could find in documentation for libp2p, here's my node config :
And here's the msg send logic :
Here's what I did on the infrastructure setup:
Set ulimit on the actual machines to a pretty large value, in case the system fd limits were causing issues.
For the dkg protocol itself, the interactions between the peers are pretty simple:
For smaller networks (<=100), all of the above is more than enough and it works without any issues; I don't even need to change the ulimits on the node or set the resource manager limits for it to work. But once we scale to a 300-node setup, the system just can't handle it, even with all the increased limits. The nodes start to fail at the first step of the protocol itself, though not all of them show the error, only about 30% of the nodes. I am guessing a node can't handle 300 connections simultaneously. I do not think it's an infra problem anymore, as the VMs are pretty beefy and so is the bandwidth. My best guess is that I am not doing something right in the code. I have already tried the obvious things: ensuring I am closing streams, ensuring contexts are set properly, adding retry logic for connections, etc. I also confirmed the nodes are still accessible by exposing an HTTP server in the same node process: I was able to hit the HTTP endpoint on the receiver node from the sender node without issues while the p2p connection kept failing.
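On the ulimit point, a limit raised in a shell does not necessarily apply to the node process (for example, when the process is started by a service manager), so it can be worth logging the effective limit from inside the process at startup. This is a minimal sketch for Linux/macOS using the standard `syscall` package. `fdLimits` and `fdsSuffice` are hypothetical helper names, and the sizing rule (one descriptor per connection plus a fixed reserve) is a rough assumption, not a libp2p rule.

```go
package main

import (
	"fmt"
	"syscall"
)

// fdLimits returns the soft and hard limits on open file descriptors
// (RLIMIT_NOFILE) as seen by the current process. Linux/macOS only.
func fdLimits() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

// fdsSuffice is a rough check (an assumption, not a libp2p rule): one
// descriptor per expected connection plus a reserve for logs, files, etc.
func fdsSuffice(soft uint64, expectedConns, reserve int) bool {
	return soft >= uint64(expectedConns+reserve)
}

func main() {
	soft, hard, err := fdLimits()
	if err != nil {
		fmt.Println("could not read RLIMIT_NOFILE:", err)
		return
	}
	fmt.Printf("open-file limit: soft=%d hard=%d\n", soft, hard)
	if !fdsSuffice(soft, 300, 100) {
		fmt.Println("warning: soft limit looks low for a 300-peer network")
	}
}
```

If the value printed here is much lower than what was configured on the machine, the limit is being reset somewhere between the shell and the process.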
Version Info :