Failed vault clone or vault pull on N1 causes N2 to crash #324

Open
CMCDragonkai opened this issue Nov 1, 2024 · 8 comments

@CMCDragonkai
Member

Describe the bug

INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745:1314].QUICClient.QUICConnection e33f1eb57d2cdb5bdfa6b289aa1d01fd41ed1f47.QUICStream 36:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745:1314].QUICClient.QUICConnection e33f1eb57d2cdb5bdfa6b289aa1d01fd41ed1f47.QUICStream 36:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 45:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 45:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesClosestActiveConnectionsGet)
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesClosestActiveConnectionsGet)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 53:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 53:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesClosestLocalNodesGet)
TypeError: Invalid state: WritableStream is closed

When N1 tries to clone or pull the vault, it sometimes fails with an ErrorRPCTimeout. The cause is unknown; it could be a bug, state corruption, or something else.

After a little bit of time, the agent on N2 reports: TypeError: Invalid state: WritableStream is closed.

This then causes the entire agent to shut down. I suspect this has common factors with #115, #185, #198.

To Reproduce

  1. This was done with @CDeltakai, whose version was ["0.10.0","1.14.0","1","1"], but the version doesn't appear to be the problem.
  2. My agent was running from staging, version ["0.13.0","1.15.1","1","1"].
  3. Run a pull/clone of a vault.

Expected behavior

Regardless of what is happening, I believe the network streams are not being properly garbage collected or handled. It doesn't matter if the client is broken. The agent that is serving the vault SHOULD NOT FAIL.
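
To illustrate the expectation above, here is a minimal sketch (not Polykey's actual code) of a write that cannot take the serving process down. It assumes the web streams `WritableStream` global available in Node.js 18+; `safeWrite` and `demo` are hypothetical names introduced only for this example. Per the Streams standard, writing to a stream that is already closed rejects with a TypeError, which appears to match the error in the log above. If such a rejection is never handled, Node.js terminates the process by default.

```typescript
// Hypothetical helper, not part of Polykey: write one chunk and report
// failure as a boolean instead of letting the rejection escape.
async function safeWrite(
  stream: WritableStream<Uint8Array>,
  chunk: Uint8Array,
): Promise<boolean> {
  const writer = stream.getWriter();
  try {
    await writer.write(chunk);
    return true;
  } catch (e) {
    // A closed or errored stream rejects the write; the caller can log the
    // failure and clean up this one stream rather than crash.
    return false;
  } finally {
    // Always release the lock so the stream is not left locked.
    writer.releaseLock();
  }
}

// Hypothetical demo: the first write succeeds, then the stream is closed
// (standing in for a peer that went away) and the second write fails softly.
async function demo(): Promise<[boolean, boolean]> {
  const received: Uint8Array[] = [];
  const stream = new WritableStream<Uint8Array>({
    write(chunk) {
      received.push(chunk);
    },
  });
  const first = await safeWrite(stream, new Uint8Array([1]));
  const closer = stream.getWriter();
  await closer.close();
  closer.releaseLock();
  const second = await safeWrite(stream, new Uint8Array([2]));
  return [first, second];
}
```

The design choice here is only that a failure on one stream is contained to that stream; whether the real crash comes from an unhandled rejection of this kind is an assumption that still needs to be confirmed.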

I'm pretty sure this is similar to #198.

The point is that something is causing ErrorRPCTimeout, and it seemed to be fixed only by a full state reset. This implies that some amount of state corruption is occurring too.

Notify maintainers

@tegefaulkes @aryanjassal

@CMCDragonkai added the bug (Something isn't working) label on Nov 1, 2024
@CMCDragonkai
Member Author

This, along with #198, is definitely due to some sort of resource leak coming out of node-to-node connections/streams.

@CMCDragonkai
Member Author

The only way we were able to proceed was to delete the entire state of the Polykey client node and start a new node, which means a new NodeId too.

@CMCDragonkai
Member Author

This issue focuses on the inter-node behaviour, which is quite critical.

However, the state reset indicates that there's some corruption of the state. It's not clear where it is or what would cause the ErrorRPCTimeout; we need a bit more detail on this.

@aryanjassal changed the title from "Failed Vault Clone/Pull on N1 causes N2 to crash with TypeError: Invalid state: WritableStream is closed" to "Failed vault clone or vault pull on N1 causes N2 to crash" on Nov 21, 2024
Member

Due to not having access to the corrupted Polykey state or another reliable method to replicate this issue, it is really challenging to pinpoint the issue. This will need an in-depth investigation.

Member Author

Try testing it with the other team members' PK. Don't just do a self pull/clone. There are resource leaks in the nodes domain at the moment anyway.

Member Author

Also, you can always run different versions of PK too: use the nixpkgs pin to get different versions and run them, or clone them separately.

@tegefaulkes
Contributor

I think this was addressed when we fixed the leaking errors while addressing MatrixAI/js-quic#128. @aryanjassal, you'll need to try to recreate the problem here and see if it still happens. If not, then we can mark this as done.

An easy way to trigger a timeout when cloning/pulling is to try to clone/pull a vault with a few megabytes in it.

I'll be assigning this to you @aryanjassal
