fix: SocketError torch timeout #135

Merged
merged 4 commits into from
Nov 6, 2023

Collaborator

@ishaansehgal99 ishaansehgal99 commented Nov 5, 2023

torch.distributed.init_process_group has a default timeout of 30 minutes.

While the worker is idle and listening for instructions, the following error is thrown after 30 minutes:
[1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7f3ff9fb295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x7d (0x7f3ff9f6b7cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xf8 (0x7f3fc834c858 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a (0x7f3fc834d4ca in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x84 (0x7f3fc834d594 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) + 0x1fc (0x7f3f8c6e443c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x530 (0x7f3f8c6e7c10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x4c2 (0x7f3f8c6f5922 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x4c1a310 (0x7f3fc82eb310 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4c23bfb (0x7f3fc82f4bfb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4c43ab3 (0x7f3fc8314ab3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb8c97a (0x7f3fceaa197a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x39bb46 (0x7f3fce2b0b46 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15cc9e (0x5572975c2c9e in /usr/bin/python)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5572975b972b in /usr/bin/python)
frame #18: <unknown function> + 0x16b1eb (0x5572975d11eb in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x640a (0x5572975b175a in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #21: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #25: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #26: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x6cd (0x5572975aba1d in /usr/bin/python)
frame #32: <unknown function> + 0x142176 (0x5572975a8176 in /usr/bin/python)
frame #33: PyEval_EvalCode + 0x86 (0x55729769dc56 in /usr/bin/python)
frame #34: <unknown function> + 0x264b18 (0x5572976cab18 in /usr/bin/python)
frame #35: <unknown function> + 0x25d96b (0x5572976c396b in /usr/bin/python)
frame #36: <unknown function> + 0x264865 (0x5572976ca865 in /usr/bin/python)
frame #37: _PyRun_SimpleFileObject + 0x1a8 (0x5572976c9d48 in /usr/bin/python)
frame #38: _PyRun_AnyFileObject + 0x43 (0x5572976c9a43 in /usr/bin/python)
frame #39: Py_RunMain + 0x2be (0x5572976bac3e in /usr/bin/python)
frame #40: Py_BytesMain + 0x2d (0x557297690bcd in /usr/bin/python)
frame #41: <unknown function> + 0x29d90 (0x7f4018b72d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #42: __libc_start_main + 0x80 (0x7f4018b72e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #43: _start + 0x25 (0x557297690ac5 in /usr/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.

After this error, every attempt by the worker to reconnect and establish a new session with the master fails, even after pod restarts. Any new connection attempt from the worker is met with this error:

root@llama-2-13b-chat-pod-1:/workspace/llama/llama-2# torchrun --nnodes 2 --nproc_per_node 1 --rdzv_endpoint 10.224.0.181:29500 --master_port 29500 --rdzv_backend c10d inference-api.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+4136153', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 635, in run
    raise RendezvousClosedError()
torch.distributed.elastic.rendezvous.api.RendezvousClosedError

Cleaning up the process group and reinitializing it on the worker side does not resolve the issue either. The state between the worker and the master is left inconsistent, and only a master restart recovers it.
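
For reference, the worker-side recovery attempt described above amounts to roughly the following. This is a minimal sketch, not the code from this repository: `reinit_process_group` is a hypothetical helper, and `dist` is assumed to be the `torch.distributed` module, passed in as an argument so the logic can be read and exercised without a running cluster.

```python
from datetime import timedelta


def reinit_process_group(dist, backend="nccl", timeout=timedelta(minutes=30)):
    """Tear down the default process group, if any, and create a new one.

    `dist` is assumed to be the `torch.distributed` module (hypothetical
    wiring for this sketch; the real worker code may differ).
    """
    # Destroy the existing default group only if one is currently initialized.
    if dist.is_initialized():
        dist.destroy_process_group()
    # Re-create the default group with the same backend and timeout.
    dist.init_process_group(backend=backend, timeout=timeout)
```

As noted above, this kind of worker-side reset was not enough on its own, because the master side still holds the closed rendezvous state.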

The best solution here is to increase the timeout so that the connection is not closed while the worker is idle. I have included a Dockerfile fix in this PR.
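
As an illustration of raising the timeout from application code, the sketch below passes a longer `timeout` to `torch.distributed.init_process_group`. It is not necessarily what the Dockerfile change in this PR does. The environment variable name `PROCESS_GROUP_TIMEOUT_MINUTES` and both helper functions are assumptions made for this example only.

```python
import os
from datetime import timedelta

# Hypothetical environment variable name, chosen for this sketch only.
TIMEOUT_ENV_VAR = "PROCESS_GROUP_TIMEOUT_MINUTES"


def process_group_timeout(env=os.environ, default_minutes=30):
    """Return the process-group timeout as a timedelta.

    Falls back to `default_minutes` when the variable is not set.
    """
    minutes = float(env.get(TIMEOUT_ENV_VAR, default_minutes))
    if minutes <= 0:
        raise ValueError("timeout must be positive")
    return timedelta(minutes=minutes)


def init_distributed(backend="nccl"):
    """Initialize the default process group with the configured timeout."""
    # Imported lazily so the helper above works without PyTorch installed.
    import torch.distributed as dist

    # `timeout` is a documented keyword argument of init_process_group
    # and takes a datetime.timedelta.
    dist.init_process_group(backend=backend, timeout=process_group_timeout())
```

With this layout, the container image could set the variable (for example to a value covering the longest expected idle period) and the script would call `init_distributed()` once at startup under `torchrun`.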

@Fei-Guo Fei-Guo merged commit 1d854e3 into main Nov 6, 2023
6 checks passed
@Fei-Guo Fei-Guo deleted the Ishaan/fix-timeout branch November 6, 2023 01:26