Thank you for your project and contributions. When I was trying the gradient accumulation method, I hit an error at line 425 in Inf-CLIP/inf_clip/train/engine.py during _loss.backward().
The traceback is as follows:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/conda/lib/python3.10/site-packages/torch/random.py", line 165, in fork_rng
    yield
  File "/workspace/Inf-CLIP/inf_clip/train/engine.py", line 425, in accum_forward_backward
    _loss.backward()
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f4c413a6d00> returned NULL without setting an exception
Could you please help me understand the cause of this issue?
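For context, the traceback points at a per-chunk backward call made inside torch.random.fork_rng within accum_forward_backward. The sketch below is a minimal, hypothetical reconstruction of that pattern (the model, loss, and chunking here are assumptions for illustration, not the actual Inf-CLIP engine code):

# Minimal sketch of the gradient-accumulation pattern the traceback suggests.
# All names and shapes are hypothetical; only the structure (per-chunk backward
# inside torch.random.fork_rng) mirrors the frames shown above.
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

batch = torch.randn(8, 16)
num_chunks = 4  # split the local batch into smaller micro-batches

optimizer.zero_grad()
for chunk in batch.chunk(num_chunks):
    # fork_rng isolates the RNG state per chunk, matching the fork_rng frame in the traceback
    with torch.random.fork_rng():
        out = model(chunk)
        _loss = out.pow(2).mean() / num_chunks
        _loss.backward()  # the call where the reported SystemError was raised
optimizer.step()

Each chunk's gradients are accumulated into the parameters' .grad buffers, so peak memory is set by the chunk size rather than the full local batch.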
After some investigation, this appears to be an out-of-memory (OOM) issue. As a reference, running scripts/cc3m/clip_vit-b-32_bs16k.sh requires about 17 GB of memory. You can run your script with a smaller LOCAL_BATCH_SIZE.
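To check whether a smaller LOCAL_BATCH_SIZE actually brings the peak below that 17 GB reference, one option is to log the peak CUDA allocation around each accumulation step. The helper below is an assumption for illustration and is not part of Inf-CLIP:

# Hypothetical helper: log and reset the peak CUDA allocation to confirm
# memory pressure before/after lowering LOCAL_BATCH_SIZE.
import torch

def log_peak_memory(tag: str) -> None:
    if torch.cuda.is_available():
        peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
        print(f"[{tag}] peak CUDA memory: {peak_gb:.2f} GB")
        torch.cuda.reset_peak_memory_stats()

# Example usage inside the training loop (placement is up to you):
# log_peak_memory("before step")
# ... forward / backward / optimizer.step() ...
# log_peak_memory("after step")

If the logged peak approaches the GPU's capacity, reducing LOCAL_BATCH_SIZE (or the per-chunk size used for accumulation) is the most direct fix.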