SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f4c413a6d00> returned NULL without setting an exception #6

Open
Pengxiang-zhao opened this issue Jan 8, 2025 · 1 comment

Pengxiang-zhao commented Jan 8, 2025

Thank you for your project and contributions. When I tried the gradient accumulation method, the code raised an error at line 425 of Inf-CLIP/inf_clip/train/engine.py, in the call to _loss.backward().

The traceback is as follows:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/conda/lib/python3.10/site-packages/torch/random.py", line 165, in fork_rng
    yield
  File "/workspace/Inf-CLIP/inf_clip/train/engine.py", line 425, in accum_forward_backward
    _loss.backward()
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f4c413a6d00> returned NULL without setting an exception

Could you please help me understand the cause of this issue?

@clownrat6
Member

After extensive searching, this appears to be an out-of-memory (OOM) issue. For reference, running scripts/cc3m/clip_vit-b-32_bs16k.sh requires 17 GB of memory. Try running your script with a smaller LOCAL_BATCH_SIZE.
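
With gradient accumulation, a smaller LOCAL_BATCH_SIZE does not have to change the effective batch size, since the number of accumulation steps can grow to compensate. Below is a minimal sketch of that arithmetic, assuming the global batch equals local batch × number of processes × accumulation steps. The function accumulation_steps, its parameter names, and the example values (16384, 8 processes) are hypothetical illustrations and are not taken from the Inf-CLIP scripts.

```python
def accumulation_steps(global_batch: int, local_batch: int, num_processes: int) -> int:
    """Return how many micro-batches each process accumulates per optimizer step.

    Assumes: global_batch == local_batch * num_processes * accumulation_steps.
    This is a generic relation, not code from Inf-CLIP.
    """
    if min(global_batch, local_batch, num_processes) <= 0:
        raise ValueError("all arguments must be positive")
    samples_per_micro_step = local_batch * num_processes
    if global_batch % samples_per_micro_step != 0:
        raise ValueError(
            "global_batch must be divisible by local_batch * num_processes"
        )
    return global_batch // samples_per_micro_step


# Halving the local batch doubles the accumulation steps; the effective
# (global) batch size stays the same while each micro-batch needs less memory.
for local_batch in (256, 128, 64):
    steps = accumulation_steps(
        global_batch=16384, local_batch=local_batch, num_processes=8
    )
    print(f"local_batch={local_batch} -> accumulation_steps={steps}")
```

If the error disappears at a smaller local batch, that would be consistent with the OOM explanation, though it does not prove it.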
