Thank you for your project and contributions. When I was trying the gradient accumulation method, I hit an error at line 425 in Inf-CLIP/inf_clip/train/engine.py during _loss.backward().
The traceback is as follows:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/conda/lib/python3.10/site-packages/torch/random.py", line 165, in fork_rng
    yield
  File "/workspace/Inf-CLIP/inf_clip/train/engine.py", line 425, in accum_forward_backward
    _loss.backward()
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f4c413a6d00> returned NULL without setting an exception
Could you please help me understand the cause of this issue?
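For context, the traceback points at a per-chunk backward call made inside torch.random.fork_rng within accum_forward_backward. The sketch below is a minimal, hypothetical reconstruction of that pattern (the model, loss, and chunking here are assumptions for illustration, not the actual Inf-CLIP engine code):

# Minimal sketch of the gradient-accumulation pattern the traceback suggests.
# All names and shapes are hypothetical; only the structure (per-chunk backward
# inside torch.random.fork_rng) mirrors the frames shown above.
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

batch = torch.randn(8, 16)
num_chunks = 4  # split the local batch into smaller micro-batches

optimizer.zero_grad()
for chunk in batch.chunk(num_chunks):
    # fork_rng isolates the RNG state per chunk, matching the fork_rng frame in the traceback
    with torch.random.fork_rng():
        out = model(chunk)
        _loss = out.pow(2).mean() / num_chunks
        _loss.backward()  # the call where the reported SystemError was raised
optimizer.step()

Each chunk's gradients are accumulated into the parameters' .grad buffers, so peak memory is set by the chunk size rather than the full local batch.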
After some investigation, this appears to be an out-of-memory (OOM) issue. As a reference, running scripts/cc3m/clip_vit-b-32_bs16k.sh requires about 17 GB of memory. You can run your script with a smaller LOCAL_BATCH_SIZE.
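To check whether a smaller LOCAL_BATCH_SIZE actually brings the peak below that 17 GB reference, one option is to log the peak CUDA allocation around each accumulation step. The helper below is an assumption for illustration and is not part of Inf-CLIP:

# Hypothetical helper: log and reset the peak CUDA allocation to confirm
# memory pressure before/after lowering LOCAL_BATCH_SIZE.
import torch

def log_peak_memory(tag: str) -> None:
    if torch.cuda.is_available():
        peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
        print(f"[{tag}] peak CUDA memory: {peak_gb:.2f} GB")
        torch.cuda.reset_peak_memory_stats()

# Example usage inside the training loop (placement is up to you):
# log_peak_memory("before step")
# ... forward / backward / optimizer.step() ...
# log_peak_memory("after step")

If the logged peak approaches the GPU's capacity, reducing LOCAL_BATCH_SIZE (or the per-chunk size used for accumulation) is the most direct fix.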