segfault in abseil 20240116.2 with TensorFlow 2.16 and PyTorch 2.3 #79

Open · njzjz opened this issue on Jul 4, 2024 · 12 comments
Labels: question (Further information is requested)

Comments

@njzjz (Member) commented Jul 4, 2024

Comment:

In conda-forge/deepmd-kit-feedstock#78, I am building a program linked with TensorFlow and PyTorch. When using the latest versions of both, i.e. TensorFlow 2.16 and PyTorch 2.3, there is a segmentation fault in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag, in both conda-forge images and a local environment, as shown below:

(gdb) where
#0  0x0000155541a60e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_flags_reflection.so.2401.0.0
#1  0x0000155541a625c1 in absl::lts_20240116::flags_internal::RegisterCommandLineFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_flags_reflection.so.2401.0.0
#2  0x0000155541a80079 in _GLOBAL__sub_I_flags.cc ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_log_flags.so.2401.0.0
#3  0x0000155555525237 in call_init (env=0x55555567dbf0, argv=0x7fffffffad58, argc=3, l=<optimized out>) at dl-init.c:74
#4  call_init (l=<optimized out>, argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-init.c:26
#5  0x000015555552532d in _dl_init (main_map=0x555555780eb0, argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-init.c:121
#6  0x00001555555215c2 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=operate@entry=0x15555552bf50 <call_dl_init>,
    args=args@entry=0x7fffffffa290) at dl-catch.c:211
#7  0x000015555552beec in dl_open_worker (a=a@entry=0x7fffffffa440) at dl-open.c:827
#8  0x0000155555521523 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffa420,
    operate=operate@entry=0x15555552be50 <dl_open_worker>, args=args@entry=0x7fffffffa440) at dl-catch.c:237
#9  0x000015555552c2e4 in _dl_open (file=0x555555780cc0 "/home/jz748/anaconda3/envs/test-deepmd-build/lib/deepmd_lmp/dpplugin.so",
    mode=<optimized out>, caller_dlopen=0x155550f40916 <LAMMPS_NS::plugin_load(char const*, LAMMPS_NS::LAMMPS*)+166>, nsid=<optimized out>,
    argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-open.c:903
#10 0x000015554fcc7714 in dlopen_doit () from /lib64/libc.so.6
#11 0x0000155555521523 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffa630, operate=0x15554fcc76b0 <dlopen_doit>,
    args=0x7fffffffa6f0) at dl-catch.c:237
#12 0x0000155555521679 in _dl_catch_error (objname=0x7fffffffa698, errstring=0x7fffffffa6a0, mallocedp=0x7fffffffa697, operate=<optimized out>,
    args=<optimized out>) at dl-catch.c:256
#13 0x000015554fcc71f3 in _dlerror_run () from /lib64/libc.so.6
#14 0x000015554fcc77cf in dlopen@GLIBC_2.2.5 () from /lib64/libc.so.6
#15 0x0000155550f40916 in LAMMPS_NS::plugin_load(char const*, LAMMPS_NS::LAMMPS*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#16 0x0000155550f40f66 in LAMMPS_NS::plugin_auto_load(LAMMPS_NS::LAMMPS*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#17 0x00001555507fe6ed in LAMMPS_NS::LAMMPS::LAMMPS(int, char**, ompi_communicator_t*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#18 0x0000555555556217 in main ()

However, these two configurations do work:
(1) pinning tensorflow to 2.15 and pytorch to 2.1;
(2) using the same abseil version, but building against TensorFlow only, without PyTorch (conda-forge/deepmd-kit-feedstock#79).

I am not sure what the problem is. As a workaround, I pin tensorflow to 2.15 and pytorch to 2.1.

njzjz added the question (Further information is requested) label on Jul 4, 2024
@h-vetinari (Member) commented:

Do you have more information about the calling code? We even have a specific flags test here that runs on every CI run. It's also one of the most-used pieces of abseil, so it must be some very weird corner case for this to not show up in many more places.

@njzjz (Member, Author) commented Jul 4, 2024

It just uses dlopen to load a library, i.e.

dlopen(fname.c_str(), RTLD_NOW | RTLD_GLOBAL);

The library being loaded is linked against abseil, tensorflow, and pytorch. No other specific code seems to be involved.
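
For context, a minimal sketch of that loading path (the plugin path below is a hypothetical placeholder): per the backtrace above, the crash fires inside dlopen() itself, while the dynamic loader runs the static initializers of libabsl_log_flags, before any plugin symbol is ever looked up.

// Minimal sketch of the loading path, with a placeholder plugin path.
// The segfault happens during _dl_init inside this dlopen() call,
// not in any code that uses the returned handle.
// Build: g++ test_dlopen.cc (add -ldl on older glibc).
#include <dlfcn.h>
#include <cstdio>

int main() {
  const char* fname = "/path/to/dpplugin.so";  // placeholder path
  void* handle = dlopen(fname, RTLD_NOW | RTLD_GLOBAL);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  dlclose(handle);
  return 0;
}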

@njzjz (Member, Author) commented Jul 4, 2024

I found it can be reproduced by a simple example:

test_cc.cc:

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"
#include "torch/script.h"

int main() {
  using namespace tensorflow;
  using namespace tensorflow::ops;
  Scope root = Scope::NewRootScope();
  // Matrix A = [3 2; -1 0]
  auto A = Const(root, { {3.f, 2.f}, {-1.f, 0.f} });
  // Vector b = [3 5]
  auto b = Const(root, { {3.f, 5.f} });
  // v = Ab^T
  auto v = MatMul(root.WithOpName("v"), A, b, MatMul::TransposeB(true));
  std::vector<Tensor> outputs;
  ClientSession session(root);
  // Run and fetch v
  TF_CHECK_OK(session.Run({v}, &outputs));
  // Expect outputs[0] == [19; -3]
  LOG(INFO) << outputs[0].matrix<float>();

  torch::Tensor tensor = torch::eye(3);

  return 0;
}

compile.sh:

#!/bin/bash

set -exuo pipefail

# This environment installs the latest TensorFlow and PyTorch
PREFIX=~/anaconda3/envs/tfpt
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltensorflow_framework -ltorch -ltorch_cpu -lc10 -labsl_status -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
./test_cc

Got:

+ ./test_cc
./compile.sh: line 7: 1135726 Segmentation fault      (core dumped) ./test_cc

GDB:

(gdb) r
Starting program: /home/jz748/tmp/testtfpt/test_cc
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-18.fc38.x86_64 xorg-x11-drv-nvidia-cuda-libs-545.29.06-2.fc38.x86_64
(gdb) where
#0  0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#1  0x0000155507bc65c1 in absl::lts_20240116::flags_internal::RegisterCommandLineFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#2  0x0000155507be4079 in _GLOBAL__sub_I_flags.cc () from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_log_flags.so.2401.0.0
#3  0x0000155555525237 in call_init (env=0x7fffffffcd28, argv=0x7fffffffcd18, argc=1, l=<optimized out>) at dl-init.c:74
#4  call_init (l=<optimized out>, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:26
#5  0x000015555552532d in _dl_init (main_map=0x1555555552c0, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:121
#6  0x000015555553b840 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7  0x0000000000000001 in ?? ()
#8  0x00007fffffffd220 in ?? ()
#9  0x0000000000000000 in ?? ()

@h-vetinari (Member) commented:

Does the error reproduce if you comment out torch::Tensor tensor = torch::eye(3);? IOW, does it need both tensorflow and pytorch to fail?

@njzjz (Member, Author) commented Jul 4, 2024

The error can be reproduced even if the C++ file contains nothing but an empty main() and the binary just links both tensorflow and pytorch:

int main() {
  return 0;
}

Got Segmentation fault.

But after removing either -ltorch -ltorch_cpu -lc10 or -ltensorflow_cc -ltensorflow_framework, there is no error.

IOW, does it need both tensorflow and pytorch to fail?

Yes, if one links both of them.

@h-vetinari (Member) commented:

CC @xhochy @hmaarrfk @isuruf, in case you have any ideas what could cause this. My suspicion would be that we're not fully unvendoring abseil in either pytorch or tensorflow, and that something clashes when both are linked (though it's also very weird that this seems to happen even without the linker having to resolve a related symbol).

@hmaarrfk (Contributor) commented Jul 5, 2024

PyTorch might be linking things publicly. I had to patch that at some point.

@hmaarrfk (Contributor) commented Jul 6, 2024

though it's also very weird that this seems to happen even without the linker having to find a related symbol

They sometimes do initialization routines in the startup sections. I thought we had this happen in a few other situations.

While this is a "python" startup (and I know the C example might not hit this particular code path), I can imagine something like this:
conda-forge/pytorch-cpu-feedstock#244
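
To make the "startup sections" point concrete, here is a tiny, purely illustrative sketch (not abseil's actual code): any global with a non-trivial constructor runs from the shared object's init section during dlopen/_dl_init, which is exactly where _GLOBAL__sub_I_flags.cc appears in the backtraces above.

// Illustrative only: code that runs at library load time without any
// symbol being referenced, analogous to abseil's flag registration.
#include <cstdio>

struct RegisterOnLoad {
  RegisterOnLoad() {
    // Executed during dlopen/_dl_init, before main() or any dlsym() call.
    std::puts("static initializer ran");
  }
};
static RegisterOnLoad register_on_load;

int main() { return 0; }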

@hmaarrfk (Contributor) commented Jul 7, 2024

TensorFlow seems to leak the symbols quite heavily:

[screenshot of the exported-symbol listing omitted]

PyTorch leaks fewer, but one does make it through:

nm -gD libtorch_cpu.so | grep absl
                 U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE

Note to self: https://conda-metadata-app.streamlit.app/ is useful.
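
For reference, that undefined symbol is a protobuf helper whose mangled name carries an absl::lts_20240116 type, i.e. the abseil version tag shows up in libtorch_cpu's dynamic symbols via protobuf's API rather than via a direct abseil call. A minimal sketch to demangle it, assuming a GCC/Clang toolchain (c++filt on the command line prints the same thing):

// Demangle the symbol reported by nm above using the C++ ABI helper.
#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const char* mangled =
      "_ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvE"
      "PN4absl12lts_202401169once_flagERKNS0_8MetadataE";
  int status = 0;
  char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status == 0 && demangled != nullptr) {
    std::puts(demangled);  // prints google::protobuf::internal::AssignDescriptors(...)
  }
  std::free(demangled);
  return 0;
}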

@njzjz (Member, Author) commented Sep 18, 2024

nm -gD libtorch_cpu.so | grep absl
U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE

I don't see this symbol in the PyTorch PyPI wheel. Is there something different in the build process on conda-forge?

@njzjz (Member, Author) commented Nov 3, 2024

Finally, I found a workaround: putting the link flag -labsl_log_flags before other libraries. The link order matters.

# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

# No error
g++ -o test_cc test_cc.cc -labsl_log_flags -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -labsl_log_flags -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

When using CMake, the same can be achieved by prepending -labsl_log_flags to LDFLAGS:

export LDFLAGS="-labsl_log_flags ${LDFLAGS}"

No segfault after applying this workaround!

@h-vetinari (Member) commented:

Thanks for digging into this @njzjz! 🙏

My suspicion remains that we've not fully unvendored all traces of abseil in tensorflow (perhaps through a bug, i.e. something not being wired up correctly for TF_SYSTEM_LIBS=abseil).
