segfault in abseil 20240116.2 with TensorFlow 2.16 and PyTorch 2.3 #79

Open · njzjz opened this issue on Jul 4, 2024 · 12 comments
Labels: question (Further information is requested)

Comments

@njzjz (Member) commented Jul 4, 2024

Comment:

In conda-forge/deepmd-kit-feedstock#78, I am building a program linked with TensorFlow and PyTorch. When using the latest versions of both, i.e. TensorFlow 2.16 and PyTorch 2.3, there is a segmentation fault in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag, in both conda-forge images and a local environment, as shown below:

(gdb) where
#0  0x0000155541a60e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_flags_reflection.so.2401.0.0
#1  0x0000155541a625c1 in absl::lts_20240116::flags_internal::RegisterCommandLineFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_flags_reflection.so.2401.0.0
#2  0x0000155541a80079 in _GLOBAL__sub_I_flags.cc ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/./python3.11/site-packages/tensorflow/../../../libabsl_log_flags.so.2401.0.0
#3  0x0000155555525237 in call_init (env=0x55555567dbf0, argv=0x7fffffffad58, argc=3, l=<optimized out>) at dl-init.c:74
#4  call_init (l=<optimized out>, argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-init.c:26
#5  0x000015555552532d in _dl_init (main_map=0x555555780eb0, argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-init.c:121
#6  0x00001555555215c2 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=operate@entry=0x15555552bf50 <call_dl_init>,
    args=args@entry=0x7fffffffa290) at dl-catch.c:211
#7  0x000015555552beec in dl_open_worker (a=a@entry=0x7fffffffa440) at dl-open.c:827
#8  0x0000155555521523 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffa420,
    operate=operate@entry=0x15555552be50 <dl_open_worker>, args=args@entry=0x7fffffffa440) at dl-catch.c:237
#9  0x000015555552c2e4 in _dl_open (file=0x555555780cc0 "/home/jz748/anaconda3/envs/test-deepmd-build/lib/deepmd_lmp/dpplugin.so",
    mode=<optimized out>, caller_dlopen=0x155550f40916 <LAMMPS_NS::plugin_load(char const*, LAMMPS_NS::LAMMPS*)+166>, nsid=<optimized out>,
    argc=3, argv=0x7fffffffad58, env=0x55555567dbf0) at dl-open.c:903
#10 0x000015554fcc7714 in dlopen_doit () from /lib64/libc.so.6
#11 0x0000155555521523 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffa630, operate=0x15554fcc76b0 <dlopen_doit>,
    args=0x7fffffffa6f0) at dl-catch.c:237
#12 0x0000155555521679 in _dl_catch_error (objname=0x7fffffffa698, errstring=0x7fffffffa6a0, mallocedp=0x7fffffffa697, operate=<optimized out>,
    args=<optimized out>) at dl-catch.c:256
#13 0x000015554fcc71f3 in _dlerror_run () from /lib64/libc.so.6
#14 0x000015554fcc77cf in dlopen@GLIBC_2.2.5 () from /lib64/libc.so.6
#15 0x0000155550f40916 in LAMMPS_NS::plugin_load(char const*, LAMMPS_NS::LAMMPS*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#16 0x0000155550f40f66 in LAMMPS_NS::plugin_auto_load(LAMMPS_NS::LAMMPS*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#17 0x00001555507fe6ed in LAMMPS_NS::LAMMPS::LAMMPS(int, char**, ompi_communicator_t*) ()
   from /home/jz748/anaconda3/envs/test-deepmd-build/bin/../lib/liblammps.so.0
#18 0x0000555555556217 in main ()

However, these two configurations do work:
(1) pinning tensorflow to 2.15 and pytorch to 2.1;
(2) using the same abseil version, but building against TensorFlow only, without PyTorch (conda-forge/deepmd-kit-feedstock#79).

I am not sure what the problem is. As a workaround, I pin tensorflow to 2.15 and pytorch to 2.1.

njzjz added the question (Further information is requested) label on Jul 4, 2024
@h-vetinari (Member) commented:

Do you have more information about the calling code? We even have a specific flags test here that runs on every CI run. It's also one of the most-used pieces of abseil, so it must be some very weird corner case for this to not show up in many more places.

@njzjz (Member, Author) commented Jul 4, 2024

It just uses dlopen to load a library, i.e.

dlopen(fname.c_str(), RTLD_NOW | RTLD_GLOBAL);

The library being loaded is linked against abseil, tensorflow, and pytorch. No other specific code seems to be involved.
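
For context, a minimal sketch of that loading path (the plugin path below is a hypothetical placeholder): per the backtrace above, the crash fires inside dlopen() itself, while the dynamic loader runs the static initializers of libabsl_log_flags, before any plugin symbol is ever looked up.

// Minimal sketch of the loading path, with a placeholder plugin path.
// The segfault happens during _dl_init inside this dlopen() call,
// not in any code that uses the returned handle.
// Build: g++ test_dlopen.cc (add -ldl on older glibc).
#include <dlfcn.h>
#include <cstdio>

int main() {
  const char* fname = "/path/to/dpplugin.so";  // placeholder path
  void* handle = dlopen(fname, RTLD_NOW | RTLD_GLOBAL);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  dlclose(handle);
  return 0;
}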

@njzjz (Member, Author) commented Jul 4, 2024

I found it can be reproduced by a simple example:

test_cc.cc:

#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"
#include "torch/script.h"

int main() {
  using namespace tensorflow;
  using namespace tensorflow::ops;
  Scope root = Scope::NewRootScope();
  // Matrix A = [3 2; -1 0]
  auto A = Const(root, { {3.f, 2.f}, {-1.f, 0.f} });
  // Vector b = [3 5]
  auto b = Const(root, { {3.f, 5.f} });
  // v = Ab^T
  auto v = MatMul(root.WithOpName("v"), A, b, MatMul::TransposeB(true));
  std::vector<Tensor> outputs;
  ClientSession session(root);
  // Run and fetch v
  TF_CHECK_OK(session.Run({v}, &outputs));
  // Expect outputs[0] == [19; -3]
  LOG(INFO) << outputs[0].matrix<float>();

  torch::Tensor tensor = torch::eye(3);

  return 0;
}

compile.sh:

#!/bin/bash

set -exuo pipefail

# This environment installs the latest TensorFlow and PyTorch
PREFIX=~/anaconda3/envs/tfpt
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltensorflow_framework -ltorch -ltorch_cpu -lc10 -labsl_status -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
./test_cc

Got:

+ ./test_cc
./compile.sh: line 7: 1135726 Segmentation fault      (core dumped) ./test_cc

GDB:

(gdb) r
Starting program: /home/jz748/tmp/testtfpt/test_cc
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.37-18.fc38.x86_64 xorg-x11-drv-nvidia-cuda-libs-545.29.06-2.fc38.x86_64
(gdb) where
#0  0x0000155507bc4e09 in absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#1  0x0000155507bc65c1 in absl::lts_20240116::flags_internal::RegisterCommandLineFlag(absl::lts_20240116::CommandLineFlag&, char const*) ()
   from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_flags_reflection.so.2401.0.0
#2  0x0000155507be4079 in _GLOBAL__sub_I_flags.cc () from /home/jz748/anaconda3/envs/tfpt/lib/libabsl_log_flags.so.2401.0.0
#3  0x0000155555525237 in call_init (env=0x7fffffffcd28, argv=0x7fffffffcd18, argc=1, l=<optimized out>) at dl-init.c:74
#4  call_init (l=<optimized out>, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:26
#5  0x000015555552532d in _dl_init (main_map=0x1555555552c0, argc=1, argv=0x7fffffffcd18, env=0x7fffffffcd28) at dl-init.c:121
#6  0x000015555553b840 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7  0x0000000000000001 in ?? ()
#8  0x00007fffffffd220 in ?? ()
#9  0x0000000000000000 in ?? ()

@h-vetinari (Member) commented:

Does the error reproduce if you comment out torch::Tensor tensor = torch::eye(3);? IOW, does it need both tensorflow and pytorch to fail?

@njzjz (Member, Author) commented Jul 4, 2024

The error can be reproduced even if the C++ file contains nothing but an empty main() and the binary just links both tensorflow and pytorch:

int main() {
  return 0;
}

Got Segmentation fault.

But after removing either -ltorch -ltorch_cpu -lc10 or -ltensorflow_cc -ltensorflow_framework, there is no error.

IOW, does it need both tensorflow and pytorch to fail?

Yes, if one links both of them.

@h-vetinari (Member) commented:

CC @xhochy @hmaarrfk @isuruf, in case you have any ideas what could cause this. My suspicion would be that we're not fully unvendoring abseil in either pytorch or tensorflow, and that something clashes when both are linked (though it's also very weird that this seems to happen even without the linker having to resolve a related symbol).

@hmaarrfk (Contributor) commented Jul 5, 2024

PyTorch might be linking things publicly. I had to patch that at some point.

@hmaarrfk (Contributor) commented Jul 6, 2024

though it's also very weird that this seems to happen even without the linker having to find a related symbol

They sometimes do initialization routines in the startup sections. I thought we had this happen in a few other situations.

While this is a "python" startup (and I know the C example might not hit this particular code path), I can imagine something like this:
conda-forge/pytorch-cpu-feedstock#244
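
To make the "startup sections" point concrete, here is a tiny, purely illustrative sketch (not abseil's actual code): any global with a non-trivial constructor runs from the shared object's init section during dlopen/_dl_init, which is exactly where _GLOBAL__sub_I_flags.cc appears in the backtraces above.

// Illustrative only: code that runs at library load time without any
// symbol being referenced, analogous to abseil's flag registration.
#include <cstdio>

struct RegisterOnLoad {
  RegisterOnLoad() {
    // Executed during dlopen/_dl_init, before main() or any dlsym() call.
    std::puts("static initializer ran");
  }
};
static RegisterOnLoad register_on_load;

int main() { return 0; }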

@hmaarrfk (Contributor) commented Jul 7, 2024

TensorFlow seems to leak the symbols quite heavily:

[screenshot of the exported-symbol listing omitted]

PyTorch leaks fewer, but one does make it through:

nm -gD libtorch_cpu.so | grep absl
                 U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE

Note to self: https://conda-metadata-app.streamlit.app/ is useful.
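
For reference, that undefined symbol is a protobuf helper whose mangled name carries an absl::lts_20240116 type, i.e. the abseil version tag shows up in libtorch_cpu's dynamic symbols via protobuf's API rather than via a direct abseil call. A minimal sketch to demangle it, assuming a GCC/Clang toolchain (c++filt on the command line prints the same thing):

// Demangle the symbol reported by nm above using the C++ ABI helper.
#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const char* mangled =
      "_ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvE"
      "PN4absl12lts_202401169once_flagERKNS0_8MetadataE";
  int status = 0;
  char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status == 0 && demangled != nullptr) {
    std::puts(demangled);  // prints google::protobuf::internal::AssignDescriptors(...)
  }
  std::free(demangled);
  return 0;
}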

@njzjz (Member, Author) commented Sep 18, 2024

nm -gD libtorch_cpu.so | grep absl
U _ZN6google8protobuf8internal17AssignDescriptorsEPFPKNS1_15DescriptorTableEvEPN4absl12lts_202401169once_flagERKNS0_8MetadataE

I don't see this symbol in the PyTorch PyPI wheel. Is there something different in the build process on conda-forge?

@njzjz (Member, Author) commented Nov 3, 2024

Finally, I found a workaround: putting the link flag -labsl_log_flags before other libraries. The link order matters.

# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

# No error
g++ -o test_cc test_cc.cc -labsl_log_flags -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -labsl_log_flags -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib

When using CMake, the same can be achieved by prepending -labsl_log_flags to LDFLAGS:

export LDFLAGS="-labsl_log_flags ${LDFLAGS}"

No segfault after applying this workaround!

@h-vetinari (Member) commented:

Thanks for digging into this @njzjz! 🙏

My suspicion remains that we've not fully unvendored all traces of abseil in tensorflow (perhaps through a bug, i.e. something not being wired up correctly for TF_SYSTEM_LIBS=abseil).
