segfault in abseil 20240126.2 with TensorFlow 2.16 and PyTorch 2.3 #79
Comments
Do you have more information about the calling code? We even have a specific flags test here that runs on every CI run. It's also one of the most-used pieces of abseil, so it must be some very weird corner case for this to not show up in many more places. |
It just uses dlopen to load a library, i.e. dlopen(fname.c_str(), RTLD_NOW | RTLD_GLOBAL); the library is linked against abseil, TensorFlow, and PyTorch. No other specific code appears to be called. |
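The loading step described above can be sketched roughly as follows. This is a minimal illustration, not the calling code from the feedstock: the helper name `load_library` is hypothetical, and only the `dlopen` flags are taken from the comment. Note that static initializers of the loaded library and its dependencies run inside the `dlopen` call itself, which is presumably where the flag registration reported in this issue happens.

```cpp
#include <dlfcn.h>

#include <iostream>
#include <string>

// Hypothetical helper (not from the thread): load a shared library with the
// flags quoted above and report failures via dlerror().
void* load_library(const std::string& fname) {
  dlerror();  // clear any stale error state
  // RTLD_NOW resolves all undefined symbols immediately; RTLD_GLOBAL makes
  // the library's symbols available to libraries loaded afterwards.
  void* handle = dlopen(fname.c_str(), RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    const char* err = dlerror();
    std::cerr << "dlopen failed: " << (err ? err : "unknown error") << '\n';
  }
  return handle;
}
```

Calling `load_library` with a path that does not exist returns a null handle and prints the loader's error message. On older glibc versions the program may need to be linked with `-ldl`.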
I found that it can be reproduced with a simple example:
test_cc.cc:
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"
#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"
#include "torch/script.h"
int main() {
  using namespace tensorflow;
  using namespace tensorflow::ops;
  Scope root = Scope::NewRootScope();
  // Matrix A = [3 2; -1 0]
  auto A = Const(root, { {3.f, 2.f}, {-1.f, 0.f} });
  // Vector b = [3 5]
  auto b = Const(root, { {3.f, 5.f} });
  // v = Ab^T
  auto v = MatMul(root.WithOpName("v"), A, b, MatMul::TransposeB(true));
  std::vector<Tensor> outputs;
  ClientSession session(root);
  // Run and fetch v
  TF_CHECK_OK(session.Run({v}, &outputs));
  // Expect outputs[0] == [19; -3]
  LOG(INFO) << outputs[0].matrix<float>();
  torch::Tensor tensor = torch::eye(3);
  return 0;
}
compile.sh:
#!/bin/bash
set -exuo pipefail
# This environment installs the latest TensorFlow and PyTorch
PREFIX=~/anaconda3/envs/tfpt
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltensorflow_framework -ltorch -ltorch_cpu -lc10 -labsl_status -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
./test_cc
Got:
GDB:
|
Does the error reproduce if you comment out |
One can reproduce the error even if the C++ file contains only an empty main and just links both TensorFlow and PyTorch:
int main() {
  return 0;
}
Got Segmentation fault. But if removing
Yes, if one links both of them. |
CC @xhochy @hmaarrfk @isuruf |
Pytorch might be linking things publicly. I had to patch that at some point. |
They sometimes do initialization routines in the While this is a "python" startup (and I know the C example might not hit this particular code path), I can imagine something like this. |
Tensorflow seems to leak the symbols quite heavily.
Pytorch leaks fewer, but one does make it
Note to self: https://conda-metadata-app.streamlit.app/ is useful. |
I don't see this symbol in the PyTorch PyPI wheel. Is there something different in the build process on conda-forge? |
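One way to check at runtime which loaded shared object a given symbol actually resolves to is to combine `dlsym` with `dladdr`. The sketch below is only an illustration under stated assumptions: the helper name `providing_library` is hypothetical, `dladdr` and `RTLD_DEFAULT` are glibc extensions (g++ enables them by default on Linux), and C++ symbols such as the abseil one discussed here would have to be passed in their mangled form.

```cpp
#include <dlfcn.h>  // dlsym, dladdr, RTLD_DEFAULT (glibc extensions)

#include <string>

// Hypothetical helper (not from the thread): return the path of the shared
// object that the dynamic linker resolves `symbol` to in the current process,
// or an empty string if the symbol cannot be found.
std::string providing_library(const char* symbol) {
  // RTLD_DEFAULT searches the global symbol scope in default lookup order.
  void* addr = dlsym(RTLD_DEFAULT, symbol);
  if (addr == nullptr) return "";
  Dl_info info;
  // dladdr returns 0 on failure and fills in dli_fname on success.
  if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr) return "";
  return info.dli_fname;
}
```

Calling this from a program linked against both libraries, with the mangled name of the symbol in question, would show which library's definition wins the lookup. It does not show whether other libraries carry their own copies as well.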
Finally, I found a workaround: putting the link flag
# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
# No error
g++ -o test_cc test_cc.cc -labsl_log_flags -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
# Segfault
g++ -o test_cc test_cc.cc -ltensorflow_cc -ltorch -ltorch_cpu -lc10 -labsl_log_flags -I${PREFIX}/include -L${PREFIX}/lib -Wl,-rpath ${PREFIX}/lib
When using CMake, setting
export LDFLAGS="-labsl_log_flags ${LDFLAGS}"
No segfault after applying this workaround! |
Thanks for digging into this @njzjz! 🙏 My suspicion remains that we've not fully unvendored all traces of abseil in tensorflow (perhaps through a bug i.e. something not being wired up correctly for |
Comment:
In conda-forge/deepmd-kit-feedstock#78, I am building a program linked with TensorFlow and PyTorch. When using the latest versions of them, i.e. TensorFlow 2.16 and PyTorch 2.3, there is a segmentation fault in
absl::lts_20240116::flags_internal::FlagRegistry::RegisterFlag
in either conda-forge images or the local environment, as shown below:
However, these two things do work:
(1) Pin tensorflow to 2.15 and pytorch to 2.1;
(2) Use the same abseil version, but built against TensorFlow only, without PyTorch. (conda-forge/deepmd-kit-feedstock#79)
I am not sure what the problem is. As a workaround, I pin tensorflow to 2.15 and pytorch to 2.1.