Linking to both tensorflow and protobuf causes segmentation fault during static initializers

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 4.18.10-1rodete2-amd64 (Debian-derived)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): nightly Jan 15, 2019 (protobuf built from HEAD on Jan 15)
  • Python version: N/A
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): gcc 7.3.0
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

Describe the current behavior
Aborts with SIGSEGV while shared-library static initializers are running, before main() is entered.

Describe the expected behavior
Exits cleanly

Details
I want to create an application that calls the C API but can also parse protocol buffers on its own behalf. To do that, I want to link dynamically to TensorFlow and statically to protobuf. When I do this, protobuf seems to trick libtensorflow.so into believing that certain static initializers have already run when in fact they have not (specifically, the initializers for the static variables of its own internal copy of protobuf).

The segfault is only on Linux. Linking the same way on Windows works fine.

I have tried several libtensorflow and protobuf versions, and it happens with all of them. It also happens regardless of whether I link my binary's copy of protobuf statically or dynamically.

I also tried building my own liba.so that itself statically links protobuf, and then a binary that links dynamically to "a" and statically to protobuf. This worked, which suggests the problem is not purely a protobuf issue.

Code to reproduce the issue

  • bash
c++ -o main \
  -L$TF_DIR/lib -I$TF_DIR/include \
  -L$PROTO_DIR/lib -I$PROTO_DIR/include \
  main.cc -l tensorflow -l protobuf

LD_LIBRARY_PATH=$TF_DIR/lib:$PROTO_DIR/lib ./main

Removing -lprotobuf from the above command will get rid of the segfault.

  • main.cc
int main(int argc, char** argv) {}

Other info / logs

Program received signal SIGSEGV, Segmentation fault.
0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
(gdb) bt
#0 0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#1 0x00007fffed88336a in tensorflow::kernel_factory::OpKernelRegistrar::OpKernelRegistrar(tensorflow::KernelDef const*, absl::string_view, tensorflow::OpKernel* (*)(tensorflow::OpKernelConstruction*)) ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#2 0x00007fffed85f806 in _GLOBAL__sub_I_dataset.cc ()
from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#3 0x00007ffff7de88aa in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdc68, env=env@entry=0x7fffffffdc78)
at dl-init.c:72
#4 0x00007ffff7de89bb in call_init (env=0x7fffffffdc78, argv=0x7fffffffdc68, argc=1, l=<optimized out>) at dl-init.c:30
#5 _dl_init (main_map=0x7ffff7ffe170, argc=1, argv=0x7fffffffdc68, env=0x7fffffffdc78) at dl-init.c:120
#6 0x00007ffff7dd9c5a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7 0x0000000000000001 in ?? ()
#8 0x00007fffffffdf2e in ?? ()
#9 0x0000000000000000 in ?? ()

0x00007fffed8f20a0 <+80>: mov 0x50(%r15),%rax
0x00007fffed8f20a4 <+84>: lea -0xa0(%rbp),%rbx
0x00007fffed8f20ab <+91>: mov %rbx,%rdi
0x00007fffed8f20ae <+94>: mov (%rax),%r8
0x00007fffed8f20b1 <+97>: mov 0x48(%r15),%rax
0x00007fffed8f20b5 <+101>: mov (%rax),%rsi
=> 0x00007fffed8f20b8 <+104>: mov -0x18(%r8),%r9

How did -0x18(%r8) get illegal?

(gdb) info register r8
r8 0x0 0

With r8 = 0, -0x18(%r8) is certainly illegal. Where did r8 come from? Tracing back through the instructions above, it was loaded through the pointer stored at 0x50(%r15).

(gdb) info register r15
r15 0x555555768d10 93824994413840

(gdb) x/2 0x555555768d60
0x555555768d60: 0xee2c0bc0 0x00007fff

(gdb) x/2 0x00007fffee2c0bc0
0x7fffee2c0bc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000

... which is the 0x0 that ended up in r8: the first word of fixed_address_empty_string is still zero.

Zooming out shows that many of protobuf's static variables are still uninitialized:

(gdb) x/64x 0x7fffee4ddb00
0x7fffee4ddb00 <google::protobuf::_DoubleValue_default_instance_>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb10 <google::protobuf::_DoubleValue_default_instance_+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb20 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb30 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb40 <google::protobuf::internal::RepeatedPrimitiveDefaults::default_instance()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb50 <guard variable for google::protobuf::internal::RepeatedStringTypeTraits::GetDefaultRepeatedField()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb60 <guard variable for google::protobuf::internal::(anonymous namespace)::Register(google::protobuf::MessageLite const*, int, google::protobuf::internal::ExtensionInfo)::local_static_registry>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb70 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb80 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb90 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddba0 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+32>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbb0 <guard variable for google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::runner>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbd0 <google::protobuf::internal::implicit_weak_message_default_instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbe0 <google::protobuf::internal::implicit_weak_message_default_instance+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbf0 <google::protobuf::ShutdownProtobufLibrary()::is_shutdown>: 0x00000000 0x00000000 0x00000000 0x00000000

matth79 commented on 17 Jan 2019

I found a temporary workaround for myself, but it should still be possible to do this from released binaries without the need to rebuild.

A local opt build from r1.12 at commit a6d8ffa works:

bazel build -c opt --copt=-mavx --define=grpc_no_ares=true //tensorflow/tools/lib_package:libtensorflow

tar zxvf ../tensorflow/bazel-bin/tensorflow/tools/lib_package/libtensorflow.tar.gz

However, I get the segfault using

https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-1.12.0.tar.gz

with protobuf built locally from

https://github.com/protocolbuffers/protobuf/releases/download/v3.6.0/protobuf-all-3.6.0.tar.gz

and also from

https://storage.googleapis.com/tensorflow-nightly/github/tensorflow/lib_package/libtensorflow-cpu-linux-x86_64.tar.gz # Wed Jan 16 22:33:29 PST 2019

with protobuf built locally from head (3.6.1) around the same time.
