Problems encountered during AI deployment (continuously updated)

小wu学cv

Last modified 2024-06-16 23:17:19

Tags: linux, programming languages

First published 2023-08-16 16:23:32

Article link: https://blog.csdn.net/caip12999203000/article/details/132322363

1. Building against the Boost libraries from CMakeLists.txt fails at link time with the following errors

[100%] Linking CXX executable VideoServer
CMakeFiles/VideoServer.dir/root/ai_server/main.cpp.o: In function `boost::log::v2s_mt_posix::attribute::impl::~impl()':
main.cpp:(.text._ZN5boost3log12v2s_mt_posix9attribute4implD2Ev[_ZN5boost3log12v2s_mt_posix9attribute4implD5Ev]+0x2e): undefined reference to `boost::log::v2s_mt_posix::attribute::impl::operator delete(void*, unsigned long)'
CMakeFiles/VideoServer.dir/root/ai_server/main.cpp.o: In function .....
CMakeFiles/VideoServer.dir/root/ai_server/log.cpp.o: In function `boost::log::v2s_mt_posix::attributes::attribute_value_impl<boost::posix_time::ptime>::~attribute_value_impl()':
log.cpp:(.text._ZN5boost3log12v2s_mt_posix10attributes20attribute_value_implINS_10posix_time5ptimeEED0Ev[_ZN5boost3log12v2s_mt_posix10attributes20attribute_value_implINS_10posix_time5ptimeEED5Ev]+0x25): undefined reference to `boost::log::v2s_mt_posix::attribute::impl::operator delete(void*, unsigned long)'
collect2: error: ld returned 1 exit status
CMakeFiles/VideoServer.dir/build.make:457: recipe for target 'VideoServer' failed
make[2]: *** [VideoServer] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/VideoServer.dir/all' failed
make[1]: *** [CMakeFiles/VideoServer.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

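- Cause: the code was compiled as if Boost.Log were linked statically (the default; the "s" in the v2s_mt_posix namespace of the missing symbols marks the static configuration), while the Boost.Log library actually linked is the shared one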

- Fix: (reference: https://blog.csdn.net/wxf306989618/article/details/90600613)

# To fix this, add the following line to CMakeLists.txt to define BOOST_LOG_DYN_LINK:
add_definitions(-DBOOST_LOG_DYN_LINK)

# Alternatively, to link every Boost library dynamically, define BOOST_ALL_DYN_LINK instead:
add_definitions(-DBOOST_ALL_DYN_LINK)

2. Python 3.7 built from source has no libpython3.7m.so shared library, so linking C++ code that calls Python fails with:

CMakeFiles/VideoServer.dir/root/Project/ai_server/stu_ai_realizer.cpp.o: In function `_import_array':
stu_ai_realizer.cpp:(.text+0xe): undefined reference to `PyImport_ImportModule'
stu_ai_realizer.cpp:(.text+0x28): undefined reference to `PyExc_ImportError'
stu_ai_realizer.cpp:(.text+0x35): undefined reference to 
..........

tch_ai_realizer.cpp:(.text+0x272): undefined reference to `PyExc_RuntimeError'
tch_ai_realizer.cpp:(.text+0x284): undefined reference to `PyErr_Format'
tch_ai_realizer.cpp:(.text+0x298): undefined reference to `PyExc_RuntimeError'
tch_ai_realizer.cpp:(.text+0x2aa): undefined reference to `PyErr_Format'
collect2: error: ld returned 1 exit status
CMakeFiles/VideoServer.dir/build.make:457: recipe for target 'VideoServer' failed
make[2]: *** [VideoServer] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/VideoServer.dir/all' failed
make[1]: *** [CMakeFiles/VideoServer.dir/all] Error 2
Makefile:83: recipe for target 'all' failed

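- Cause: when Python 3.7 is built and installed from source, the shared library (libpython3.7m.so) is not generated by default

- Solution: (add the --enable-shared option to ./configure)

1. Download the Python 3.7 source and extract it: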
wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz
tar -xf Python-3.7.9.tgz
cd Python-3.7.9

2. Run the configure script with the --enable-shared option:
./configure --enable-shared

3. Build Python 3.7:
make -j8

4. Install Python 3.7:
make install
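
After the rebuild, it can be worth confirming that the new interpreter really was configured with --enable-shared before relinking the C++ project. The following is a minimal sketch, not part of the original fix; the helper name shared_libpython_info is made up for illustration, and it only reads the standard sysconfig build variables:

```python
import sysconfig


def shared_libpython_info():
    """Report how this interpreter was built (helper name is hypothetical)."""
    return {
        # 1 if configured with --enable-shared, 0 if not
        # (may be None on platforms that do not define the variable).
        'enable_shared': sysconfig.get_config_var('Py_ENABLE_SHARED'),
        # File name of the Python library, e.g. a .so for shared builds.
        'library': sysconfig.get_config_var('LDLIBRARY'),
        # Directory the library is installed into.
        'libdir': sysconfig.get_config_var('LIBDIR'),
    }


if __name__ == '__main__':
    for key, value in shared_libpython_info().items():
        print(f'{key}: {value}')
```

Run it with the newly installed interpreter. If enable_shared is 1 but the C++ program still cannot find the library at run time, the directory printed as libdir is probably not on the dynamic loader's search path.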

3. Docker driver error (--gpus)

(base) wcp@kfb-sy08-031:~$ sudo docker run -it --gpus all nhorro/tensorflow1.12-py3-jupyter-opencv
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

Solution:

# Either of the commands below may fix it; the second is the more appropriate one
sudo apt-get upgrade docker-ce
sudo apt-get install -y nvidia-docker2

# Restart the Docker daemon
sudo systemctl restart docker

4. Socket communication: received fields do not match what was sent

# Message to send:

identifier = b'@#$&'

version = 2

load_len = 64

msg_code = 3005

reserve = 0
# Message received:

Connected to client
Identifier: @#$&
Version: 33554432
Load Length: 1073741824
Message Code: -1123352576
Reserve: 0

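Cause: (in my case it was a big-endian vs. little-endian problem)

Inconsistent struct layout: the C side defines the Net_Msg_Head_t struct under #pragma pack(push, 1), i.e. with 1-byte alignment, while the Python side may not pack its data the same way. The two layouts then differ and the fields are parsed incorrectly. Python's struct module has no separate alignment function; alignment is selected by the first character of the format string. The default '@' uses native alignment and may insert padding, whereas '=', '<', '>' and '!' use standard sizes with no padding, which matches a 1-byte-packed C struct.

Byte order: the byte order used when packing on the Python side may differ from the one the C side reads with, so multi-byte integers are decoded with their bytes reversed. That is what happened here: version 2 (0x00000002) arrived as 33554432 (0x02000000). Specify the byte order explicitly in struct.pack so that it matches the C side: '!i' means network (big-endian) order and '<i' means little-endian.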

Fix:

version_bytes = struct.pack('!i', version)
load_len_bytes = struct.pack('!i', load_len)
msg_code_bytes = struct.pack('!i', msg_code)
reserve_bytes = struct.pack('!i', reserve)

----> change to:
version_bytes = struct.pack('<i', version)
load_len_bytes = struct.pack('<i', load_len)
msg_code_bytes = struct.pack('<i', msg_code)
reserve_bytes = struct.pack('<i', reserve)
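
The same fix can be written with a single format string for the whole header, which also makes the byte-order mistake easy to reproduce. This is a minimal sketch under an assumption: the header is taken to be a 4-byte identifier followed by four signed 32-bit integers, which is inferred from the values above and not from the actual Net_Msg_Head_t definition. The function names are hypothetical.

```python
import struct

# Assumed wire layout (hypothetical): 4-byte identifier + four signed 32-bit
# integers. '<' selects little-endian with standard sizes and no padding,
# which corresponds to a C struct declared under #pragma pack(push, 1) on a
# little-endian machine.
HEADER_FORMAT = '<4siiii'
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 4 + 4 * 4 = 20 bytes


def pack_header(identifier, version, load_len, msg_code, reserve=0):
    """Serialize the header fields in little-endian order."""
    return struct.pack(HEADER_FORMAT, identifier, version, load_len,
                       msg_code, reserve)


def unpack_header(data):
    """Parse the first HEADER_SIZE bytes back into a tuple of fields."""
    return struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])


def read_big_endian_as_little(identifier, version, load_len, msg_code,
                              reserve=0):
    """Reproduce the bug: pack in network order, decode as little-endian."""
    wrong = struct.pack('!4siiii', identifier, version, load_len,
                        msg_code, reserve)
    return struct.unpack(HEADER_FORMAT, wrong)


if __name__ == '__main__':
    print(unpack_header(pack_header(b'@#$&', 2, 64, 3005)))
    # -> (b'@#$&', 2, 64, 3005, 0)
    print(read_big_endian_as_little(b'@#$&', 2, 64, 3005))
    # -> (b'@#$&', 33554432, 1073741824, -1123352576, 0)
```

The second call yields exactly the garbled values shown in the received message, which confirms that byte order alone explains the mismatch.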

4. Showing the Python error message when a call from C++ into Python fails

PyObject* pRet = PyObject_CallObject(pFunc_stu, ArgArray);
if (!pRet) {
    // Check whether a Python exception occurred
    if (PyErr_Occurred()) {
        // Print the Python exception and traceback to stderr
        PyErr_Print();
    }
    std::cout << "Call python student-detect-function failed" << std::endl;
    return false;
}

5. Error when importing torchvision: No module named '_bz2'

Error message:

ModuleNotFoundError: No module named '_bz2'

Solution:

# 1. On a Linux machine that has the same Python version installed, locate _bz2.cpython-310-x86_64-linux-gnu.so
# (the 310 in the file name is the Python version, here 3.10)
find / -name "_bz2.cpython-310-x86_64-linux-gnu.so"
# 2. Copy the .so file to the failing machine and move it into that Python's lib-dynload directory (python-3.10 and python3.10 must match your version)
mv _bz2.cpython-310-x86_64-linux-gnu.so /usr/local/python-3.10/lib/python3.10/lib-dynload/

# 3. If import torchvision then fails with: ModuleNotFoundError: No module named '_lzma'
# locate the '_lzma' .so file the same way (310 is the Python version)
find / -name "_lzma.cpython-310-x86_64-linux-gnu.so"
# 4. Copy that .so file to the failing machine and move it into the same lib-dynload directory
mv _lzma.cpython-310-x86_64-linux-gnu.so /usr/local/python-3.10/lib/python3.10/lib-dynload/
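
Before and after copying the files, a quick check on the failing machine shows which extension modules the interpreter still cannot find and where it expects them. This is a minimal sketch; the helper name missing_extension_modules is made up for illustration, and the lib-dynload path is derived from the standard library location, which is the usual layout for a source build on Linux but is an assumption here.

```python
import importlib.util
import os
import sysconfig


def missing_extension_modules(names=('_bz2', '_lzma')):
    """Return the names from `names` that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Usual location of compiled extension modules for a source build on Linux
# (assumed layout: <stdlib>/lib-dynload).
LIB_DYNLOAD = os.path.join(sysconfig.get_path('stdlib'), 'lib-dynload')

if __name__ == '__main__':
    print('missing modules:', missing_extension_modules())
    print('expected directory:', LIB_DYNLOAD)
```

Run it with the same interpreter that raises the error, so that the printed directory is the one the copied .so files should go into.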

6. cuDNN errors

return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

RuntimeError: CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

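Solution:

If your hardware or driver does not support cuDNN, or you are running in a Docker container or virtual environment that does not include cuDNN, you will need to disable it. PyTorch then falls back to its non-cuDNN kernels, which are usually slower: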

import torch
torch.backends.cudnn.enabled = False

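Besides torch.backends.cudnn.enabled, there are related settings such as torch.backends.cudnn.benchmark and torch.backends.cudnn.deterministic, which give finer control over how PyTorch uses cuDNN:

torch.backends.cudnn.benchmark: when set to True, cuDNN tries several algorithms to find the fastest one for the current configuration. This can improve performance but adds some initial latency.
torch.backends.cudnn.deterministic: when set to True, optimizations that can make results non-deterministic are disabled, so that the model runs deterministically.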

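7. Expanding a VMware virtual disk

(1) Create a new disk partition

- Reference: CSDN blog post "VMWare虚拟机扩容并挂载磁盘" (expanding and mounting a disk in a VMware virtual machine)

(2) Expand manually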

# If your root partition / is a logical volume managed by LVM (Logical Volume Manager), you can extend it with the LVM tools. For example, if the root volume is /dev/ubuntu-vg/ubuntu-lv:
# sudo lvextend -L +<size to add> /dev/ubuntu-vg/ubuntu-lv
sudo lvextend -L +99g /dev/ubuntu-vg/ubuntu-lv

# Then grow the file system to fill the volume with resize2fs (this applies to ext2/ext3/ext4 file systems):
sudo resize2fs /dev/ubuntu-vg/ubuntu-lv