Jetson Xavier NX: Installation and Testing, Pose Recognition Testing, and Paddle Inference on Jetson


SensorFusion


Category: Environment Setup Guides. Tags: docker pytorch ubuntu

Article link: https://blog.csdn.net/nh54zyt/article/details/114935270

NX_README
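TensorRT developer guide:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
Kit contents

NVIDIA Jetson Xavier NX module and carrier board
19 V power adapter
802.11 wireless network card and Bluetooth module (mounted on the underside of the carrier board)
Instruction manual

What you need to prepare

SD card (larger than 16 GB; a high-speed card of 64 GB or more is recommended)
Display with a DP or HDMI port
USB keyboard and mouse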

1: Flashing steps

①: Download the SD card image

https://developer.nvidia.com/jetson-nx-developer-kit-sd-card-image

②: Download the flashing tool

http://file.ncnynl.com/rpi/Win32DiskImager-0.9.5-install.exe

Use it to write the image to the SD card.


③: Insert the flashed SD card into the card slot on the NX, then power on and test.


2: Version check

My NX developer kit was flashed with JetPack 4.4.0.

1. Driver (L4T) version: head -n 1 /etc/nv_tegra_release
	# R32 (release), REVISION: 4.2, GCID: 20074772, BOARD: t186ref, EABI: aarch64, DATE: Thu Apr  9 01:26:40 UTC 2020

2. Kernel version: uname -r
	4.9.140-tegra

3. Operating system: lsb_release -i -r
	Distributor ID:	Ubuntu
	Release:	18.04

4. CUDA version: nvcc -V
	nvcc: NVIDIA (R) Cuda compiler driver
	Copyright (c) 2005-2019 NVIDIA Corporation
	Built on Wed_Oct_23_21:14:42_PDT_2019
	Cuda compilation tools, release 10.2, V10.2.89

5. cuDNN version: dpkg -l libcudnn8

	Desired=Unknown/Install/Remove/Purge/Hold
	| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
	|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
	||/ Name               Version        Architecture   Description
	+++-==================-==============-==============-=========================================
	ii  libcudnn8          8.0.0.145-1+cu arm64          cuDNN runtime libraries


6. OpenCV version: dpkg -l libopencv

	Desired=Unknown/Install/Remove/Purge/Hold
	| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
	|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
	||/ Name               Version        Architecture   Description
	+++-==================-==============-==============-=========================================
	ii  libopencv          4.1.1-2-gd5a58 arm64          Open Computer Vision Library

7. TensorRT version: dpkg -l tensorrt (or: dpkg -l | grep TensorRT)
	Desired=Unknown/Install/Remove/Purge/Hold
	| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
	|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
	||/ Name               Version        Architecture   Description
	+++-==================-==============-==============-=========================================
	ii  tensorrt           7.1.0.16-1+cud arm64          Meta package of TensorRT
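
As a small aside, the first line of /etc/nv_tegra_release can also be parsed in code, for example to log the L4T version when an application starts. The sketch below is only an illustration: the function name parseL4tVersion is my own, and it assumes the line has the same shape as the output shown in item 1.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Extracts "MAJOR.MINOR.PATCH" from a line such as
// "# R32 (release), REVISION: 4.2, GCID: ...".
// Returns an empty string if the line does not have that shape.
std::string parseL4tVersion(const std::string& line) {
    int major = 0, minor = 0, patch = 0;
    const int matched = std::sscanf(line.c_str(),
                                    "# R%d (release), REVISION: %d.%d",
                                    &major, &minor, &patch);
    if (matched != 3) {
        return "";
    }
    return std::to_string(major) + "." + std::to_string(minor) + "." +
           std::to_string(patch);
}
```

For the line shown in item 1, this returns the string 32.4.2.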

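3: CUDA environment configuration

Add the following lines to ~/.bashrc so that the CUDA tools and libraries are found:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda

Then reload the file:

source ~/.bashrc

Check nvcc:

nvcc -V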

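4: Installing Python 3 and jtop

The Python preinstalled on the NX developer kit is version 2.7. To use Python 3, install pip and the development headers from a terminal:

sudo apt-get install python3-pip python3-dev
# Then upgrade pip to the latest version
python3 -m pip install --upgrade pip

Install jtop to monitor memory, CPU, and GPU usage:

sudo pip3 install -U jetson-stats
sudo systemctl restart jetson_stats.service
sudo jtop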


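5: NX fan control

The system kernel on the Xavier NX has its own algorithm that controls the fan speed according to temperature: the fan starts automatically at around 40 °C and stops again once the core temperature falls below roughly 39 °C.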

Power mode:
In the 15 W mode the fan spins on its own.

Command to set the speed manually:

sudo sh -c 'echo 140 > /sys/devices/pwm-fan/target_pwm'

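The number 140 in the command is the fan's PWM duty-cycle value. Its range is 0 to 255: 0 stops the fan completely and 255 runs it at full speed. In everyday use I prefer a value between 100 and 150, which is a duty cycle of roughly 40% to 60%. Lower values do not cool well enough, and higher values make the fan almost as loud as a desktop PC, which gets annoying. Apart from heavy compilation or running large networks that saturate the board, a value of 255 is not needed.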
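
To make the relationship between the 0 to 255 value and the percentage explicit, here is a minimal sketch of the conversion in both directions. The helper names dutyPercentToPwm and pwmToDutyPercent are made up for this example; the only thing taken from the text above is the 0 to 255 range.

```cpp
#include <cassert>
#include <cmath>

// Convert a duty cycle in percent (0 to 100) into the 0 to 255 value
// that is written to target_pwm.
inline int dutyPercentToPwm(double percent) {
    assert(percent >= 0.0 && percent <= 100.0);
    return static_cast<int>(std::lround(percent / 100.0 * 255.0));
}

// Convert a 0 to 255 PWM value back into a duty cycle in percent.
inline double pwmToDutyPercent(int pwm) {
    assert(pwm >= 0 && pwm <= 255);
    return pwm * 100.0 / 255.0;
}
```

With these helpers, 40% maps to 102 and 60% maps to 153, while the value 140 used in the command corresponds to a duty cycle of about 55%.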

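6: Changing the apt sources on the NX

Rule for choosing a source:

The processor is aarch64 and the system is Ubuntu 18.04.2 LTS, so use a source that matches both (the ubuntu-ports repositories).

To add the Tsinghua mirror in China, first back up the original sources.list file:

sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak    # back up sources.list first so it can be restored after a mistake
sudo gedit /etc/apt/sources.list

Then delete all of its contents, paste in the following lines, and save: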

deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic main multiverse restricted universe
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-security main multiverse restricted universe
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-updates main multiverse restricted universe
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-backports main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-security main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-updates main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-backports main multiverse restricted universe

Finally, open a terminal and run:

sudo apt-get update

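6: Installing TensorFlow

Before installing TensorFlow, be sure to find the TensorFlow version that matches the JetPack and Python versions on your board.
It is best to check the NVIDIA developer forums, because a mismatched version may not work even if the installation succeeds.
The official answer for Python 3.6 + JetPack 4.4 is here: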
https://forums.developer.nvidia.com/t/official-tensorflow-for-jetson-agx-xavier/65523

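7: Installing a USB camera test tool

Many USB cameras come without Linux-specific drivers, but UVC-compliant cameras work with the UVC driver built into Ubuntu. Cheese is a simple webcam viewer that makes it easy to check the camera, and a single command installs it:

sudo apt-get install cheese

After installation, run in a terminal:

cheese

This opens the USB camera, which then works plug-and-play.
To see the device number of the camera that is currently plugged in, run: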

ls /dev/video*

8: jetson-inference examples

Reference:
https://blog.csdn.net/zbb297918657/article/details/106432773
Official:
GitHub repository of the jetson-inference project
jetson-inference download: https://github.com/dusty-nv/jetson-inference

9: Examples:

Trying out JetPack 4.2: using Python 3 on a Jetson TX2 to run a TensorFlow model with TensorRT
https://blog.csdn.net/weixin_43842032/article/details/88753724

TensorRT is located at:

/usr/src/tensorrt

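10: Installing Caffe:

If the model needs plugins, the engine has to be built from the source code together with the plugins:
https://blog.csdn.net/u012614287/article/details/81537743

Converting an ordinary model to a TensorRT engine does not require installing the source on the NX.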

11: Testing models converted from Caffe and ONNX

The following link error appears:

CMakeFiles/traffic_det_reg_caffe_trt.dir/src/TrafficDetection.cpp.o: In function `onnxToTRTModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nvinfer1::ICudaEngine*&, int const&)':
TrafficDetection.cpp:(.text+0x1357): undefined reference to `createNvOnnxParser_INTERNAL'

Add the following to CMakeLists.txt so that the ONNX parser library is linked:

#/home/name/TensorRT-7.1.3.4/lib/libnvinfer.so
#/home/name/TensorRT-7.1.3.4/lib/libnvinfer_plugin.so
#/home/name/TensorRT-7.1.3.4/lib/libnvparsers.so
-lnvonnxparser

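12: Using the DLA on the NX

The core strength of the Jetson Xavier NX is its machine-learning inference performance.
Besides the CPU and GPU, the Jetson Xavier NX also contains DLA (Deep Learning Accelerator) and PVA (Programmable Vision Accelerator) units. The combination of the Volta GPU and the DLA cores gives it considerable processing power on a low-power platform.

To showcase the system's inference capability, NVIDIA provides a large set of software development kits and hand-tuned frameworks for the Jetson platform. These take much of the heavy preparatory work off developers so that they can make full use of the GPU and the DLA units.

Layers that the DLA does not support are handed back to the GPU. Overall this saves GPU resources, but a model runs at roughly half the speed on the DLA.

// Sample code (see /usr/src/tensorrt/samples/common/common.h on the NX):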
inline void enableDLA(IBuilder* builder, IBuilderConfig* config, int useDLACore, bool allowGPUFallback = true)
{
    if (useDLACore >= 0)
    {
        if (builder->getNbDLACores() == 0)
        {
            std::cerr << "Trying to use DLA core " << useDLACore << " on a platform that doesn't have any DLA cores"
                      << std::endl;
            assert("Error: use DLA core on a platform that doesn't have any DLA cores" && false);
        }
        if (allowGPUFallback)
        {
            config->setFlag(BuilderFlag::kGPU_FALLBACK);
        }
        if (!builder->getInt8Mode() && !config->getFlag(BuilderFlag::kINT8))
        {
            // User has not requested INT8 Mode.
            // By default run in FP16 mode. FP32 mode is not permitted.
            builder->setFp16Mode(true);
            config->setFlag(BuilderFlag::kFP16);
        }
        config->setDefaultDeviceType(DeviceType::kDLA);
        config->setDLACore(useDLACore);
        config->setFlag(BuilderFlag::kSTRICT_TYPES);
    }
}
// When converting (building) the model:
    // Build the engine
    builder->setMaxBatchSize(BATCH_SIZE);
    //config->setMaxWorkspaceSize(1_GiB);
    config->setMaxWorkspaceSize(1 * (1 << 20)); // 1 MiB
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    std::cout << "**********************************DLA***********************" << std::endl;
    // On the NX, see /usr/src/tensorrt/samples/common/common.h
    std::cout << "start dla." << std::endl;
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(1); // DLA core index (an int)
    config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);

    std::cout << "start building engine" << std::endl;
    engine = builder->buildEngineWithConfig(*network, *config);
    std::cout << "build engine done" << std::endl;
    assert(engine);
    parser->destroy();
    nvinfer1::IHostMemory *data = engine->serialize();
    std::ofstream file;
    file.open(filename, std::ios::binary | std::ios::out);
    std::cout << "writing engine file..." << std::endl;
    file.write((const char *)data->data(), data->size());
    std::cout << "save engine file done" << std::endl;
    file.close();
    network->destroy();
    builder->destroy();
// At inference time:
    // deserialize the engine
    IRuntime* runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    if (gArgs.useDLACore >= 0)
    {
        runtime->setDLACore(gArgs.useDLACore);
    }
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream->data(


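13: Testing mmpose on the NX

Depending on the scene, this can be used for event detection:
people and vehicles, person analysis, and so on.

After pedestrian detection, the keypoints are localized. Time taken to localize the 17 keypoints of a single person:

                    NX
Preprocessing:      1.1 ms
Inference:          31 ms
Postprocessing:     0.6 ms
Total per frame:    32.7 ms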
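
As a quick sanity check on these numbers, the sketch below adds up the three stages and converts the total into a frame rate. It assumes the stages run back to back for one person per frame; StageTimesMs, totalMs and framesPerSecond are names invented for the example.

```cpp
#include <cassert>
#include <cmath>

// Per-stage latency in milliseconds for one person.
struct StageTimesMs {
    double pre;    // preprocessing
    double infer;  // network inference
    double post;   // postprocessing
};

// Total latency when the stages run sequentially.
inline double totalMs(const StageTimesMs& t) {
    return t.pre + t.infer + t.post;
}

// Frames per second that this latency allows.
inline double framesPerSecond(const StageTimesMs& t) {
    const double total = totalMs(t);
    assert(total > 0.0);
    return 1000.0 / total;
}
```

For the values in the table (1.1, 31 and 0.6 ms) the total is 32.7 ms, which works out to roughly 30.6 frames per second for a single person.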

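Nose-0, neck-1, right shoulder-2, right elbow-3, right wrist-4, left shoulder-5, left elbow-6, left wrist-7, right hip-8, right knee-9, right ankle-10, left hip-11, left knee-12, left ankle-13, right eye-14, left eye-15, right ear-16, left ear-17, background-18.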

How to map the results back onto the original image for visualization:

// prepareImage
std::vector<float> prepareImage(std::vector<cv::Mat> &vec_img) {
    std::vector<float> result(BATCH_SIZE * IMAGE_WIDTH * IMAGE_HEIGHT * INPUT_CHANNEL);
    float *data = result.data();
    for (const cv::Mat &src_img : vec_img)
    {
        if (!src_img.data)
            continue;
        float ratio = std::min(float(IMAGE_WIDTH) / float(src_img.cols), float(IMAGE_HEIGHT) / float(src_img.rows));
        cv::Mat flt_img = cv::Mat::zeros(cv::Size(IMAGE_WIDTH, IMAGE_HEIGHT), CV_8UC3);
        cv::Mat rsz_img;
        cv::resize(src_img, rsz_img, cv::Size(), ratio, ratio);
        rsz_img.copyTo(flt_img(cv::Rect(0, 0, rsz_img.cols, rsz_img.rows)));
        flt_img.convertTo(flt_img, CV_32FC3, 1.0 / 255);

        //HWC TO CHW
        std::vector<cv::Mat> split_img(INPUT_CHANNEL);
        cv::split(flt_img, split_img);

        int channelLength = IMAGE_WIDTH * IMAGE_HEIGHT;
        for (int i = 0; i < INPUT_CHANNEL; ++i)
        {
            split_img[i] = (split_img[i] - img_mean[i]) / img_std[i];
            memcpy(data, split_img[i].data, channelLength * sizeof(float));
            data += channelLength;
        }
    }
    return result;
}

// postProcess
std::vector<std::vector<KeyPoint>> postProcess(const std::vector<cv::Mat> &vec_Mat, float *output, const int &outSize) {
    std::vector<std::vector<KeyPoint>> vec_key_points;
    int feature_size = IMAGE_WIDTH * IMAGE_HEIGHT / 16;
    int index = 0;
    for (const cv::Mat &src_img : vec_Mat) {
        std::vector<KeyPoint> key_points = std::vector<KeyPoint>(num_key_points);
        float ratio = std::max(float(src_img.cols) / float(IMAGE_WIDTH), float(src_img.rows) / float(IMAGE_HEIGHT));
        float *current_person = output + index * outSize;
        for (int number = 0; number < num_key_points; number++) {
            float *current_point = current_person + feature_size * number;
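            auto max_pos = std::max_element(current_point, current_point + feature_size);
            key_points[number].prob = *max_pos;
            const int feature_w = IMAGE_WIDTH / 4;
            const int pos = int(max_pos - current_point);
            const int col = pos % feature_w;
            const int row = pos / feature_w;
            // Shift a quarter pixel toward the larger neighbour; skip the heatmap borders to avoid out-of-range reads.
            float x = col;
            float y = row;
            if (col > 0 && col < feature_w - 1)
                x += (*(max_pos + 1) > *(max_pos - 1)) ? 0.25f : -0.25f;
            if (pos >= feature_w && pos + feature_w < feature_size)
                y += (*(max_pos + feature_w) > *(max_pos - feature_w)) ? 0.25f : -0.25f;
            key_points[number].x = int(x * ratio * 4);
            key_points[number].y = int(y * ratio * 4);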
            key_points[number].number = number;
        }
        vec_key_points.push_back(key_points);
        index++;
    }


    return vec_key_points;
}
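
The coordinate mapping used above can be checked in isolation. prepareImage scales the image by min(IMAGE_WIDTH / cols, IMAGE_HEIGHT / rows) and pastes it at the top-left corner, so postProcess only has to multiply by the heatmap stride (4) and divide by that ratio, which is the same as multiplying by the max(...) ratio it computes. Below is a minimal sketch with no OpenCV dependency; Point2f, letterboxRatio and heatmapToSource are hypothetical names, and the sizes in the note that follows are made up.

```cpp
#include <algorithm>
#include <cassert>

struct Point2f {
    float x;
    float y;
};

// Scale factor applied to the source image so that it fits inside the
// network input while keeping its aspect ratio.
inline float letterboxRatio(int src_w, int src_h, int in_w, int in_h) {
    assert(src_w > 0 && src_h > 0 && in_w > 0 && in_h > 0);
    return std::min(float(in_w) / float(src_w), float(in_h) / float(src_h));
}

// Map a heatmap coordinate back to the source image. The heatmap is
// `stride` times smaller than the network input, and the resized image
// was pasted at the top-left corner, so no offset is needed.
inline Point2f heatmapToSource(Point2f hm, int src_w, int src_h,
                               int in_w, int in_h, int stride = 4) {
    const float ratio = letterboxRatio(src_w, src_h, in_w, in_h);
    return {hm.x * stride / ratio, hm.y * stride / ratio};
}
```

For a 640×480 source and a 320×320 network input the ratio is 0.5, so the heatmap point (10, 20) maps to (80, 160) in the original image.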

13: Postprocessing with CUDA

1. Example:
A segmentation network for drivable-area detection.
Input 512×512×3, output 512×512×2.

Kernel launch configuration:

  dim3 dimBlock(64);
  dim3 dimGrid((512 + 63) / 64,512,batch_size);
  getmax<<<dimGrid, dimBlock>>>(tensor, outputDevice,512);

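The threads run in parallel; for each pixel, the class with the larger score decides whether it belongs to the drivable area:

__global__ void getmax(float *input, uchar *result, int feature_w) {
  int col = threadIdx.x + blockDim.x * blockIdx.x;  // pixel column
  int row = blockIdx.y;                             // pixel row
  int batch = blockIdx.z;                           // image index within the batch
  if (col < feature_w) {
    // Linear pixel index; gridDim.y is the feature-map height.
    int n = (batch * gridDim.y + row) * feature_w + col;
    // The two class scores of a pixel are adjacent: input[2n] and input[2n + 1].
    result[n] = (input[n * 2] > input[n * 2 + 1]) ? 0 : 255;
  }
}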

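The two class scores of the same pixel are stored next to each other:
0 1
2 3
4 5
6 7
. .
2n 2n+1
so pixel n is classified by comparing input[n * 2] with input[n * 2 + 1].
The total number of threads required is h * w * batch_size.
For speed, each block contains 64 threads.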
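
When debugging a kernel like this, it helps to have a CPU reference to compare against. The following is a minimal sketch under the same layout assumption (two interleaved scores per pixel); ceilDiv and argmaxMaskCpu are names I made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of blocks needed to cover `n` elements with `block` threads each,
// the same rounding as (512 + 63) / 64 in the launch configuration.
inline int ceilDiv(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// CPU reference: `scores` holds batch * height * width pixels with two
// interleaved floats per pixel. The output is 0 where class 0 has the
// larger score and 255 otherwise, the same rule as the kernel.
std::vector<std::uint8_t> argmaxMaskCpu(const std::vector<float>& scores) {
    assert(scores.size() % 2 == 0);
    const std::size_t num_pixels = scores.size() / 2;
    std::vector<std::uint8_t> mask(num_pixels);
    for (std::size_t n = 0; n < num_pixels; ++n) {
        mask[n] = (scores[2 * n] > scores[2 * n + 1]) ? 0 : 255;
    }
    return mask;
}
```

Copying the device result back to the host and comparing it element by element with this mask is a cheap way to catch indexing mistakes.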

2. Example:
Using CUDA for the preprocessing and postprocessing of YOLOv5.
For a COCO model, the per-box results are collected in output_xywh_pro_index:


{
int ntypes = 25200 * 6 * sizeof(float);
        cudaMalloc((void **)&output_xywh_pro_index, 1 * 25200 * 6 * sizeof(float));
        CHECK(cudaGetLastError());

        Detforward_gpu_box(static_cast<float *>(buffers[1]), output_xywh_pro_index);

        cudaMemcpyAsync(out_result, output_xywh_pro_index, ntypes, cudaMemcpyDeviceToHost);
        std::cout << "Memcpy ok." << std::endl;

        boxes = postProcess_gpu(src, out_result, outSize);
}

// The kernel is defined further down; it is declared here so that the launcher compiles.
__global__ void YoloProposal_box(const float *tensor, const int FEATURE_SIZE_NUM, float *output);

void Detforward_gpu_box(const float *input,
                        float *output_xywh_pro_index)
{
    std::cout << "Into Detforward_gpu  box ." << std::endl;
    // One block per candidate box, one thread per block.
    dim3 dimBlock(1);
    dim3 dimGrid(25200);
    YoloProposal_box<<<dimGrid, dimBlock>>>(input, 25200, output_xywh_pro_index);
    cudaError_t cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess)
    {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    }
    std::cout << "set kernel ok." << std::endl;
}


__global__ void YoloProposal_box(const float *tensor, const int FEATURE_SIZE_NUM, float *output)
{
    // One candidate box per block (the launch uses a single thread per block).
    int idx = blockIdx.x;
    if (idx >= FEATURE_SIZE_NUM)
    {
        return;
    }
    // Each candidate has 85 values: cx, cy, w, h, objectness, then 80 class scores.
    float cx = tensor[idx * 85 + 0];
    float cy = tensor[idx * 85 + 1];
    float w = tensor[idx * 85 + 2];
    float h = tensor[idx * 85 + 3];
    float score = tensor[idx * 85 + 4];

    // Find the best class; start from class 0 so that both variables are initialized.
    float objProb = tensor[idx * 85 + 5];
    int index = 0;
    for (int k = 1; k < 80; k++)
    {
        float prob_class = tensor[idx * 85 + 5 + k];
        if (prob_class >= objProb)
        {
            index = k;
            objProb = prob_class;
        }
    }
    // Compact output: box, combined confidence, class index.
    output[idx * 6 + 0] = cx;
    output[idx * 6 + 1] = cy;
    output[idx * 6 + 2] = w;
    output[idx * 6 + 3] = h;
    output[idx * 6 + 4] = objProb * score;
    output[idx * 6 + 5] = index;
}
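
The same reduction can be written on the CPU to validate the kernel output. This is only a sketch under the layout used above (85 values per candidate in, 6 values out); the constant names and reduceCandidatesCpu are my own, and ties between class scores go to the higher class index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNumClasses = 80;             // COCO classes
constexpr int kInStride = 5 + kNumClasses;  // cx, cy, w, h, objectness, class scores
constexpr int kOutStride = 6;               // cx, cy, w, h, confidence, class index

// CPU reference: reduce every 85-value candidate to 6 values.
std::vector<float> reduceCandidatesCpu(const std::vector<float>& tensor) {
    assert(tensor.size() % kInStride == 0);
    const std::size_t num = tensor.size() / kInStride;
    std::vector<float> out(num * kOutStride);
    for (std::size_t i = 0; i < num; ++i) {
        const float* row = tensor.data() + i * kInStride;
        // Index of the class with the highest score.
        int best = 0;
        for (int k = 1; k < kNumClasses; ++k) {
            if (row[5 + k] >= row[5 + best]) {
                best = k;
            }
        }
        float* dst = out.data() + i * kOutStride;
        dst[0] = row[0];
        dst[1] = row[1];
        dst[2] = row[2];
        dst[3] = row[3];
        dst[4] = row[5 + best] * row[4];  // class score times objectness
        dst[5] = static_cast<float>(best);
    }
    return out;
}
```

Running it on the host copy of buffers[1] and comparing against out_result gives a simple regression test for the GPU path.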

14: Calling TensorRT operators from Paddle Inference

https://github.com/PaddlePaddle/Paddle-Inference-Demo

https://paddle-inference.readthedocs.io/en/latest

PaddlePaddle open-source framework repositories:

GitHub:

https://github.com/PaddlePaddle/Paddle

Gitee: 

https://gitee.com/paddlepaddle/Paddle

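After a model is loaded, it is represented as a topological graph of operator nodes. If TRT subgraph mode is enabled before running, then during graph analysis Paddle Inference finds the operator nodes that TRT can run, fuses the interconnected ones into a subgraph, and replaces that subgraph with a single TRT operator. At run time, whenever a TRT operator is encountered, the TRT engine is called to execute it.

Paddle 1.8 integrated the TRT subgraph for the ERNIE model and added support for dynamically shaped inputs.
During initialization, the operators executed by the TRT engine run all candidate kernels and pick the best one for the input shape, which ensures the best inference performance for the model.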
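
To illustrate the fusion idea only, here is a toy sketch that treats the model as a linear chain of operator names and replaces every run of consecutive TRT-capable operators with one placeholder node. This is not Paddle's implementation, which works on a general graph; fuseTrtRuns, the placeholder name trt_subgraph and the operator name custom_op are all invented for the example.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model of subgraph fusion on a linear chain of operators: each
// maximal run of operators found in `trt_supported` becomes a single
// "trt_subgraph" node; all other operators are kept as they are.
std::vector<std::string> fuseTrtRuns(const std::vector<std::string>& ops,
                                     const std::set<std::string>& trt_supported) {
    std::vector<std::string> fused;
    bool in_run = false;  // true while inside a run of supported operators
    for (const std::string& op : ops) {
        if (trt_supported.count(op) > 0) {
            if (!in_run) {
                fused.push_back("trt_subgraph");
                in_run = true;
            }
        } else {
            fused.push_back(op);
            in_run = false;
        }
    }
    return fused;
}
```

With the chain feed, conv2d, relu, custom_op, conv2d, fetch and the supported set {conv2d, relu}, the result is feed, trt_subgraph, custom_op, trt_subgraph, fetch: the unsupported operator splits the model into two TRT subgraphs.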

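/usr/bin/ld: cannot find -lcudart

If the build stops with the error above, followed by

collect2: error: ld returned 1 exit status

the linker cannot find the CUDA runtime library. Creating a symbolic link in /usr/lib fixes it:

sudo ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so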
