
# Acceleration: Model Adaptation Development

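## [Jensen Huang's Nvidia story](https://ruanyifeng.com/blog/2023/06/weekly-issue-257.html) <a href="#page-title" id="page-title"></a>

We designed graphics cards for the Windows platform for ten years.

The products were popular, but there was one problem: **people used these cards only for gaming and could not use them for other accelerated computing.** Back then the GPU had to be used through Windows interfaces. Constrained by the operating system, users could not operate the GPU directly and could hardly put it to their own uses.

To broaden what the GPU could do, **in 2007 we launched the CUDA framework, which lets users work with the GPU's low-level interfaces** and write custom programs for their own accelerated-computing needs. Since then the GPU has been usable for scientific computing, physics simulation, and many other fields.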

## CUDA

[My first CUDA code](https://mp.weixin.qq.com/s/h2XKth1bTujnrxyXTJ2fwg)

rmzk GPU acceleration: NVIDIA [NPP](https://docs.nvidia.com/cuda/npp/nppi_conventions_lb.html) for data processing, [TensorRT](https://github.com/NVIDIA/TensorRT) for model acceleration.

## Model workflow

![](/files/-Me-LfFfSRxMqjrtNu4b)

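## NPU development documentation

1. [Ascend community documentation](https://www.hiascend.com/document?tag=community-developer) (remember to select the CANN version, in the upper-right area)
2. Application development, [read online](https://support.huaweicloud.com/aclcppdevg-cann502alpha5infer/atlasdevelopment_01_0001.html) (because of the front-end layout there are no figures, so it reads worse than the PDF)
3. Application development, [PDF version](https://support.huaweicloud.com/aclcppdevg-cann502alpha2infer/aclcppdevg-cann502alpha2infer.pdf) (the version I read, rich in text and figures: Ascend CANN community edition (5.0.2.alpha001))
4. [Huawei enterprise user documentation site](https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373?category=developer-documents\&subcategory=application-development)

* Environment deployment (installation) documentation ([CANN software installation](https://support.huaweicloud.com/instg-cli-cann502-alpha005/atlasdeploy_03_0002.html), development and runtime scenarios, via the command line)
* ATC model conversion documentation ([Model compression (inference)](https://support.huaweicloud.com/auxiliarydevtool-cann502alpha5infer/atlasinfertool_16_0002.html))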

`pip3.7.5 install --user decorator` installs a package for a specific Python version (Linux).

> ATC model conversion requires the following Python packages:
>
> `pip3.7.5 install --user attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests`

#### Comparison: NVIDIA

TensorRT    [Introduction (Chinese)](https://developer.nvidia.com/zh-cn/tensorrt)    [Developer guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html)   [Sample support guide](https://docs.nvidia.com/deeplearning/tensorrt/sample-support-guide/index.html#samples)    [GitHub](https://github.com/NVIDIA/TensorRT)

`trtexec --onnx=crnn.onnx --explicitBatch --shapes=input_1:1x1x32x100 --workspace=10240`

> Running the original model directly also works, but it is slow. Quantizing the model speeds up computation, which helps make full use of the compute on cloud GPU servers with expensive monthly rent.
>
> Python wraps the C++ code through SWIG and calls the packaged SDK.

## Quick-start examples

### Examples

1. [Image classification with the Caffe ResNet-50 network (synchronous inference)](https://gitee.com/ascend/samples/blob/master/cplusplus/level2_simple_inference/1_classification/resnet50_imagenet_classification/README_CN.md)
2. [Image classification with the Caffe ResNet-50 network (asynchronous inference)](https://gitee.com/ascend/samples/blob/master/cplusplus/level2_simple_inference/1_classification/resnet50_async_imagenet_classification/README_CN.md)
3. [Image classification with the Caffe ResNet-50 network (video decoding + synchronous inference)](https://gitee.com/ascend/samples/blob/master/cplusplus/level2_simple_inference/1_classification/vdec_resnet50_classification/README_CN.md)

### Environment variables

View the environment variables: `export -p`

![](/files/-MehNYdfDC8MqbbIyH52)

### Converting to an .om model

![](/files/-Mee2L4PipnSC2AhCYhn)

* `--soc_version`: Ascend310 (our company uses the 310)
* `--input_format`: NCHW, i.e. \[batch, channels, height, width], the storage layout of the data tensor. TensorFlow uses NHWC.
* `--input_fp16_nodes`
* `--output_type`
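
The difference between the NCHW and NHWC layouts mentioned for `--input_format` can be sketched in plain Python. This is only an illustration of the index arithmetic; the function names are hypothetical and are not part of ATC or any Ascend API.

```python
def nchw_offset(n, c, h, w, shape):
    """Flat offset of element (n, c, h, w) in a buffer stored as NCHW."""
    _, C, H, W = shape
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, shape):
    """Flat offset of the same element when the buffer is stored as NHWC."""
    _, C, H, W = shape
    return ((n * H + h) * W + w) * C + c


def nhwc_to_nchw(flat, shape):
    """Reorder a flat NHWC buffer into a flat NCHW buffer."""
    N, C, H, W = shape
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nhwc_offset(n, c, h, w, shape)
                    dst = nchw_offset(n, c, h, w, shape)
                    out[dst] = flat[src]
    return out


shape = (1, 3, 2, 2)  # N, C, H, W
# One 2x2 RGB image stored pixel by pixel (NHWC): R0 G0 B0, R1 G1 B1, ...
nhwc = ["R0", "G0", "B0", "R1", "G1", "B1",
        "R2", "G2", "B2", "R3", "G3", "B3"]
nchw = nhwc_to_nchw(nhwc, shape)
print(nchw)  # all R values first, then all G, then all B
```

In NCHW each channel forms one contiguous plane, while NHWC interleaves the channels of each pixel.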

### Converting to an .onnx model

[torch.onnx.export()](https://github.com/ultralytics/yolov5/blob/master/export.py)

> `img` is one batch. We use a batch size of 16 or 32, i.e. a tensor of shape (16 or 32) \* C \* H \* W
>
> opset\_version = 11

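### ATC model conversion background

#### Default data type with the best hardware performance (ATC default precision: fp16)

| Platform             | onnx | Atlas | T4   |
| -------------------- | ---- | ----- | ---- |
| Best-performing type | fp32 | fp16  | int8 |

### ATC command-line options

#### --input\_shape="input:16,3,448,448"

`input` is the node name that the model developer sets when exporting the model. It is specified in code (`input_names` in the export script).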

```python
"""
pkl2onnx.py
"""
import os
import torch 
import torch.nn as nn
from torchvision import models

os.environ["CUDA_VISIBLE_DEVICES"] = "6"

model_path = 'models/Sign_res34_zx_204/20210722/Intermediate/Epoch13_Loss_0.0017.pkl'
onnx_path = 'models/Sign_res34_zx_204/20210722/Intermediate/model726_E13_batch16_softmax.onnx'
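# Alternative: rebuild the architecture and load only the weights (state_dict).
# model = models.resnet34(pretrained=True)
# num_fits = model.fc.in_features
# model.fc = nn.Linear(num_fits, 124)  # replace the final fully connected layer
# model.load_state_dict(torch.load(model_path))

# The checkpoint stores the whole model, presumably wrapped in nn.DataParallel;
# .module unwraps the underlying network.
model = torch.load(model_path).module
model.eval()

# Dummy input that fixes the exported shape: batch 16, 3 x 224 x 224.
x = torch.randn(16, 3, 224, 224, requires_grad=True).cuda()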
torch.onnx.export(
    model, 
    x, 
    onnx_path,
    export_params=True,
    opset_version=11,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output']
)
```
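
The node name and shape passed to `--input_shape` have to match what the export script fixed. Below is a minimal sketch of assembling such an `atc` command line in Python. The flags `--soc_version`, `--input_format`, and `--input_shape` come from these notes. `--model`, `--framework` (where 5 selects ONNX), and `--output` are taken from my understanding of the ATC documentation and should be checked against the CANN version in use. The file names are hypothetical.

```python
import shlex


def build_atc_command(onnx_path, output_prefix, input_name, input_shape,
                      soc_version="Ascend310", input_format="NCHW"):
    """Return the atc argument list for converting an ONNX model to .om."""
    dims = ",".join(str(d) for d in input_shape)
    return [
        "atc",
        f"--model={onnx_path}",
        "--framework=5",  # assumption: 5 means ONNX in ATC
        f"--output={output_prefix}",
        f"--soc_version={soc_version}",
        f"--input_format={input_format}",
        # node name must equal input_names[0] used in torch.onnx.export
        f"--input_shape={input_name}:{dims}",
    ]


# Hypothetical file names; shape follows the dummy input of the export script.
cmd = build_atc_command("model.onnx", "model", "input", (16, 3, 224, 224))
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a machine where the CANN toolkit is installed.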

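#### --insert\_op\_conf (AIPP)

#### 2.6.1 Enabling AIPP

AIPP (Artificial Intelligence Pre-Processing) performs image preprocessing on the **AI Core**, **including resizing images, color space conversion (converting the image format), and mean subtraction / coefficient multiplication (changing pixel values)**. The actual model inference runs after this processing.

After data preprocessing with **DVPP**, the DVPP components, for reasons of processing speed and memory footprint, impose many restrictions on the output image: for example, the output width and height must be aligned and the output format must be YUV420SP. Models, however, usually take RGB or BGR input, and input image sizes vary. **AIPP was introduced for this reason**: its color space conversion produces the image format the model requires, and its padding function produces images that satisfy the width and height alignment.

AIPP is either static or dynamic, depending on how it is configured. To convert the original image into the format that inference requires, use the color space conversion function. To output images of a fixed size, use the Crop and Padding functions that AIPP provides.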
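
The width and height alignment mentioned above can be sketched as a small calculation. The alignment values used here (width to a multiple of 16, height to a multiple of 2) are an assumption for illustration and should be checked against the DVPP documentation for the chip and CANN version in use. The function names are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def yuv420sp_buffer_size(width, height, width_align=16, height_align=2):
    """Return (aligned width, aligned height, buffer size in bytes).

    YUV420SP stores a full-resolution Y plane plus a half-size interleaved
    UV plane, so the buffer takes 1.5 bytes per aligned pixel.
    """
    w_stride = align_up(width, width_align)
    h_stride = align_up(height, height_align)
    return w_stride, h_stride, w_stride * h_stride * 3 // 2


print(yuv420sp_buffer_size(500, 333))
print(yuv420sp_buffer_size(224, 224))
```

A 500 x 333 picture ends up in a 512 x 334 buffer under these assumed alignments, which is why a model expecting a fixed input size needs Crop or Padding afterwards.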

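ATC model conversion documentation, [Model compression (inference)](https://support.huaweicloud.com/auxiliarydevtool-cann502alpha5infer/atlasinfertool_16_0002.html):

> 2.6.1.4 Color space conversion configuration
>
> YUV420SP\_U8 to BGR → JPEG configuration

#### **2.6.1.7 Configuration file template**

Configuration file template → ***annotated description of each parameter***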

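### How are the AIPP mean and variance determined?

They come from the hyperparameters used when the model was trained in Python, namely the image preprocessing applied when the dataloader reads images.

Mean (`mean_chn_i`): the training mean hyperparameter **multiplied by 255**. Variance (`var_reci_chn_i`): the reciprocal of the training standard deviation, divided by 255 (**1/std/255**).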
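
The rule above can be checked numerically. The sketch below assumes the training pipeline normalizes as `(pixel / 255 - mean) / std` and that AIPP computes `(pixel - mean_chn) * var_reci_chn` with `min_chn` left at 0. The mean and std values are the common ImageNet ones, used only as an example, and the helper names are hypothetical.

```python
def aipp_params(mean, std):
    """Convert training-time mean/std (for 0..1 inputs) to AIPP values."""
    mean_chn = [m * 255.0 for m in mean]
    var_reci_chn = [1.0 / (s * 255.0) for s in std]
    return mean_chn, var_reci_chn


def train_normalize(pixel, mean, std):
    """Normalization as done in the training dataloader (assumed form)."""
    return (pixel / 255.0 - mean) / std


def aipp_normalize(pixel, mean_chn, var_reci_chn):
    """Normalization as AIPP applies it to raw 0..255 pixels (min_chn = 0)."""
    return (pixel - mean_chn) * var_reci_chn


# Example values only (ImageNet statistics), not taken from the notes.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
mean_chn, var_reci_chn = aipp_params(mean, std)

for i in range(3):
    a = train_normalize(128, mean[i], std[i])
    b = aipp_normalize(128, mean_chn[i], var_reci_chn[i])
    print(f"channel {i}: mean_chn={mean_chn[i]:.3f} "
          f"var_reci_chn={var_reci_chn[i]:.6f} diff={abs(a - b):.2e}")
```

Both forms give the same result up to floating-point rounding, because `(x/255 - mean)/std` equals `(x - 255*mean) * 1/(255*std)`.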

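### **Some thoughts: model acceleration and model optimization**

> PyTorch uses the float32 data type by default (forward pass, backward pass, accuracy computation, and so on).
>
> Adaptation for acceleration: float16 or int8 speeds up computation but loses precision. float64 improves precision but slows down inference.
>
> Using int8 requires [model quantization](https://zhuanlan.zhihu.com/p/79744430).
>
> Data preprocessing already requires a precision conversion (see the PyTorch official tutorial in the figure below): the 8-bit integer range 0\~255 is no longer enough.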

![](/files/-MeiyCxuN07DK4msYHPk)

![](/files/-Mej12eWW1JwivwTnN5S)
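
The precision loss of float16 and int8 can be shown with the standard library alone. This is a minimal sketch: the int8 part uses simple symmetric per-tensor quantization, which is one common scheme and not necessarily what ATC or TensorRT apply. The function names are hypothetical.

```python
import struct


def to_float16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_int8(values):
    """Symmetric quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]


print("0.1 as float16:", to_float16(0.1))

weights = [-1.0, -0.3, 0.0, 0.123, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print("int8 codes:", q)
print("max round-trip error:", max(errors))
```

The round-trip error of this scheme is bounded by half a quantization step (`scale / 2`), which is the precision traded for speed.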

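## Engineering workflow

### Use existing methods to write the best-performing code.

> Avoid reinventing the wheel, do not be [impatient](https://mp.weixin.qq.com/s/-vol6K5RHp301yTv1O65PA), and take the time to understand the requirements.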

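## Documentation study notes

### 11.4.1 aclrtCreateContext: explicitly creates a Context (synchronous interface)

> Explicitly creates a Context. The **Context contains 2 Streams**: 1 default Stream and 1 Stream that performs internal synchronization. Synchronous interface.

If multiple Contexts are created in one process (the number of Contexts depends on Streams, and the number of Streams is limited; see aclrtCreateStream), the current **thread** can use **only one of those Contexts** at **any given moment**. It is recommended to **specify the current thread's Context explicitly** through *aclrtSetCurrentContext*, which makes the program easier to maintain.

### 11.6.1 aclrtCreateStream: creates a Stream (synchronous interface)

Function prototype: `aclError aclrtCreateStream(aclrtStream *stream)`

> Surprisingly, the prototype does not bind the Stream to any context or device.

### 7.3 Requesting and releasing runtime management resources

The default Context and default Stream are **released automatically** after `aclrtResetDevice` is called.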

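### 11.3.4 aclrtGetRunMode: gets the run mode of the current Ascend AI software stack (synchronous interface)

The returned value indicates the run mode.

* 0    ACL\_DEVICE: the Ascend AI software stack runs on the Device's Control CPU or in a board environment.
* 1    ACL\_HOST: the Ascend AI software stack runs on the Host CPU.

### 8.1 Stream management

Calling `aclrtSynchronizeStream` blocks the application until all tasks in the specified Stream have completed: **aclrtSynchronizeStream(stream);**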

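### Memory allocation

* 11.8.1 aclrtMalloc
  * Allocates linear memory of `size` bytes on the Device and returns a pointer to the allocated memory through `*devPtr`. Synchronous interface.
* 11.8.6 aclrtMallocHost
  * Allocates memory on the Host or Device; memory on the Device is allocated in normal pages. Synchronous interface.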

![](/files/-Mf88BdmVpY2HEiq1mxf)

### aclrtMemcpy: memory copy

A memory copy can be synchronous or asynchronous.

> Documentation
>
> 11.8.10 aclrtMemcpy
>
> 11.8.11 aclrtMemcpyAsync

```cpp
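// Synchronous memory copy. Arguments: destination, destination size, source, byte count, direction.
// Host to Device: hostPtrA is the source pointer on the Host, devPtrB the destination pointer on the Device, size the number of bytes.
aclrtMemcpy(devPtrB, size, hostPtrA, size, ACL_MEMCPY_HOST_TO_DEVICE);
// Device to Device: devPtrA is the source pointer on the Device, devPtrB the destination pointer on the Device.
aclrtMemcpy(devPtrB, size, devPtrA, size, ACL_MEMCPY_DEVICE_TO_DEVICE);
// Host to Host: hostPtrA is the source pointer on the Host, hostPtrB the destination pointer on the Host.
aclrtMemcpy(hostPtrB, size, hostPtrA, size, ACL_MEMCPY_HOST_TO_HOST);
// Device to Host: devPtrA is the source pointer on the Device, hostPtrB the destination pointer on the Host.
aclrtMemcpy(hostPtrB, size, devPtrA, size, ACL_MEMCPY_DEVICE_TO_HOST);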
```
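
Because `aclrtMemcpy` takes the destination first and the source third, it is easy to pass a direction constant that contradicts the pointers. The rule can be sketched in Python as a tiny lookup. This is only an illustration that works with the names of the four direction constants as strings; it does not call ACL, and `memcpy_kind` is a hypothetical helper.

```python
HOST = "HOST"
DEVICE = "DEVICE"


def memcpy_kind(dst, src):
    """Name of the direction constant for a copy from src to dst.

    The argument order mirrors aclrtMemcpy: destination first, source second.
    The constant itself is always named SOURCE_TO_DESTINATION.
    """
    for location in (dst, src):
        if location not in (HOST, DEVICE):
            raise ValueError(f"unknown memory location: {location!r}")
    return f"ACL_MEMCPY_{src}_TO_{dst}"


# Copying an inference result from the Device back to the Host:
print(memcpy_kind(dst=HOST, src=DEVICE))
# Uploading input data from the Host to the Device:
print(memcpy_kind(dst=DEVICE, src=HOST))
```

The direction constant is named source-to-destination even though the destination comes first in the call.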

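## Troubleshooting program errors

### ACL logs

Black box: hardware chip logs (not application development logs).

Path: `/root/ascend` (not the black-box path). These are the logs meant for programmers: the device log and plog (a condensed version).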

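## NPU basics

### Things to know

* One card has 4 chips, numbered 0\~3.
* More threads are not always better, because scheduling between threads takes time. Find the optimal thread count for the task.
* A function call acquires resources, starts executing, and has its stack reclaimed when it finishes. (Both startup and reclamation cost time and performance.)
* Sample frames after video decoding.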
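
Finding the optimal thread count is an empirical exercise. The sketch below times a stand-in task at several pool sizes and picks the fastest. `fake_task` is hypothetical and only sleeps to imitate waiting on I/O or a device; a real measurement would submit the actual decode or inference call instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(_):
    """Stand-in for one unit of work that mostly waits (I/O, device)."""
    time.sleep(0.002)


def time_with_threads(num_threads, num_tasks=32):
    """Wall-clock seconds to finish num_tasks using num_threads workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(fake_task, range(num_tasks)))
    return time.perf_counter() - start


def best_thread_count(candidates=(1, 2, 4, 8, 16)):
    """Measure every candidate and return (fastest count, all timings)."""
    timings = {n: time_with_threads(n) for n in candidates}
    return min(timings, key=timings.get), timings


best, timings = best_thread_count()
for n, seconds in timings.items():
    print(f"{n:2d} threads: {seconds * 1000:.1f} ms")
print("fastest:", best)
```

The numbers vary from run to run and from machine to machine, so the measurement is worth repeating on the target hardware with the real workload.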

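### Hardware memory

A single chip has 8 GB. After subtracting its own system usage and the code's usage, roughly 4 GB is expected to be available. A 720p image takes an estimated 1 MB, so at most about 4k images fit; beyond that, memory runs out.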
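
The estimate above is a simple division, sketched here. The 4 GB budget and the 1 MB per image are the figures from the note. The raw frame sizes for 1280 x 720 in YUV420SP and RGB are my own additions for comparison and assume uncompressed, unpadded buffers.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_images(available_bytes, bytes_per_image):
    """How many images of the given size fit into the available memory."""
    return available_bytes // bytes_per_image


available = 4 * GIB  # usable memory per chip, as estimated in the note

note_estimate = max_images(available, 1 * MIB)         # 1 MB per image
yuv420sp_720p = max_images(available, 1280 * 720 * 3 // 2)  # 1.5 bytes/pixel
rgb_720p = max_images(available, 1280 * 720 * 3)            # 3 bytes/pixel

print("1 MB per image:", note_estimate)
print("raw 720p YUV420SP:", yuv420sp_720p)
print("raw 720p RGB:", rgb_720p)
```

Decoded frames are larger than the 1 MB estimate, so the practical limit is lower once images are held uncompressed in device memory.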

### Counterparts

| NVIDIA           | Huawei                                        |
| ---------------- | --------------------------------------------- |
| GPU              | NPU (AI accelerator card; compare Google TPU) |
| CUDA             | CANN                                          |
| PyTorch          | MindSpore                                     |
| RTX 3090 / Titan | Atlas                                         |

## Code conventions

### [ACL rule](https://gitee.com/ascend/samples/blob/master/CONTRIBUTING_CN.md)

![](/files/-MeYhta01TaGu43W8UsM)

![](/files/-MgL30PSR-QK_8-q0zki)