Qualcomm® AI Engine Direct User Guide (5)
4.1.2 HTP - QNN HTP Backend Extensions
QNN HTP Backend Extensions
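The qnn-net-run utility is backend agnostic, meaning it can only use generic QNN APIs. The backend extensions feature makes it possible to use backend-specific APIs, namely custom configurations. More documentation on backend extensions can be found under qnn-net-run. Note that the scope of QNN backend extensions is limited to qnn-net-run.
HTP backend extensions is an interface that provides custom options for the HTP backend. It is also required to enable the different performance modes. These options and performance modes can be exercised by providing the extension shared library libQnnHtpNetRunExtensions.so and, if necessary, a config file.
To use backend extension parameters with qnn-net-run, pass the --config_file argument with the path to a JSON file.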
$ qnn-net-run --model <qnn_model_name.so> \
--backend <path_to_model_library>/libQnnHtp.so \
--output_dir <output_dir_for_result> \
--input_list <path_to_input_list.txt> \
--config_file <path to JSON of backend extensions>
The config file above, with the minimum parameters (such as the backend extensions config) specified through JSON, is shown below:
{
"backend_extensions" :
{
"shared_library_path" : "path_to_shared_library", // give path to shared extensions library (.so)
"config_file_path" : "path_to_config_file" // give path to backend config
}
}
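The top-level file above could also be generated programmatically. The following is a minimal Python sketch, not part of the QNN SDK: the helper name write_extension_config and the file names are hypothetical, and it assumes only the two keys shown above. Treating config_file_path as optional follows the "if necessary" wording earlier in this section.

```python
import json
import tempfile
from pathlib import Path


def write_extension_config(out_path, shared_library_path, config_file_path=None):
    """Write the top-level JSON passed to qnn-net-run via --config_file.

    Hypothetical helper: only the two keys shown in this section are emitted.
    config_file_path is skipped when it is None (assumed to be optional).
    """
    backend_extensions = {"shared_library_path": str(shared_library_path)}
    if config_file_path is not None:
        backend_extensions["config_file_path"] = str(config_file_path)
    # json.dumps emits strict JSON, so the explanatory comments are omitted.
    text = json.dumps({"backend_extensions": backend_extensions}, indent=2)
    Path(out_path).write_text(text)
    return Path(out_path)


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    out = Path(tempfile.mkdtemp()) / "htp_extensions.json"
    write_extension_config(out, "libQnnHtpNetRunExtensions.so", "htp_backend_config.json")
    print(out.read_text())
```

The resulting file path is what would be supplied to the --config_file argument.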
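Users can set custom options and different performance modes for the HTP backend through the backend config. The options available in the config are shown below: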
{
"type": "object", "properties": {
"graphs": {
"type": "object", "properties": {
// Corresponds to the graph name provided to QnnGraph_create
// Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
"graph_names": {"type": "array", "items": {"type": "string"}},
// Provides performance infrastructure configuration options that are memory specific [optional]
// Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
"vtcm_mb": {"type": "integer"},
// Used to perform computation with half precision i.e. 16 bits [optional] [default: 0]
// Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
"fp16_relaxed_precision": {"type": "integer"},
// Corresponds to the number of HVX threads to use for a particular graph during an inference.
// Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
"hvx_threads": {"type": "integer"},
// Set Graph optimization value in range 1 to 3 [optional] [default: 2]
// 1 = Faster preparation time, less optimal graph, 2 = Longer preparation time, more optimal graph
// 3 = Longest preparation time, most likely even more optimal graph
// Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
"O": {"type": "number", "multipleOf": 1},
// Provide deep learning bandwidth compression value 0 or 1 [optional] [default: 0]
// Used by qnn-net-run during online prepare and qnn-context-binary-generator uses it during offline preparation
"dlbc": {"type": "number", "multipleOf": 1}
}
},
"devices": {
"type": "array", "items": {
"type": "object", "properties": {
// Selection of the device [optional] [default: 0]
// Used by qnn-net-run
"device_id": {"type": "integer"},
// Selection of the SoC [optional] [default: 0]
// Used by qnn-net-run and qnn-context-binary-generator
"soc_id": {"type": "integer"},
// Set dsp architecture value [optional] [default: NONE]
// Used by qnn-net-run and qnn-context-binary-generator
"dsp_arch": {"type": "string"},
// Specifies the user pd attribute [optional] [default: "unsigned"]
// Used by qnn-net-run and qnn-context-binary-generator
"pd_session": {"type": "string"},
// Used for linting profiling level [optional] [default: not set]
// Used by qnn-net-run and qnn-context-binary-generator
"profiling_level": {"type": "string"},
// Specifies whether to use null context or not. true means using a unique power context id, and false means using null context.
// NOTE: This parameter is not supported for v68 onwards
// Used by qnn-net-run
"use_client_context": {"type": "boolean"},
"cores": {
"type": "array", "items": {
"type": "object", "properties": {
// Select the core [optional] [default: 0]
// Used by qnn-net-run
"core_id": {"type": "integer"},
// Provide performance profile [optional] [default: "high_performance"]
// Used by qnn-net-run
// NOTE: command line perf profile option is now deprecated.
"perf_profile": {"type": "string"},
// Rpc control latency value in micro second [optional] [default: 100us]
// Used by qnn-net-run
"rpc_control_latency": {"type": "integer"},
// Rpc polling time value in micro second [optional]
// [default: 9999 us for burst, high_performance & sustained_high_performance, 0 us for other perf profiles]
// Used by qnn-net-run
"rpc_polling_time": {"type": "integer"},
// Hmx timeout value in micro second [optional] [default: 300000us]
// Used by qnn-net-run
"hmx_timeout_us": {"type": "integer"}
}
}
}
}
}
},
"context": {
"type": "object", "properties": {
// Used for enabling Weight Sharing [optional] [default: false]
// Used by qnn-context-binary-generator during offline preparation
"weight_sharing_enabled": {"type": "boolean"},
// Used to associate max spill-fill buffer size across multiple contexts within a group [optional] [default: Not Set]
// Used by qnn-net-run and throughput-net-run during offline preparation. group_id value must be set to 0 for this option to be used.
"max_spill_fill_buffer_for_group": {"type": "integer"},
// Specifies the group id to which contexts can be associated [optional] [default: None]
// Used by qnn-net-run and throughput-net-run during offline preparation.
"group_id": {"type": "integer"}
}
}
}
}
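To make the graph-level constraints in the schema concrete, the sketch below builds one set of graph options and checks the two ranges the schema comments state (O in 1 to 3, dlbc 0 or 1). The function name make_graph_options and the sample values are hypothetical; the defaults mirror the schema comments, and omitting vtcm_mb and hvx_threads when unset is an assumption.

```python
import json


def make_graph_options(graph_names, vtcm_mb=None, hvx_threads=None,
                       fp16_relaxed_precision=0, O=2, dlbc=0):
    """Build one graph options object using the keys from the schema above.

    Hypothetical helper. Only the ranges stated in the schema comments
    are validated: O must be 1, 2 or 3 and dlbc must be 0 or 1.
    """
    if O not in (1, 2, 3):
        raise ValueError("O must be 1, 2 or 3")
    if dlbc not in (0, 1):
        raise ValueError("dlbc must be 0 or 1")
    options = {
        "graph_names": list(graph_names),
        "fp16_relaxed_precision": fp16_relaxed_precision,
        "O": O,
        "dlbc": dlbc,
    }
    # Assumption: these keys are simply left out when no value is given.
    if vtcm_mb is not None:
        options["vtcm_mb"] = vtcm_mb
    if hvx_threads is not None:
        options["hvx_threads"] = hvx_threads
    return options


if __name__ == "__main__":
    # "my_graph", 8 and 4 are placeholder values, not recommendations.
    print(json.dumps(make_graph_options(["my_graph"], vtcm_mb=8, hvx_threads=4, O=3), indent=2))
```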
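Refer to Qnn_SocModel_t when setting the soc_id parameter. Note that the graphs object shown here is deprecated starting with SDK 2.20 and is replaced by an array of graphs, as shown below: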
{
"graphs": [
{
.....
},
.....
]
}
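An existing config that still uses the deprecated graphs object could be adapted with a small script. The sketch below assumes that wrapping the old object in a one-element array is sufficient, which this section implies but does not state explicitly; the function name migrate_graphs_to_array is hypothetical.

```python
import copy
import json


def migrate_graphs_to_array(config):
    """Return a copy of config whose "graphs" value is an array.

    Assumption: the deprecated single graphs object maps to an array
    holding that one object. Configs already in array form, or without
    a "graphs" key, are returned unchanged (as a copy).
    """
    migrated = copy.deepcopy(config)
    graphs = migrated.get("graphs")
    if isinstance(graphs, dict):
        migrated["graphs"] = [graphs]
    return migrated


if __name__ == "__main__":
    # "my_graph" is a placeholder graph name.
    old_style = {"graphs": {"graph_names": ["my_graph"], "O": 3}}
    print(json.dumps(migrate_graphs_to_array(old_style), indent=2))
```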
Performance Modes with HTP Backend Extensions
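Backend extension performance modes can be enabled through the backend config using the perf_profile parameter, as shown above. Valid settings are low_balanced, balanced, default, high_performance, sustained_high_performance, burst, low_power_saver, power_saver, high_power_saver, extreme_power_saver, and system_settings. These performance modes use different configurations of core clocks, bus clocks, DCVS, and sleep latency. Three voltage corners are defined, TURBO, NOM, and SVS, each with different minimum and maximum frequency thresholds. Besides the maximum and minimum voltage corners, the maximum and minimum frequencies supported by the target can also be set. For more details on performance mode configuration and its parameters, refer to the Hexagon SDK documentation. The settings used by the different performance modes are shown in the table below:
The table above is ordered by performance, from the most performant mode (BURST) to the least performant (EXTREME_POWER_SAVER). BURST and SUSTAINED_HIGH_PERFORMANCE use a timer during execution, which keeps the votes high across all inferences and avoids repeated up and down performance voting until the timeout expires. They have low sleep latency and disable DCVS during execution. DCVS can both raise and lower the core and bus clock speeds, with the min_corner and max_corner votes serving as lower and upper bound guidelines. BURST has the highest frequencies and votes and delivers the best performance. POWER_SAVER, LOW_POWER_SAVER, and HIGH_POWER_SAVER run at lower frequencies and do not support voting. They have high sleep latency and enable DCVS during execution. EXTREME_POWER_SAVER is the least performant mode but saves the most power. For more details on the voltage corners used by these performance modes, refer to the program listing for the file QnnHtpPerfInfrastructure.h.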
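The profile names and the rpc_polling_time defaults described in this section can be captured in a small lookup. The sketch below is illustrative only: the names VALID_PERF_PROFILES and default_rpc_polling_time are hypothetical, and the defaults are taken from the schema comment (9999 us for burst, high_performance and sustained_high_performance, 0 us for the other profiles).

```python
# Profile names as listed in this section.
VALID_PERF_PROFILES = (
    "low_balanced", "balanced", "default", "high_performance",
    "sustained_high_performance", "burst", "low_power_saver",
    "power_saver", "high_power_saver", "extreme_power_saver",
    "system_settings",
)

# Profiles whose documented default rpc_polling_time is 9999 us.
HIGH_POLLING_PROFILES = frozenset(
    {"burst", "high_performance", "sustained_high_performance"}
)


def default_rpc_polling_time(perf_profile):
    """Return the documented default rpc_polling_time in microseconds."""
    if perf_profile not in VALID_PERF_PROFILES:
        raise ValueError(f"unknown perf_profile: {perf_profile!r}")
    return 9999 if perf_profile in HIGH_POLLING_PROFILES else 0


if __name__ == "__main__":
    for name in VALID_PERF_PROFILES:
        print(name, default_rpc_polling_time(name))
```

Rejecting unknown names early is a design choice of this sketch; how qnn-net-run itself treats an unrecognized profile is not covered here.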
The following config can be used to set the performance profile and the rpc polling time:
{
"graphs": {
...
...
},
"devices": [
{
...
"cores":[
{
"perf_profile": "burst", // use this to set any of the above performance profile
"rpc_polling_time": 9999, // use this to set rpc polling, ranges 0-9999 us
"rpc_control_latency": 100 // use to set rpc control latency
}
]
}
]
}
Note that the config structure above is deprecated starting with SDK 2.20. The new supported config is shown below:
{
"graphs": [
{
...
...
}
....
],
"devices": [
{
...
"cores":[
{
"perf_profile": "burst", // use this to set any of the above performance profile
"rpc_polling_time": 9999, // use this to set rpc polling, ranges 0-9999 us
"rpc_control_latency": 100 // use to set rpc control latency
}
]
}
]
}
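The array-based layout can also be assembled in code, which avoids hand-editing mistakes such as a missing comma. The sketch below is a minimal example under stated assumptions: the helper names make_core_options and make_backend_config are hypothetical, the defaults follow the schema comments (high_performance, 100 us control latency), and only the 0 to 9999 us range documented for rpc_polling_time is checked.

```python
import json


def make_core_options(perf_profile="high_performance", rpc_polling_time=None,
                      rpc_control_latency=100):
    """Build one entry of the "cores" array (hypothetical helper)."""
    core = {
        "perf_profile": perf_profile,
        "rpc_control_latency": rpc_control_latency,
    }
    if rpc_polling_time is not None:
        # Range taken from the comment in the config above.
        if not 0 <= rpc_polling_time <= 9999:
            raise ValueError("rpc_polling_time must be within 0-9999 us")
        core["rpc_polling_time"] = rpc_polling_time
    return core


def make_backend_config(graphs, cores):
    """Assemble the config with "graphs" as an array and a single device."""
    return {"graphs": list(graphs), "devices": [{"cores": list(cores)}]}


if __name__ == "__main__":
    # "my_graph" is a placeholder graph name.
    config = make_backend_config(
        graphs=[{"graph_names": ["my_graph"]}],
        cores=[make_core_options("burst", rpc_polling_time=9999)],
    )
    # Strict JSON output: no comments are written.
    print(json.dumps(config, indent=2))
```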