Notes on Using jpmml-xgboost

gdhuangsha

Tags: scikit-learn, machine learning, boosting

Article link: https://blog.csdn.net/gdhaugnsha/article/details/121740201

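I have recently been working on converting XGBoost models to PMML files. I tried this in both Python and R and tested several packages, but some did not meet my needs (for example, sklearn2pmml does not expose a way to handle missing values), and others are poorly maintained, so I did not dare rely on them. In the end I settled on jpmml-xgboost, a widely used and reliable package, to generate the PMML file.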

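Building and packaging jpmml-xgboost

If you are unable to build it yourself, you can use the Jar I have already packaged (download). It was compiled with JDK 8, so make sure Java 8 is installed.

Make sure JDK 8+ and Maven are installed on your machine
Clone the jpmml-xgboost project
As described in the project's README.md, run mvn clean install and wait for the build and packaging to finish

Once packaging finishes, a target/ folder is created in the project root containing several Jar files. The one needed next is jpmml-xgboost-executable-VERSION-SNAPSHOT.jar (keyword: executable).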

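Using jpmml-xgboost

Common parameters

jpmml-xgboost is simple to use yet powerful. The main parameters are:

--fmap-input: (required) the model's fmap file
--model-input: (required) the standard XGBoost model file
--pmml-output: (required) the name of the PMML file to generate
--missing-value: the missing value; one (or more) specified values can be passed in to be treated as missing
--X-ntree_limit: limits the number of trees of the XGBoost model that are used; defaults to all trees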

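The fmap file

A standard XGBoost model stores only the order of the input features, not their names, so the mapping between feature IDs and feature names must be supplied when generating the PMML. That is what the fmap file is for.

The official documentation describes the feature_map.txt file as follows: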

Format of featmap.txt: <featureid> <featurename> <q or i or int>\n :

Feature id must be from 0 to number of features, in sorted order.
i means this feature is binary indicator feature
q means this feature is a quantitative value, such as age, time, can be missing
int means this feature is integer value (when int is hinted, the decision boundary will be integer)
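
The format above is easy to produce from the list of input features. Below is a minimal sketch, not an official tool: the function name write_feature_map and the sample feature names are hypothetical, the default type of q is my assumption, and the use of a tab as the field separator is also an assumption (the quoted description does not name the separator).

```python
import os
import tempfile


def write_feature_map(feature_names, path, feature_types=None):
    """Write one tab-separated 'id, name, type' line per feature."""
    if feature_types is None:
        # Assumption: treat every feature as quantitative ('q') by default.
        feature_types = ["q"] * len(feature_names)
    if len(feature_types) != len(feature_names):
        raise ValueError("feature_names and feature_types differ in length")
    with open(path, "w", encoding="utf-8") as fmap:
        # Feature ids start at 0 and follow the order of the training columns.
        for feature_id, (name, kind) in enumerate(zip(feature_names, feature_types)):
            if kind not in ("q", "i", "int"):
                raise ValueError(f"unknown feature type: {kind!r}")
            # Assumption: fields are separated by a tab character.
            fmap.write(f"{feature_id}\t{name}\t{kind}\n")
    return path


# Hypothetical feature names, listed in the same order as the training columns.
demo_path = os.path.join(tempfile.mkdtemp(), "feature_map.txt")
write_feature_map(["age", "is_member", "visits"], demo_path, ["q", "i", "int"])
```

The list passed in should be in the same order as the columns used for training, since the model itself only remembers positions.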

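The standard XGBoost model file

The model file must be one saved through XGBoost's own model-saving API. Taking Python as an example, the model can be saved as follows:

If the model is trained with the xgboost.train interface, save it with the xgboost.Booster.save_model method
If the model is trained with the Scikit-Learn API provided by xgboost, save it with the save_model method of classes such as xgboost.XGBClassifier, xgboost.XGBRegressor and xgboost.XGBRanker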
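
Both cases can be handled by one small helper. This is only a sketch: save_native_model is a hypothetical name of mine, and the usage lines in the comments assume xgboost is installed and that X_train, y_train and model_path already exist.

```python
def save_native_model(model, path):
    """Save a Booster, or the Booster inside a Scikit-Learn wrapper, to path."""
    # The Scikit-Learn wrappers expose the underlying Booster via get_booster();
    # a plain Booster has no such method and is used directly.
    booster = model.get_booster() if hasattr(model, "get_booster") else model
    booster.save_model(path)
    return path


# Usage with xgboost installed (hypothetical variables, not executed here):
#   import xgboost as xgb
#   clf = xgb.XGBClassifier(n_estimators=100).fit(X_train, y_train)
#   save_native_model(clf, model_path)
```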

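Notes on other parameters

`--missing-value`

This is an important pitfall. XGBoost accepts missing values during training, but if the online system uniformly assigns -999 to missing values, the online and offline model outputs will disagree. My earlier workaround was to fill missing values with -999 before training, but that interferes with how XGBoost makes use of missing values.

The sklearn2pmml package does not expose a missing-value option, so missing values cannot be mapped and the only choice is to fill first and then train. With jpmml-xgboost, the model can be trained on missing values as usual, and the mapping is then specified with --missing-value when generating the PMML.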
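
To make the mapping concrete, here is a rough sketch of the same idea done by hand on the Python side, for instance to score online-style records with the original XGBoost model when comparing outputs. It is not part of jpmml-xgboost; the function name sentinel_to_nan and the record fields are hypothetical.

```python
import math


def sentinel_to_nan(record, sentinel=-999):
    """Return a copy of record with sentinel values replaced by NaN."""
    return {name: (math.nan if value == sentinel else value)
            for name, value in record.items()}


# Hypothetical online record in which a missing income was encoded as -999.
online_record = {"age": 35, "income": -999}
offline_record = sentinel_to_nan(online_record)
```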

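`--X-ntree_limit`

After training an XGBoost model, you do not necessarily want to use all of its trees for prediction. For example, if I train a model with 100 trees and it starts to overfit after 72 trees, with generalization getting worse, I can predict with only the first 72 trees instead of retraining with n_estimators=72. Passing --X-ntree_limit with a value of 72 when generating the PMML file does exactly this, which is very convenient.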

Viewing the jpmml-xgboost help

More parameters are listed in the help that jpmml-xgboost provides. It is concise and clear, and easy to follow at a glance.

> java -jar jpmml-xgboost-executable-1.5.jar --help
Usage: org.jpmml.xgboost.Main [options]
  Options:
    --X-compact
      Transform XGBoost-style trees to PMML-style trees
      Default: true
    --X-nan_as_missing
      Treat Not-a-Number (NaN) values as missing values
      Default: true
    --X-ntree_limit
      Limit the number of trees. Defaults to all trees
    --X-numeric
      Simplify non-numeric split conditions to numeric split conditions
      Default: true
    --X-prune
      Remove unreachable nodes
      Default: true
    --byte_order
      Endianness of XGBoost model input file. Possible values "BIG_ENDIAN"
      ("BE") or "LITTLE_ENDIAN" ("LE")
      Default: LITTLE_ENDIAN
    --charset
      Charset of XGBoost model input file
  * --fmap-input
      XGBoost feature map input file
    --help
      Show the list of configuration options and exit
    --json-path
      JSONPath expression of the JSON model element
      Default: $
    --missing-value
      String representation of feature value(s) that should be regarded as
      missing
  * --model-input
      XGBoost model input file
  * --pmml-output
      PMML output file
    --target-categories
      Target categories. Defaults to 0-based index [0, 1, .., num_class - 1]
    --target-name
      Target name. Defaults to "_target"

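Steps for converting an XGBoost model to PMML

Train the model and save it as a standard XGBoost model file
Generate the feature_map.txt file from the input features
Run the java -jar jpmml-xgboost-executable.jar command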
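
The last step can be scripted from Python. The sketch below only assembles the command line from the options shown in the help output; the helper names and all file names are hypothetical, and I am assuming that each option and its value are passed as separate, whitespace-separated arguments.

```python
import subprocess


def build_convert_command(jar_path, model_path, fmap_path, pmml_path,
                          missing_value=None, ntree_limit=None):
    """Assemble the java command line for the jpmml-xgboost executable Jar."""
    cmd = ["java", "-jar", jar_path,
           "--model-input", model_path,
           "--fmap-input", fmap_path,
           "--pmml-output", pmml_path]
    if missing_value is not None:
        # Value that the online system uses to encode a missing feature.
        cmd += ["--missing-value", str(missing_value)]
    if ntree_limit is not None:
        # Keep only the first ntree_limit trees of the model.
        cmd += ["--X-ntree_limit", str(ntree_limit)]
    return cmd


def run_conversion(cmd):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


# Hypothetical file names; adjust them to your own paths.
cmd = build_convert_command("jpmml-xgboost-executable-1.5.jar", "xgboost.model",
                            "feature_map.txt", "xgboost.pmml",
                            missing_value=-999, ntree_limit=72)
```

Calling run_conversion(cmd) then performs the conversion, provided java is on the PATH and the files exist.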

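After generating the PMML, it is also essential to verify that predictions from the PMML-based model are correct. The Python code for predicting from a PMML file is as follows: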

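from pypmml import Model

# Load the PMML file generated by jpmml-xgboost.
pmml_model = Model.load('model_file.pmml')
# Names of the model's input features (elided in the original post).
model_features = [ ... ]


def pmml_predict(model, row, features):
    """Score one DataFrame row and return the probability of class 1."""
    record = {feature: row[feature] for feature in features}
    return model.predict(record)['probability(1)']


# df is the pandas DataFrame that holds the validation samples.
df['pmml_model_score'] = df.apply(lambda row: pmml_predict(pmml_model, row, model_features),
                                  axis=1)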

Then simply check that df['pmml_model_score'] matches the values predicted by the XGBoost model.
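
Since both sets of scores are floating-point numbers, an exact equality check is likely to be too strict. A minimal sketch of a tolerance-based comparison follows; the helper names, the sample scores and the tolerance of 1e-5 are my own assumptions rather than anything prescribed by jpmml-xgboost or pypmml.

```python
def max_abs_diff(expected, actual):
    """Return the largest absolute difference between two score sequences."""
    if len(expected) != len(actual):
        raise ValueError("score sequences differ in length")
    return max((abs(e - a) for e, a in zip(expected, actual)), default=0.0)


def scores_match(expected, actual, tol=1e-5):
    """True when every pair of scores differs by at most tol."""
    return max_abs_diff(expected, actual) <= tol


# Hypothetical scores from the XGBoost model and from the PMML model.
xgb_scores = [0.1234567, 0.9000001, 0.5]
pmml_scores = [0.1234569, 0.9000000, 0.5]
consistent = scores_match(xgb_scores, pmml_scores)
```

With a DataFrame, the two columns can be passed in as lists, for example df['pmml_model_score'].tolist() together with a column of XGBoost scores.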
