Pros and cons of three ways to save machine learning models: pickle, joblib, and PMML
joblib
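joblib is the efficient model-persistence module traditionally used with sklearn; it saves a model to disk as a binary file.
Its main advantage is efficiency, particularly for objects that hold large NumPy arrays, and loading is also relatively faster than with pickle. Beyond persistence, joblib offers transparent disk caching with lazy re-evaluation (the memoize pattern) and simple parallel computing.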
from sklearn2pmml import PMMLPipeline, sklearn2pmml
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pickle
from sklearn.datasets import load_iris
from sklearn import tree
# Train a decision tree on the iris data, wrapped in a PMMLPipeline
iris = load_iris()
clf = tree.DecisionTreeClassifier()
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(iris.data, iris.target)
# Save the fitted pipeline to disk, then load it back
joblib.dump(pipeline, '20200607_decisiontree.pkl')
j1 = joblib.load('20200607_decisiontree.pkl')
pickle
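pickle offers two approaches: pickle.dumps serializes the model to an in-memory bytes object (a str only in Python 2), while with open('xx.txt', 'wb') as f: pickle.dump(model, f) writes the model to an already opened file.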
a1 = pickle.dumps(pipeline)
a2 = pickle.loads(a1)
a2
Output
PMMLPipeline(steps=[('classifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'))])
with open('./pickle.txt', 'wb') as f:
    pickle.dump(pipeline, f)
with open('./pickle.txt', 'rb') as f:
    a3 = pickle.load(f)
assert(j1 == a2)
Output
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-7-09ffb1282370> in <module>
----> 1 assert(j1 == a2)
AssertionError:
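The assertion fails, so the object deserialized by pickle and the object returned by joblib.load do not compare equal. This does not mean the two models behave differently: scikit-learn estimators do not define __eq__, so == most likely falls back to object identity, and any two separately loaded objects compare unequal. To check that two loaded models are equivalent, compare their parameters or their predictions instead.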
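The identity-versus-equivalence point can be illustrated without scikit-learn. The sketch below uses a hypothetical TinyModel class (not part of any library, introduced only for illustration) that, like an estimator, defines no __eq__. Two independent pickle round trips of the same object compare unequal with ==, yet carry the same state and give the same predictions.

```python
import pickle


class TinyModel:
    """Hypothetical stand-in for an estimator; it defines no __eq__."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        # Label a value 1 when it is strictly above the threshold, else 0.
        return [int(v > self.threshold) for v in values]


model = TinyModel(threshold=0.5)

# Two independent round trips produce two distinct objects.
copy_a = pickle.loads(pickle.dumps(model))
copy_b = pickle.loads(pickle.dumps(model))

# Without __eq__, == compares identity, so this is False.
same_object = copy_a == copy_b

# Comparing state and behavior is the meaningful check.
same_state = copy_a.__dict__ == copy_b.__dict__
samples = [0.1, 0.7, 0.5, 0.9]
same_predictions = copy_a.predict(samples) == copy_b.predict(samples)

print(same_object, same_state, same_predictions)  # False True True
```

For a real fitted pipeline, the analogous check would compare get_params() results or the arrays returned by predict on the same input (for example with numpy.array_equal) rather than the objects themselves.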
pmml
clf = tree.DecisionTreeClassifier()
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(iris.data, iris.target)
sklearn2pmml(pipeline, "DecisionTreeIris.pmml", with_repr = True)
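Efficiency
joblib is the fastest, followed by pickle and PMML (which of these two comes second and which third was not tested).
For cross-platform deployment, choose PMML: it is an XML-based standard, so the saved model can be loaded by runtimes outside Python.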
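Since the ranking above was not measured, a small timing harness is one way to check it. The sketch below is an assumption-laden illustration, not a benchmark from the original post: time_roundtrip, pickle_dump, pickle_load, and the payload dictionary are all hypothetical names, and only the standard library is used, so pickle is the only format actually timed here.

```python
import os
import pickle
import tempfile
import time


def time_roundtrip(dump_fn, load_fn, obj, repeats=5):
    """Return (best dump seconds, best load seconds, last restored object).

    dump_fn is called as dump_fn(obj, path); load_fn as load_fn(path).
    """
    dump_times, load_times = [], []
    restored = None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.bin")
        for _ in range(repeats):
            start = time.perf_counter()
            dump_fn(obj, path)
            dump_times.append(time.perf_counter() - start)

            start = time.perf_counter()
            restored = load_fn(path)
            load_times.append(time.perf_counter() - start)
    return min(dump_times), min(load_times), restored


def pickle_dump(obj, path):
    # Adapter so pickle matches the (obj, path) calling convention.
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder payload standing in for a fitted model.
payload = {"weights": list(range(100_000))}
dump_s, load_s, restored = time_roundtrip(pickle_dump, pickle_load, payload)
print(f"pickle: dump {dump_s:.4f}s, load {load_s:.4f}s")
```

Because joblib.dump takes (object, filename) and joblib.load takes (filename), they can be passed to time_roundtrip directly in place of the pickle adapters, with the fitted pipeline as obj. Results will depend on model size and hardware, so any ranking should be treated as specific to the setup it was measured on.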