众所周知,模型训练阶段是很容易的,把数据预处理好,然后把算法包调用下,调个参数就基本可以了,现实中涉及模型的上线我们就需要把模型转成pwwl进行上线。模型上线工作量大,开发周期长。而对于实际的数据分析人员而言,如果能够开发sql语句却十分的方便,本文提供一个决策树模型转sql代码上线的案例如下:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# 定义随机种子和样本数
random_state = 42
n_samples = 10000
# 生成模拟数据集
X, y = make_classification(n_samples=n_samples, n_features=10, n_informative=5,
n_classes=2, weights=[0.8, 0.2], random_state=random_state)
# 将特征和标签转换为 Pandas 数据框
data = pd.DataFrame(X, columns=[f'feature{i}' for i in range(1, 11)])
# 添加几个类别型特征
cat_feature_1 = ['A', 'B', 'C', 'D']
cat_feature_2 = ['E', 'F', 'G', 'H']
data['cat_var1'] = np.random.choice(cat_feature_1, size=n_samples)
data['cat_var2'] = np.random.choice(cat_feature_2, size=n_samples)
# 添加一些噪声特征
noise = np.random.normal(0, 1, size=(n_samples, 3))
data['noise1'] = noise[:, 0]
data['noise2'] = noise[:, 1]
data['noise3'] = noise[:, 2]
data['target'] = y
# 打乱数据集并保存到 CSV 文件中
data = data.sample(frac=1, random_state=random_state).reset_index(drop=True)
data['cat_var1']=data['cat_var1'].replace({'A':1, 'B':2, 'C':3, 'D':4})
data['cat_var2']=data['cat_var2'].replace({'E':1, 'F':2, 'G':3, 'H':4})
X,y=data.drop(['target'],axis=1),data['target']
案例数据准备好以后,我们就开始训练模型
from sklearn import tree
from sklearn.tree import export_text
# 创建并训练决策树回归模型
clf = tree.DecisionTreeClassifier(criterion='entropy',max_depth=3)
clf.fit(X, y)
# 使用export_text函数获取决策树的文本表示
tree_text = export_text(clf,feature_names=list(X.columns))
# 打印决策树的文本表示
print("决策树的节点信息:")
print(tree_text)
决策树的节点信息: |--- feature5 <= -0.21 | |--- feature1 <= -1.79 | | |--- feature5 <= -2.15 | | | |--- class: 0 | | |--- feature5 > -2.15 | | | |--- class: 1 | |--- feature1 > -1.79 | | |--- feature7 <= 0.68 | | | |--- class: 0 | | |--- feature7 > 0.68 | | | |--- class: 0 |--- feature5 > -0.21 | |--- feature1 <= -0.55 | | |--- feature5 <= 1.83 | | | |--- class: 1 | | |--- feature5 > 1.83 | | | |--- class: 0 | |--- feature1 > -0.55 | | |--- feature7 <= 0.05 | | | |--- class: 0 | | |--- feature7 > 0.05 | | | |--- class: 0
这里我们通过 export_text方法解析决策树模型的结构,然后通过自定义代export_graphviz_to_sql转成sql的case when 模型的概率值
tree.export_graphviz(clf,out_file='tree_rule.dot',
feature_names=list(X.columns),
class_names=['0','1'],
filled=True,
node_ids=True,
rounded=True,
special_characters=True,
proportion=True)
# 调用插件
from export_graphviz_to_sql import to_sql
to_sql('tree_rule.dot', 'tree_rule.sql')
=====Successfully converted decision tree to SQL======
最终的sql代码如下,这个模式是概率形式的。如果单纯想要target值,建议大家用graphviz_sql即可
select case
when feature5 <= -0.205 and feature1 <= -1.79 and feature5 <= -2.15 then 0.0
when feature5 <= -0.205 and feature1 <= -1.79 and feature5 > -2.15 then 0.981
when feature5 <= -0.205 and feature1 > -1.79 and feature7 <= 0.675 then 0.009
when feature5 <= -0.205 and feature1 > -1.79 and feature7 > 0.675 then 0.425
when feature5 > -0.205 and feature1 <= -0.547 and feature5 <= 1.834 then 0.887
when feature5 > -0.205 and feature1 <= -0.547 and feature5 > 1.834 then 0.07
when feature5 > -0.205 and feature1 > -0.547 and feature7 <= 0.048 then 0.108
when feature5 > -0.205 and feature1 > -0.547 and feature7 > 0.048 then 0.472
else 0 end;
export_graphviz_to_sql包的下载路径我这边也给,大家可以自己去下载使用