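Categorical features are usually stored as strings, so they generally have to be encoded as numeric values before a model can use them. The point of this post is that LightGBM can be trained on the string values directly, without a manual encoding step. Note: XGBoost has traditionally required such features to be one-hot encoded (or otherwise converted to numbers) before they can be fed to the model.
Here is the LightGBM code: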
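For contrast, the conventional encoding step mentioned above might look like the following. This is only a minimal sketch using `pandas.get_dummies`; the tiny `toy` frame and its values are made up for illustration and are not part of the data used in the rest of the post.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one string-valued categorical column.
toy = pd.DataFrame({"a": [0, 1, 2, 3], "b": ["tag", "bga", "efd", "tag"]})

# One-hot encode column "b"; dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["b"], dtype=int)

# Each distinct string becomes its own indicator column.
print(encoded.columns.tolist())
```

With many distinct values this adds one column per value, which is the overhead that passing the column to LightGBM directly avoids.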
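from random import randint
import pandas as pd

# Build a toy dataset: a numeric column, a string-valued categorical column, and a random 0/1 label
a = [i for i in range(1000)]
b = ["tag","bga","efd","rfh","esg","tyh"]
c = [b[randint(0,5)] for i in range(1000)]  # pick a random category for each row
d = [randint(0,1) for i in range(1000)]     # random binary label
tmp = []
for i in range(1000):
    tmp.append([a[i],c[i],d[i]])
df = pd.DataFrame(tmp,columns=["a","b","label"])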
import lightgbm
df["b"] = df["b"].astype('category') # required: fit() raises an error on a plain string (object) column
cf = lightgbm.LGBMClassifier(max_depth=3)
cf.fit(df[["a","b"]],df["label"],categorical_feature=["b"]) # remember this parameter; it takes a list of column names
from sklearn.metrics import accuracy_score
print(accuracy_score(df["label"].values, cf.predict(df[["a","b"]]))) # accuracy on the training data itself
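To see what the `astype('category')` step actually does, here is a small pandas-only sketch. The series values are made up for illustration. As far as I understand, LightGBM works with the integer codes behind a categorical column rather than the strings themselves, but treat that as an assumption and not a statement about its internals.

```python
import pandas as pd

# Hypothetical sample values, converted the same way as column "b" in the post.
s = pd.Series(["tag", "bga", "efd", "tag"]).astype("category")

# The distinct values are stored once, in sorted order...
print(s.cat.categories.tolist())
# ...and each row holds an integer code pointing into that list.
print(s.cat.codes.tolist())

# Reusing the training dtype on new data keeps the string-to-code mapping stable;
# a value never seen before becomes missing (code -1).
new_s = pd.Series(["efd", "zzz"]).astype(s.dtype)
print(new_s.cat.codes.tolist())
```

Reusing `s.dtype` for later data is one way to make sure the same string always maps to the same code at prediction time.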
It's that easy!