Model
The main classifiers in the sklearn.naive_bayes module:
GaussianNB
MultinomialNB
BernoulliNB
Differences and key parameters: GaussianNB assumes each feature is continuous and normally distributed within a class, MultinomialNB is intended for non-negative count features, and BernoulliNB is intended for binary features.
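A minimal sketch of the three classifiers with their key constructor parameters (the values shown are the sklearn defaults):

from sklearn import naive_bayes

# GaussianNB: continuous features, assumed Gaussian within each class;
# var_smoothing adds a fraction of the largest variance for numerical stability
gaussian = naive_bayes.GaussianNB(priors=None, var_smoothing=1e-9)
# MultinomialNB: non-negative count features; alpha is the Laplace/Lidstone smoothing term
multi = naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
# BernoulliNB: binary features; binarize is the threshold used to map inputs to 0/1
bernoul = naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)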
Preprocessing
import pandas as pd
path = "../Data/classify.csv"
rawdata = pd.read_csv(path)
X = rawdata.iloc[:, :13]          # first 13 columns as features
Y = rawdata.iloc[:, 14]           # target column, labels {"A": 0, "B": 1, "C": 2}
Y = pd.Categorical(Y).codes       # encode A/B/C as 0/1/2
All of the features are continuous, so GaussianNB should be the best fit; all three classifiers are tried here anyway.
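The training code below references x_train / x_test, which the preprocessing step above does not create; a minimal sketch of the missing split, assuming a 70/30 ratio and a fixed random_state:

from sklearn.model_selection import train_test_split

# hold out 30% of the samples for evaluation (ratio and random_state are assumptions)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)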
Modeling
from sklearn import naive_bayes

gaussian = naive_bayes.GaussianNB()
multi = naive_bayes.MultinomialNB()
bernoul = naive_bayes.BernoulliNB()
models = [gaussian, multi, bernoul]
Training and evaluation
from sklearn.metrics import recall_score

def evaluate_model(model):
    model.fit(x_train, y_train)
    acu_train = model.score(x_train, y_train)    # training accuracy
    acu_test = model.score(x_test, y_test)       # test accuracy
    y_pred = model.predict(x_test)
    recall = recall_score(y_test, y_pred, average="macro")   # macro-averaged recall
    return acu_train, acu_test, recall
result = {
    "acu_train": [],
    "acu_test": [],
    "recall": []
}
for each in models:
    acu_train, acu_test, recall = evaluate_model(each)
    result["acu_train"].append(acu_train)
    result["acu_test"].append(acu_test)
    result["recall"].append(recall)
Results
Accuracy on the test set reaches 76%.
Retraining after feature selection
Correlation among the features
Select the features that are most strongly and most weakly correlated with the other features (see the sketch after the two lists below).
features_w = ['CHAS', 'RM', 'PTRATIO', 'B', 'LSTAT']   # weakly correlated with the other features
features_s = ['CRIM', 'RAD', 'TAX', 'AGE', 'DIS']      # strongly correlated with the other features
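One way to arrive at these two groups is to rank each feature by its average absolute correlation with the other features; a minimal sketch (this selection rule is an assumption, the original does not show how the lists were derived):

corr = X.corr().abs()                                # absolute correlation matrix of the 13 features
mean_corr = (corr.sum() - 1) / (corr.shape[0] - 1)   # mean |corr| with the other features (diagonal removed)
print(mean_corr.sort_values().head(5).index.tolist())   # weakly correlated candidates (cf. features_w)
print(mean_corr.sort_values().tail(5).index.tolist())   # strongly correlated candidates (cf. features_s)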
Retrain
def selected_bys(features):
    # keep only the selected feature columns
    x2_train = x_train[features]
    x2_test = x_test[features]
    model = naive_bayes.GaussianNB()
    model.fit(x2_train, y_train)
    acu_train = model.score(x2_train, y_train)
    acu_test = model.score(x2_test, y_test)
    y_pred = model.predict(x2_test)
    recall = recall_score(y_test, y_pred, average="macro")
    return acu_train, acu_test, recall
selected_bys(features_w)
selected_bys(features_s)
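Collecting both runs into one table makes the comparison explicit (row labels assumed):

pd.DataFrame(
    [selected_bys(features_w), selected_bys(features_s)],
    index=["weakly correlated", "strongly correlated"],
    columns=["acu_train", "acu_test", "recall"],
)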
Comparison of results
This confirms what "naive" means in Naive Bayes: the lower the correlation among features, i.e. the closer they are to being independent, the better the model performs; strongly correlated features degrade its classification performance.