The original code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score
admissions = pd.read_csv("./data/admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)
kf = KFold(len(admissions), 5, shuffle=True, random_state=8)
lr = LogisticRegression()
accuracies = cross_val_score(lr,admissions[["gpa"]], admissions["actual_label"], scoring="roc_auc", cv=kf)
average_accuracy = sum(accuracies) / len(accuracies)
print(accuracies)
print(average_accuracy)
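The code itself is correct for the scikit-learn version it was written against. However, the sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20, so on a newer installation running it fails with the following error: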
ModuleNotFoundError: No module named 'sklearn.cross_validation'
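Changing from sklearn.cross_validation import KFold to from sklearn.model_selection import KFold (and doing the same for cross_val_score) fixes the import, but running the code again reveals a new problem: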
TypeError: __init__() got multiple values for argument 'shuffle'
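This happens because the two modules define the KFold constructor differently. If you import it with from sklearn.cross_validation import KFold, the first argument is the number of samples n and the second is the number of folds: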
KFold(n,5,shuffle=False)
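But if you import it with from sklearn.model_selection import KFold, the constructor no longer takes the number of samples; the first argument is the number of folds, and the data is supplied later through split(). An old-style call such as KFold(len(admissions), 5, shuffle=True) therefore passes 5 in the position of shuffle while also giving shuffle=True by keyword, which is what triggers the TypeError. The new form is: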
fold = KFold(5,shuffle=False)
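The argument clash described above can be reproduced without scikit-learn. The following is only a sketch: KFoldLike is a hypothetical stand-in class, not part of any library, whose parameter order mirrors the model_selection version of KFold as described in this post. The default values are placeholders.

```python
class KFoldLike:
    """Hypothetical stand-in that mirrors the parameter order
    (n_splits, shuffle, random_state) of model_selection.KFold."""

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state


n_samples = 100  # stands in for len(admissions)

# Old-style call: n_samples lands on n_splits, 5 lands on shuffle,
# and shuffle=True then supplies the same parameter a second time.
try:
    KFoldLike(n_samples, 5, shuffle=True, random_state=8)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# New-style call: only the number of folds is passed positionally.
kf = KFoldLike(5, shuffle=True, random_state=8)
print(kf.n_splits, kf.shuffle)
```

Recent scikit-learn releases make shuffle and random_state keyword-only, so the exact wording of the error may differ there, but the fix is the same: drop the sample count from the constructor call.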
The corrected code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
admissions = pd.read_csv("./data/admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)
kf = KFold(5, shuffle=True, random_state=8)
lr = LogisticRegression()
accuracies = cross_val_score(lr,admissions[["gpa"]], admissions["actual_label"], scoring="roc_auc", cv=kf)
average_accuracy = sum(accuracies) / len(accuracies)
print(accuracies)
print(average_accuracy)
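For readers who do not have admissions.csv at hand, the same pattern can be tried on synthetic data. This is a minimal sketch, assuming scikit-learn 0.18 or later and NumPy are installed. The gpa values and admit labels below are randomly generated for illustration and are not the data from this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the admissions data (assumption, not the real file).
rng = np.random.RandomState(8)
gpa = rng.uniform(2.0, 4.0, size=200)
admit = (gpa + rng.normal(0.0, 0.5, size=200) > 3.0).astype(int)
X = gpa.reshape(-1, 1)  # one feature column, like admissions[["gpa"]]

# Only the number of folds goes into the constructor.
kf = KFold(n_splits=5, shuffle=True, random_state=8)

# The sample count is inferred when the data is passed to split().
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]
print(fold_sizes)

# cross_val_score calls kf.split() internally when kf is given as cv.
aucs = cross_val_score(LogisticRegression(), X, admit, scoring="roc_auc", cv=kf)
print(aucs)
print(aucs.mean())
```

With 200 rows and 5 folds, each test fold holds 40 rows. Note that with scoring="roc_auc" the returned values are AUC scores rather than accuracies, even though the post's variable is named accuracies.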