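You don't appear to be splitting your dataset into separate training and test sets. As a result, your classifier is probably overfitting the dataset and may not perform well on samples from outside it.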
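Try randomly selecting (say) 75% of the data for training, then measuring accuracy on the remaining 25%. For example, replace the last part of your code with:

import random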
dataset, labels = load_csv('data/basketball.train.csv')
random.shuffle(dataset)
split_index = int(len(dataset) * 0.75)
train_dataset = dataset[:split_index]
test_dataset = dataset[split_index:]
mytree = createTree(train_dataset, labels)
predictions = []
# Feature names in column order, passed explicitly to classify()
feature_names = ["location", "w", "final_margin", "shot_number", "period",
                 "game_clock", "shot_clock", "dribbles", "touch_time",
                 "shot_dist", "pts_type", "close_def_dist"]
for row in test_dataset:
    # The first 12 columns are the features; the last column is the label
    prediction = classify(mytree, feature_names, row[:len(feature_names)])
    # print('Expected=%s, Got=%s' % (row[-1], prediction))
    predictions.append(prediction)
actual = [row[-1] for row in test_dataset]
accuracy = accuracy_metric(actual, predictions)
print(accuracy)
(Note: untested)
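
If it helps to check the split and the accuracy calculation separately from the tree code, here is a minimal standalone sketch. The names `split_rows` and the toy `rows` data are hypothetical, and this `accuracy_metric` is only a stand-in that assumes a percentage is wanted; your own version may return a fraction instead.

```python
import random


def split_rows(rows, train_fraction=0.75, seed=None):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)   # local RNG, so passing a seed makes the split repeatable
    shuffled = list(rows)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]


def accuracy_metric(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    if not actual:
        return 0.0
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0


# Toy rows (an assumption for illustration): 12 feature columns, then the label.
rows = [[i] * 12 + [i % 2] for i in range(100)]
train_rows, test_rows = split_rows(rows, train_fraction=0.75, seed=0)
print(len(train_rows), len(test_rows))
```

Fixing the seed while debugging keeps the same rows in the test set between runs, so a change in accuracy reflects a change in the classifier and not a different shuffle.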