Hold-out method
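- A typical ratio of training set to test set is 7:3 or 8:2.
- The split should keep the data distribution of the two subsets as consistent as possible, for example by keeping the class proportions the same in both (stratified sampling), so that the split itself does not introduce extra bias that affects the final result.
- Shuffling the sample set and splitting it again usually gives a different evaluation result, so a single hold-out estimate is unstable. Split the sample set randomly several times, repeat the experiment, and take the average.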
The implementation is as follows:
from random import shuffle

from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
target = iris.target

# Group the sample indices by class label so each class can be split separately.
dict_class = {}
for index in range(len(target)):
    if target[index] in dict_class:
        dict_class[target[index]].append(index)
    else:
        dict_class[target[index]] = [index]

# Shuffle the indices within each class.
for key in dict_class.keys():
    shuffle(dict_class[key])

test_ratio = 0.3
x_train, y_train = [], []
x_test, y_test = [], []

# Within each class, the first 70% of the shuffled indices go to the training
# set and the rest go to the test set, so both keep the class proportions.
for key, value in dict_class.items():
    split_point = int((1.0 - test_ratio) * len(value))
    list_train_index = value[:split_point]
    list_test_index = value[split_point:]
    for index_c in list_train_index:
        x_train.append(data[index_c])
        y_train.append(target[index_c])
    for index_c in list_test_index:
        x_test.append(data[index_c])
        y_test.append(target[index_c])

# Check that the resulting split matches the requested ratio.
print("x_train ratio: {0:.2f}".format(len(x_train) / (len(x_train) + len(x_test))))
print("x_test ratio: {0:.2f}".format(len(x_test) / (len(x_train) + len(x_test))))
print("y_train ratio: {0:.2f}".format(len(y_train) / (len(y_train) + len(y_test))))
print("y_test ratio: {0:.2f}".format(len(y_test) / (len(y_train) + len(y_test))))
The output is:
x_train ratio: 0.70
x_test ratio: 0.30
y_train ratio: 0.70
y_test ratio: 0.30
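The last point in the list, repeating the random split and averaging, can be sketched roughly as below. This is only an illustration under some assumptions: it uses scikit-learn's `train_test_split` with `stratify` instead of the hand-written split, and the helper name `repeated_holdout`, the `LogisticRegression` classifier, and the choice of 10 repetitions are all picked for the example rather than taken from the original post.

```python
from statistics import mean, stdev

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def repeated_holdout(n_repeats=10, test_ratio=0.3):
    """Return the test accuracy of each of n_repeats stratified hold-out splits.

    Hypothetical helper for illustration; the classifier is an arbitrary choice.
    """
    iris = load_iris()
    scores = []
    for seed in range(n_repeats):
        # stratify=iris.target keeps the class proportions equal in both subsets;
        # a different random_state per repetition gives a different split.
        x_train, x_test, y_train, y_test = train_test_split(
            iris.data, iris.target,
            test_size=test_ratio, stratify=iris.target, random_state=seed)
        model = LogisticRegression(max_iter=1000)
        model.fit(x_train, y_train)
        scores.append(float(model.score(x_test, y_test)))
    return scores


scores = repeated_holdout()
# Individual scores vary from split to split; the mean is the more stable estimate.
print("accuracy per split:", ["{0:.3f}".format(s) for s in scores])
print("mean accuracy: {0:.3f} (std {1:.3f})".format(mean(scores), stdev(scores)))
```

The spread between the individual scores shows how much a single hold-out run depends on the particular split, which is why the average is reported.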