The model_selection package of the Surprise library provides tools for cross-validating algorithms and for parameter selection.
1: Cross-validation iterators (similar to scikit-learn)
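KFold
A basic k-fold cross-validation iterator.
RepeatedKFold
A repeated k-fold cross-validation iterator.
ShuffleSplit
A basic cross-validation iterator with random trainsets and testsets.
LeaveOneOut
A cross-validation iterator where each user has exactly one rating in the testset.
PredefinedKFold
A cross-validation iterator for datasets that were loaded with the load_from_folds method.
The module also provides a train_test_split function for splitting a dataset into a trainset and a testset.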
- surprise.model_selection.split.KFold(n_splits=5, random_state=None, shuffle=True)
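The class has one method, split(data), a generator that yields tuples of (trainset, testset).
In each iteration one fold serves as the testset and the other k-1 folds are used for training.
Parameters: n_splits (int) – The number of folds.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG used to build the folds:
1: int, random_state is used as the seed of a new RNG. This guarantees that repeated calls to split() produce the same folds.
2: RandomState instance, this same instance is used as the RNG (Random Number Generator).
3: None, the current RNG from numpy is used.
Note: random_state is only used when shuffle=True. Default is None.
shuffle (bool) – Whether to shuffle the ratings before splitting. Shuffling is not done in place. Default is True.
from surprise import SVD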
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold
# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')
# define a cross-validation iterator
kf = KFold(n_splits=3)
algo = SVD()
for trainset, testset in kf.split(data):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
Output:
RMSE: 0.9374
RMSE: 0.9476
RMSE: 0.9478
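To make the roles of n_splits, shuffle and random_state concrete, here is a minimal pure-Python sketch of k-fold splitting. It is not Surprise's implementation: the helper kfold_indices and the toy ratings list are assumptions made for illustration, and it uses the standard random module rather than numpy's RNG.

```python
import random


def kfold_indices(n_items, n_splits=5, shuffle=True, random_state=None):
    """Yield (train_idx, test_idx) pairs; each fold is the testset exactly once."""
    if not 2 <= n_splits <= n_items:
        raise ValueError("n_splits must be between 2 and n_items")
    indices = list(range(n_items))
    if shuffle:
        # Shuffle a list of indices, so the data itself is never reordered.
        random.Random(random_state).shuffle(indices)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        size = n_items // n_splits + (1 if fold < n_items % n_splits else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


# Hypothetical (user, item, rating) tuples standing in for a real dataset.
ratings = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0),
           ("u2", "i3", 2.0), ("u3", "i2", 1.0), ("u3", "i3", 4.5)]

# The same integer seed gives the same folds on every call.
first = list(kfold_indices(len(ratings), n_splits=3, random_state=42))
second = list(kfold_indices(len(ratings), n_splits=3, random_state=42))
print(first == second)
for train_idx, test_idx in first:
    print(len(train_idx), "train /", len(test_idx), "test")
```

Every index lands in exactly one test fold, which is the property that distinguishes k-fold from the purely random splits described further down.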
- surprise.model_selection.split.LeaveOneOut(n_splits=5, random_state=None)
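A cross-validation iterator where each user has exactly one rating in the testset. Contrary to other cross-validation strategies, the random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets. The n_splits and random_state parameters behave as in KFold above.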
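The idea behind LeaveOneOut can be sketched in plain Python as follows. This is an illustration under assumptions (the function name leave_one_out and the toy ratings are hypothetical), not the library's code: for every user, one randomly chosen rating goes to the testset and the rest stay in the trainset.

```python
import random
from collections import defaultdict


def leave_one_out(ratings, random_state=None):
    """Split (user, item, rating) tuples so each user has exactly one test rating."""
    rng = random.Random(random_state)
    by_user = defaultdict(list)
    for rating in ratings:
        by_user[rating[0]].append(rating)

    trainset, testset = [], []
    for user_ratings in by_user.values():
        # Pick one rating per user for the testset, keep the others for training.
        held_out = rng.randrange(len(user_ratings))
        for position, rating in enumerate(user_ratings):
            (testset if position == held_out else trainset).append(rating)
    return trainset, testset


# Hypothetical toy data: three users with two ratings each.
ratings = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0),
           ("u2", "i3", 2.0), ("u3", "i2", 1.0), ("u3", "i3", 4.5)]

trainset, testset = leave_one_out(ratings, random_state=0)
print(testset)
```

Because each split draws its held-out ratings independently, two splits can end up picking the same rating for a user, which is why distinct folds are likely but not guaranteed.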
- surprise.model_selection.split.PredefinedKFold
import os
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold
# path to dataset folder
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')
# This time, we'll use the built-in reader.
reader = Reader('ml-100k')
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()
algo = SVD()
for trainset, testset in pkf.split(data):
    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)
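As a small aside on the folds_files list above, the same (train, test) path pairs can be built with os.path.join, which does not depend on files_dir ending in a slash. The sketch below only builds and prints the paths. It assumes the ml-100k layout shown above and does not check that the files exist.

```python
import os

# Assumed location of the built-in ml-100k files, as in the example above.
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k')

# One (train_file, test_file) pair per predefined fold: u1 ... u5.
folds_files = [
    (os.path.join(files_dir, 'u%d.base' % i),
     os.path.join(files_dir, 'u%d.test' % i))
    for i in range(1, 6)
]

for train_file, test_file in folds_files:
    print(os.path.basename(train_file), os.path.basename(test_file))
```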
- surprise.model_selection.split.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)
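A repeated k-fold cross-validator: k-fold is run n_repeats times, with a different random partition in each repetition.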
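A rough way to picture RepeatedKFold is a k-fold loop wrapped in an outer loop that reshuffles before each repetition. The sketch below is a stdlib-only illustration with a hypothetical helper name (repeated_kfold_indices). It is not taken from Surprise.

```python
import random


def repeated_kfold_indices(n_items, n_splits=5, n_repeats=10, random_state=None):
    """Yield n_splits * n_repeats (train_idx, test_idx) pairs."""
    rng = random.Random(random_state)
    for _ in range(n_repeats):
        # A fresh shuffle per repetition gives a different partition each time.
        indices = list(range(n_items))
        rng.shuffle(indices)
        for fold in range(n_splits):
            # Every n_splits-th shuffled index, starting at `fold`, forms the test fold.
            test_idx = indices[fold::n_splits]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


splits = list(repeated_kfold_indices(10, n_splits=5, n_repeats=2, random_state=1))
print(len(splits))
```

Within one repetition every index is tested exactly once, and across repetitions the partitions differ because the shuffle is redrawn.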
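- surprise.model_selection.split.ShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=None, shuffle=True)
A basic cross-validation iterator in which every split is an independent random division of the data into a trainset and a testset.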
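The difference from k-fold is that each split is drawn independently, so testsets may overlap between splits. A minimal stdlib-only sketch follows. The helper shuffle_split_indices is hypothetical, and rounding a fractional test_size up with math.ceil is an assumption of this sketch rather than a documented guarantee.

```python
import math
import random


def shuffle_split_indices(n_items, n_splits=5, test_size=0.2, random_state=None):
    """Yield (train_idx, test_idx) pairs drawn from independent random shuffles."""
    # A float test_size is read as a proportion, an int as an absolute count.
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * n_items)
    else:
        n_test = test_size
    rng = random.Random(random_state)
    for _ in range(n_splits):
        indices = list(range(n_items))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the testset, the rest the trainset.
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_split_indices(10, n_splits=3, test_size=0.2, random_state=7))
for train_idx, test_idx in splits:
    print(sorted(test_idx))
```

A single train/test split is conceptually the n_splits=1 case of the same procedure.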
- surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)
Splits a dataset into a single trainset and a single testset.
2: Cross-validation
- surprise.model_selection.validation.cross_validate(algo, data, measures=[u'rmse', u'mae'], cv=None, return_train_measures=False, n_jobs=-1, pre_dispatch=u'2*n_jobs', verbose=False)
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')
# We'll use the famous SVD algorithm.
algo = SVD()
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Output:
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).
                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE              0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE               0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time          6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time         0.26    0.26    0.25    0.15    0.13    0.21    0.06
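The Mean and Std columns can be reproduced from the per-fold values. The short check below uses the population standard deviation (dividing by n). That choice is inferred from the printed numbers, which it reproduces while the sample standard deviation does not. The output itself does not state it.

```python
import statistics

# Per-fold values copied from the output above.
rmse_per_fold = [0.9311, 0.9370, 0.9320, 0.9317, 0.9391]
mae_per_fold = [0.7350, 0.7375, 0.7341, 0.7342, 0.7375]

for name, values in (("RMSE", rmse_per_fold), ("MAE", mae_per_fold)):
    mean = statistics.mean(values)
    # pstdev is the population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    print("%-5s mean=%.4f std=%.4f" % (name, mean, std))
```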
Parameters: