1: Loading the dataset
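from surprise import Dataset, Reader

def load_format2trainset():
    file_path = "F:\\ML\\recommendation_data\\music_playlist_farmat.txt"
    # Describe the file layout: comma-separated user, item, rating, timestamp
    reader = Reader(line_format='user item rating timestamp', sep=',')
    # Read the data from the file
    music_data = Dataset.load_from_file(file_path, reader=reader)
    print("Building the dataset...")
    # Use every rating for training (no train/test split)
    retrainset = music_data.build_full_trainset()
    return retrainset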
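To make the `line_format` and `sep` arguments concrete, here is a minimal pure-Python sketch of how one line of such a file maps onto named fields. It is not Surprise's own parser; the `Rating` tuple, the `parse_line` helper, and the sample line are hypothetical names and data used only for illustration.

```python
from typing import NamedTuple, Optional


class Rating(NamedTuple):
    user: str
    item: str
    rating: float
    timestamp: Optional[str]


def parse_line(line: str,
               line_format: str = "user item rating timestamp",
               sep: str = ",") -> Rating:
    """Map the separated fields of one line onto the names in line_format."""
    names = line_format.split()
    values = [v.strip() for v in line.strip().split(sep)]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}: {line!r}")
    fields = dict(zip(names, values))
    # The timestamp is optional: it is None when line_format does not name it
    return Rating(fields["user"], fields["item"],
                  float(fields["rating"]), fields.get("timestamp"))


# Hypothetical sample line: user id, song id, rating, timestamp
print(parse_line("u42,song_7,1.0,1300000"))
```

The order of the names in `line_format` is what decides which column is the user, the item, and the rating, so a file with a different column order only needs a different format string.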
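The main classes used are:
Reader --- parses a file that contains ratings
Dataset --- provides dataset operations. Its main loading methods are:
load_builtin('dataset name') # load a built-in dataset
load_from_df() # load data from a pandas DataFrame
load_from_file() # load the user's own data file
load_from_folds() # load several files that are already split into train/test folds, for example: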
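# Assumed setup (not in the original snippet): the MovieLens 100k folder and its built-in reader
import os
from surprise import Dataset, Reader
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')
reader = Reader('ml-100k')
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)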
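Operations on the dataset include:
build_full_trainset() # do not split the data; return the whole dataset as one trainset
split(n_folds=5, shuffle=True) # split the dataset into folds (older Surprise releases; newer ones use the iterators in surprise.model_selection, such as KFold)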
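What a 5-fold split with shuffling does can be sketched in pure Python as follows. This is a conceptual illustration under stated assumptions, not Surprise's implementation: the `k_fold_split` helper, its `seed` parameter, and the sample ratings are hypothetical.

```python
import random


def k_fold_split(ratings, n_folds=5, shuffle=True, seed=0):
    """Yield (train, test) pairs; each rating lands in exactly one test set."""
    if not 2 <= n_folds <= len(ratings):
        raise ValueError("n_folds must be between 2 and the number of ratings")
    indices = list(range(len(ratings)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        # Every n_folds-th shuffled index, starting at k, forms the k-th test fold
        test_idx = set(indices[k::n_folds])
        train = [ratings[i] for i in indices if i not in test_idx]
        test = [ratings[i] for i in indices if i in test_idx]
        yield train, test


# Hypothetical ratings: (user, item, rating)
ratings = [(f"u{i % 4}", f"song{i}", float(1 + i % 5)) for i in range(10)]
folds = list(k_fold_split(ratings, n_folds=5))
for train, test in folds:
    print(len(train), len(test))
```

Each fold trains on the ratings outside its test set, so every rating is used for testing once and for training in the other folds.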
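2: Choosing an algorithm. The surprise library includes two main families of algorithms: those based on collaborative filtering and those based on matrix factorization.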
random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item.
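The ideas behind these two predictors can be sketched in pure Python. This is a simplified illustration, not the library's code: the sample ratings and helper names are hypothetical, the biases here are plain average deviations (Surprise fits regularized baselines instead), and clipping to a 1-5 scale is an assumption of this sketch.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical training ratings: (user, item, rating) on a 1-5 scale
ratings = [("u1", "a", 5.0), ("u1", "b", 3.0), ("u2", "a", 4.0),
           ("u2", "c", 2.0), ("u3", "b", 1.0)]
values = [r for _, _, r in ratings]
mu = statistics.fmean(values)      # global mean rating
sigma = statistics.pstdev(values)  # population standard deviation


def normal_predict(rng, low=1.0, high=5.0):
    """Random rating drawn from N(mu, sigma^2), clipped to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))


def mean_offsets(key_index):
    """Average deviation from mu per user (key_index=0) or per item (key_index=1)."""
    grouped = defaultdict(list)
    for row in ratings:
        grouped[row[key_index]].append(row[2] - mu)
    return {k: statistics.fmean(v) for k, v in grouped.items()}


user_bias = mean_offsets(0)
item_bias = mean_offsets(1)


def baseline_predict(user, item):
    """Baseline estimate mu + b_u + b_i; unknown users or items contribute 0."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


print(normal_predict(random.Random(0)))
print(baseline_predict("u1", "c"))
```

The random predictor ignores who the user and item are, which is why it is mainly useful as a reference point; the baseline estimate adds per-user and per-item offsets on top of the global mean.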