Guess What You Like (猜你喜欢)
Reference: the write-up shared by the champion team "yes,boy!", including the competition source code.
Background
A competition from DataCastle: Guess What You Like.
Competition description:
Personalized recommendation has become an essential service on every major e-commerce site. Accurate recommendations not only lift merchants' sales but also give customers a fast, high-quality shopping experience. Recommender systems have matured to the point where many excellent algorithms exist, each contributing to e-commerce from a different angle. This time, we have prepared user rating data from a product website, recording the score that each customer on the site gave to some product at some moment, over a span of several years.
Quite a bit of research has shown that users browse similar products within a short time window, but their interests may drift slightly over time. In this practice competition, the goal is to train on time-stamped user rating behavior and accurately predict these users' ratings of other products.
Data:
In this competition we provide the ratings that roughly 160,000 users of a product website gave to products over four years; every rating record carries a timestamp (the concrete times are masked, but the ordering is preserved). Ratings run over five levels, with 1 the lowest and 5 the highest.
We extracted over 34 million rating records as the training set, in the file train.csv, with the field format:
uid,iid,score,time
user i,item a,score,relative time
user j,item b,score,relative time
Notes: 1) the first row is the header, uid,iid,score,time, whose four fields are the user id, item id, score, and relative time;
2) each following row is one user's rating of one item, with rows separated by newlines;
3) fields within a row are separated by commas.
We also extracted nearly 550,000 rating records as the test set. The scores are hidden; only the user-item pairs remain. The file is test.csv, with the field format:
uid,iid
user k,item c
user l,item d
The notes are the same as for the training set (train.csv).
Approach
This is a recommender-system problem. Recommender-system research splits into several directions, including:
- Top-N recommendation, which gives each user a personalized ranked list and is usually evaluated by precision;
- rating prediction, which is usually evaluated by RMSE or MAE.
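Both metrics are worth pinning down concretely; a minimal NumPy sketch (the `y_true`/`y_pred` values are made up for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error: squares errors before averaging
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    # mean absolute error: every unit of error counts equally
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = np.array([5, 3, 4, 1])
y_pred = np.array([4, 3, 5, 2])
print(rmse(y_true, y_pred))  # sqrt(3/4) ≈ 0.866
print(mae(y_true, y_pred))   # 3/4 = 0.75
```

Because RMSE squares the errors, it punishes a single badly wrong prediction harder than MAE does.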
Three kinds of models are used, from simple to complex:
- Clustering-based recommendation
  - global mean
  - item mean
  - user mean
  - item mean within user clusters
  - user mean within item clusters
  - user activity
  - item activity
  - improved user activity
  - improved item activity
- Collaborative-filtering-based recommendation
- Model-learning-based recommendation
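The first three baselines in the list reduce to one-line groupby operations; a toy sketch (the DataFrame values are invented, only the `uid,iid,score` columns follow the competition format):

```python
import pandas as pd

train = pd.DataFrame({
    'uid':   [1, 1, 2, 2, 3],
    'iid':   [10, 20, 10, 30, 20],
    'score': [5, 3, 4, 2, 1],
})

global_mean = train['score'].mean()               # one number for every prediction
item_mean = train.groupby('iid')['score'].mean()  # per-item average score
user_mean = train.groupby('uid')['score'].mean()  # per-user average score

# predict a (uid, iid) pair with the item mean, falling back to the global mean
def predict_item_mean(uid, iid):
    return item_mean.get(iid, global_mean)

print(global_mean)               # 3.0
print(predict_item_mean(1, 10))  # (5+4)/2 = 4.5
print(predict_item_mean(1, 99))  # unseen item -> global mean 3.0
```

The fancier variants in the list refine the grouping key, not the mechanics.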
Using the models
Method 1
The first family of models clusters users and items, then predicts a user's rating of an item as the mean rating that same-cluster users gave to same-cluster items.
# coding=utf-8
import pandas as pd
import numpy as np
import os


def load_train_date(path):
    data = pd.read_csv(path)
    return data


# columns: uid,iid,score,time
train_path = os.path.join('data/RSdata', 'train.csv')
X_train = load_train_date(train_path)
X_train_uid = np.array(X_train['uid'])
X_train_iid = np.array(X_train['iid'])
Y_train_score = np.array(X_train['score']).astype("float32")

test_path = os.path.join('data/RSdata', 'test.csv')
X_test = load_train_date(test_path)
X_test_uid = np.array(X_test['uid'])
X_test_iid = np.array(X_test['iid'])

# bucket each user by their mean score: int(2 * mean) puts users into half-point bins
rate_rank = X_train.groupby('uid')['score'].mean()
rate_rank = pd.DataFrame(np.int32((rate_rank * 2).values), index=rate_rank.index, columns=['group'])
rate_rank_des = rate_rank.reset_index()

# attach each user's group label to the train and test records
train_plus = pd.merge(X_train, rate_rank_des, how='left', on='uid')
test_plus = pd.merge(X_test, rate_rank_des, how='left', on='uid')

# predicted score = mean score that users in the same group gave this item
res = train_plus.groupby(['iid', 'group']).mean().reset_index().loc[:, ['iid', 'group', 'score']]
result = pd.merge(test_plus, res, how='left', on=['iid', 'group']).fillna(3.0)  # 3.0 for unseen (iid, group) pairs
result.to_csv('score_01.csv', index=False, columns=['score'])
print('over...')
Method 2
The second method is collaborative filtering, which comes in user-based and item-based variants; item-based collaborative filtering is used here. Its core idea is that, to predict a user's rating of an item, we look at the items most similar to it among those the user has already rated, so the choice of similarity measure matters most. Options include Euclidean distance, Pearson correlation, cosine similarity, and adjusted cosine similarity.
Code:
# coding=utf-8
import pandas as pd
import numpy as np
import os
import pickle

"""
By data usage:
    collaborative filtering: UserCF, ItemCF, ModelCF
    content-based recommendation: user content attributes and item content attributes
    social filtering: based on the user's social network
By model:
    nearest-neighbour models: distance-based collaborative filtering
    Latent Factor Model (SVD): matrix-factorization models
    Graph: graph models, social network graph models
Computation outline:
    (1) build the item-item similarity matrix
    (2) build the user-item rating matrix
    (3) combine the two matrices to produce recommendations
"""
def load_train_date(path):
    data = pd.read_csv(path)
    return data


# columns: uid,iid,score,time
train_path = os.path.join('data/RSdata', 'train.csv')
X_train = load_train_date(train_path)  # append [:500] to subsample for a quick run
X_train_uid = np.array(X_train['uid'])
X_train_iid = np.array(X_train['iid'])
Y_train_score = np.array(X_train['score']).astype("float32")

test_path = os.path.join('data/RSdata', 'test.csv')
X_test = load_train_date(test_path)  # append [:50] to subsample for a quick run
X_test_uid = np.array(X_test['uid'])
X_test_iid = np.array(X_test['iid'])
def generate_user_item_matrix(train):
    """
    Build the user-item rating matrix (users as rows, items as columns).
    """
    path = 'data/RSdata/0203/user_item_matrix.pkl'
    if os.path.exists(path):
        uid_iid_mat = pickle.load(open(path, "rb"))
    else:
        users = train.uid.unique()
        products = train.iid.unique()
        uid_iid_mat = np.zeros((users.shape[0], products.shape[0]), dtype=np.int8)
        uid_iid_mat = pd.DataFrame(uid_iid_mat, index=users, columns=products)
        train = train.drop_duplicates()
        for index, row in train.iterrows():  # fill in each observed rating
            uid_iid_mat.loc[row['uid'], row['iid']] = row['score']
        pickle.dump(uid_iid_mat, open(path, 'wb'), protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol keeps the file small
    return uid_iid_mat
def cosine_sim(rate_mat, i, j):
    # plain cosine similarity between the rating columns of items i and j
    a = rate_mat[:, i]
    b = rate_mat[:, j]
    m = np.dot(a, b)
    n = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return m / float(n)


def cosine_sim_s(rate_mat, i, j):
    # adjusted cosine similarity: centre each item's ratings by its own mean
    a = rate_mat[:, i]
    b = rate_mat[:, j]
    intersection = a * b  # non-zero only where the same user rated both items
    if intersection[intersection != 0].size == 0:
        return 0.0
    c = a[a != 0]  # all ratings of item i
    d = b[b != 0]
    p = np.mean(c)  # mean rating of item i over all its raters
    q = np.mean(d)
    m = np.dot(a[intersection != 0] - p, b[intersection != 0] - q)
    n = np.sqrt(np.dot(c - p, c - p) * np.dot(d - q, d - q))
    if n == 0:
        return 0.0
    return m / float(n)


def pearson(rate_mat, i, j):
    # Pearson correlation restricted to the users who rated both items
    a = rate_mat[:, i]
    b = rate_mat[:, j]
    intersection = a * b
    if intersection[intersection != 0].size == 0:
        return 0.0
    c = a[intersection != 0]  # common raters' scores for item i
    d = b[intersection != 0]
    p = np.mean(a[a != 0])  # mean rating of item i over all its raters
    q = np.mean(b[b != 0])
    m = np.dot(c - p, d - q)
    n = np.sqrt(np.dot(c - p, c - p) * np.dot(d - q, d - q))
    if n == 0:
        return 0.0
    return m / float(n)
def get_rate_cos(rate_mat, n_iid, function):
    # item-item similarity matrix; `function` is the name of one of the
    # similarity functions above, e.g. 'cosine_sim'
    path = 'data/RSdata/0203/%s.pkl' % (function)
    if os.path.exists(path):
        rate_cos = pickle.load(open(path, "rb"))
    else:
        shapes = [n_iid, n_iid]
        rate_cos = np.zeros(shapes)
        sim_func = globals()[function]  # look the similarity function up by name
        for i in range(shapes[0]):
            for j in range(shapes[1]):
                if i == j:
                    rate_cos[i, j] = 1
                elif rate_cos[j, i] != 0:  # the matrix is symmetric, reuse the value
                    rate_cos[i, j] = rate_cos[j, i]
                else:
                    rate_cos[i, j] = sim_func(np.array(rate_mat), i, j)
        iid_index = rate_mat.columns
        rate_cos = pd.DataFrame(rate_cos, index=iid_index, columns=iid_index)
        pickle.dump(rate_cos, open(path, 'wb'), protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol keeps the file small
    return rate_cos
def recommendation_s(uid, iid, iid_iid_sim, rate_mat, k=10):
    score = 0
    weight = 0
    iid_sim = iid_iid_sim.loc[iid, :].values  # similarity of item iid to every item
    uid_action = rate_mat.loc[uid, :].values  # user uid's ratings of every item
    iid_action = rate_mat.loc[:, iid].values  # every user's rating of item iid
    sim_indexs = np.argsort(iid_sim)[-(k + 1):-1]  # the k most similar items, excluding iid itself
    iid_i_mean = np.sum(iid_action) / iid_action[iid_action != 0].size
    for j in sim_indexs:
        if uid_action[j] != 0:  # only use neighbours the user has actually rated
            iid_j_action = rate_mat.values[:, j]
            iid_j_mean = np.sum(iid_j_action) / iid_j_action[iid_j_action != 0].size
            score += iid_sim[j] * (uid_action[j] - iid_j_mean)
            weight += abs(iid_sim[j])
    if weight == 0:
        return iid_i_mean
    else:
        return iid_i_mean + score / float(weight)


def pred(num, k, iid_index, iid_iid_sim, rate_mat):
    result = np.zeros(num[0])
    count = 0  # how many test items never appear in the training set
    for i in range(num[0]):
        a = X_test.loc[i, 'uid']
        b = X_test.loc[i, 'iid']
        if b not in iid_index:
            result[i] = 3  # fall back to the scale midpoint for unseen items
            count = count + 1
        else:
            result[i] = recommendation_s(a, b, iid_iid_sim, rate_mat, k)
    return result
if __name__ == '__main__':
    # ids become DataFrame labels, so convert them to strings in both train
    # and test (otherwise the test lookups miss the string-labelled index)
    X_train['iid'] = X_train['iid'].apply(str)
    X_train['uid'] = X_train['uid'].apply(str)
    X_test['iid'] = X_test['iid'].apply(str)
    X_test['uid'] = X_test['uid'].apply(str)
    os.makedirs('data/RSdata/0203', exist_ok=True)  # cache directory for the pickles
    rate_mat = generate_user_item_matrix(X_train).fillna(0)
    n_iid = rate_mat.shape[1]
    rate_cos = get_rate_cos(rate_mat, n_iid, 'cosine_sim').fillna(0)
    iid_index = rate_mat.columns
    # run the predictions
    num = X_test.shape
    result = pred(num, 5, iid_index, rate_cos, rate_mat)
    Y_test_score = pd.DataFrame(np.array(result), columns=['score'])
    # write the predicted scores to score_0203.csv
    Y_test_score.to_csv('score_0203.csv', index=False, columns=['score'])
Method 3
The third family is matrix factorization; common methods include:
- SVD
- NMF
- RSVD
- SVD++
- SVDFeature
- libMF
- libFM
What these share is factorizing the user-item rating matrix into several smaller matrices whose product approximates the original matrix; since the product is dense, it also fills in the entries that were empty in the original, which is the prediction. The important parameters include the number of latent factors, the learning rate for stochastic gradient descent, the regularization strength, and the total number of iterations; their optimal values differ from method to method.
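The shared idea can be sketched with a bare-bones SGD factorizer on a toy matrix (an illustrative sketch, not any of the libraries above; the latent dimension, learning rate, and regularization values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy rating matrix: 0 marks a missing entry to be predicted
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5]], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2
P = 0.1 * rng.standard_normal((n_users, k))  # user latent factors
Q = 0.1 * rng.standard_normal((n_items, k))  # item latent factors
lr, reg = 0.05, 0.02

for epoch in range(200):
    for u, i in zip(*R.nonzero()):
        err = R[u, i] - P[u] @ Q[i]              # error on an observed rating
        P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step with L2 regularization
        Q[i] += lr * (err * P[u] - reg * Q[i])

pred = P @ Q.T  # dense reconstruction: the empty cells are now filled in
train_rmse = np.sqrt(np.mean((pred[R > 0] - R[R > 0]) ** 2))
print(round(train_rmse, 3))
```

Each observed rating nudges one row of P and one row of Q; the regularization term keeps the factors small, and the product P Q^T supplies predictions for the missing cells.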
SVDFeature and libFM:
SVDFeature is a feature-based collaborative filtering and ranking tool developed by Tianqi Chen's Apex Lab at Shanghai Jiao Tong University, the same group behind the famous xgboost. It makes methods such as SVD and SVD++ easy to implement. Using it involves the following steps:
- Data preprocessing: the user and item ids are not contiguous, so they must be remapped to consecutive values running from 1 to the number of users/items.
- Format conversion: the data must be converted to the format the model expects.
- For storage space and speed, it is best to further convert the data to binary form.
- Set the various parameters.
- Predict.
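The first preprocessing step, remapping sparse ids onto a contiguous 1..N range, can be done with `pd.factorize` (a sketch on invented ids; the column names follow the competition files):

```python
import pandas as pd

# toy ratings with non-contiguous ids, as in the raw train.csv
df = pd.DataFrame({'uid': [105, 99, 105, 7],
                   'iid': [2048, 17, 17, 2048],
                   'score': [5, 3, 4, 2]})

# factorize assigns 0..N-1 in order of first appearance; +1 shifts it to 1..N
df['uid_new'], uid_index = pd.factorize(df['uid'])
df['iid_new'], iid_index = pd.factorize(df['iid'])
df['uid_new'] += 1
df['iid_new'] += 1

print(df[['uid_new', 'iid_new']].values.tolist())
# uid 105 -> 1, 99 -> 2, 7 -> 3; iid 2048 -> 1, 17 -> 2
```

Keeping `uid_index`/`iid_index` lets you map predictions back to the original ids afterwards.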
With the main parameters set as follows, the online score reaches 7.86:
base_score = 3 (global bias)
learning_rate = 0.005 (learning rate)
wd_item, wd_user = 0.004 (regularization)
num_factor = 4000 (number of latent factors)
LibFM is a dedicated factorization tool; notably it implements the MCMC (Markov Chain Monte Carlo) optimizer, which is more accurate than the usual SGD but somewhat slower. LibFM also implements SGD, SGDA (adaptive SGD), and ALS (alternating least squares).
It likewise offers many parameters and options, such as -dim (dimensionality), -iter (number of iterations), -learning_rate, -method (optimization method), -task (task type), -validation (validation set), and -regular (regularization).
Data preparation is similar to the above. With the main parameters below, this method's best online result is 7.88:
-iter 100 -dim '1,1,64' -method mcmc -task r
Beyond these, the libMF model also performs well, reaching 7.85 after tuning. SVD from scipy and NMF from sklearn do well on small datasets but fall short when the data gets very large, though that may also be down to my tuning and optimization.
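For completeness, the sklearn NMF route is only a few lines on a small dense matrix (a sketch with an invented toy matrix; note that NMF treats the zeros as real zero ratings rather than missing values, which is one reason it degrades on large sparse data):

```python
import numpy as np
from sklearn.decomposition import NMF

# toy user-item matrix; zeros are unobserved ratings
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5]], dtype=float)

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(R)   # user factors, shape (3, 2)
H = model.components_        # item factors, shape (2, 3)
pred = W @ H                 # reconstructed matrix "fills in" the zeros

print(pred.shape)  # (3, 3)
```

Both factors are constrained to be non-negative, so every reconstructed score is non-negative as well.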
Here the surprise library is used to call SVD++:
# coding=utf-8
from surprise import SVDpp
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
import pandas as pd
import os
import numpy as np


def load_train_date(path):
    data = pd.read_csv(path)
    return data


# load the competition data; the [:500] / [:50] slices subsample for a quick run
train_path = os.path.join('data/RSdata', 'train.csv')
X_train = load_train_date(train_path)[:500]
test_path = os.path.join('data/RSdata', 'test.csv')
X_test = load_train_date(test_path)[:50]

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(X_train[['uid', 'iid', 'score']], reader)

# evaluate SVD++ with 2-fold cross-validation
algo = SVDpp()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=2, verbose=True)

# refit on the full training set before predicting
trainset = data.build_full_trainset()
algo.fit(trainset)


def pred():
    num = X_test.shape
    result = np.zeros(num[0])
    for i in range(num[0]):
        a = X_test.loc[i, 'uid']
        b = X_test.loc[i, 'iid']
        result[i] = algo.predict(a, b).est
    return result


result = pred()
Y_test_score = pd.DataFrame(np.array(result), columns=['score'])
# write the predicted scores to score_0302.csv
Y_test_score.to_csv('score_0302.csv', index=False, columns=['score'])
Deep learning
A deep neural network makes for a quick experiment; the code below is adapted from examples found online:
import numpy as np
import logging
import os
import pandas as pd
from keras.layers import Input, Embedding, Dense, Flatten, Dropout, concatenate
from keras.models import Model

np.random.seed(2017)

# log progress to the console
logger = logging.getLogger(__name__)
console = logging.StreamHandler()
logger.addHandler(console)
logger.setLevel(logging.DEBUG)


def load_train_date(path):
    data = pd.read_csv(path)
    return data


# columns: uid,iid,score,time
train_path = os.path.join('data/RSdata', 'train.csv')
X_train = load_train_date(train_path)
logger.info(X_train.shape)
X_train_uid = np.array(X_train['uid'])
X_train_iid = np.array(X_train['iid'])
Y_train_score = np.array(X_train['score']).astype("float32")

test_path = os.path.join('data/RSdata', 'test.csv')
X_test = load_train_date(test_path)
logger.info(X_test.shape)
X_test_uid = np.array(X_test['uid'])
X_test_iid = np.array(X_test['iid'])
logger.info('load train and test data...')

# reshape the ids to column vectors and scale scores from [1, 5] to [0, 1]
# so the sigmoid output can match them
X_train_iid = X_train_iid.reshape(X_train_uid.shape[0], 1)
X_train_uid = X_train_uid.reshape(X_train_uid.shape[0], 1)
Y_train_score = (Y_train_score - 1) / 4
X_test_iid = X_test_iid.reshape(X_test_iid.shape[0], 1)
X_test_uid = X_test_uid.reshape(X_test_uid.shape[0], 1)

# define the model: embed uid and iid, concatenate, then an MLP head
input_1 = Input(shape=(1,), dtype='int32')
input_2 = Input(shape=(1,), dtype='int32')
# input_dim covers the uid/iid id ranges in the data: 223970 and 14726
x1 = Embedding(output_dim=128, input_dim=223970, input_length=1)(input_1)
x2 = Embedding(output_dim=128, input_dim=14726, input_length=1)(input_2)
x1 = Flatten()(x1)
x2 = Flatten()(x2)
x = concatenate([x1, x2])
x = Dropout(0.2)(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.2)(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[input_1, input_2], outputs=out)
model.compile(optimizer='rmsprop', loss='mean_squared_error', metrics=[])
model.fit([X_train_uid, X_train_iid], Y_train_score, epochs=3, batch_size=1024 * 6)

# predict and map the sigmoid output back to the 1-5 scale
Y_test_score = model.predict([X_test_uid, X_test_iid], batch_size=1024)
Y_test_score = Y_test_score * 4 + 1
Y_test_score = pd.DataFrame(Y_test_score.reshape(Y_test_score.shape[0], 1), columns=['score'])
# write the predicted scores to score.csv
Y_test_score.to_csv('score.csv', index=False, columns=['score'])